Friday, December 2, 2022 1:00pm to 1:30pm
About this Event
Abstract: A common paradigm in deep learning applications for computer vision is self-supervised pretraining followed by supervised fine-tuning on a target task. In the self-supervision step, a model is trained in a supervised fashion, but the source of supervision is implicitly defined by the data itself. Image-caption alignment is often used as such a source of implicit supervision in multimodal pretraining, and grounding (i.e., matching word tokens with visual tokens) is one way to exploit it. We introduce a strategy to take advantage of an underexplored structure in image-caption datasets: the relationship between captions matched with different images but mentioning the same objects. Given an image-caption pair, we find an additional caption that mentions one of the objects the first caption mentions, and we impose a sparse grounding between the image and the second caption so that only a few word tokens are grounded in the image. Our goal is to learn a better feature representation for the objects mentioned by both captions, encouraging grounding between the additional caption and the image to focus on the common objects only. We report superior grounding performance when comparing our approach with a previously published pretraining strategy, and we show the benefit of our proposed double-caption grounding on two downstream detection tasks: supervised detection and open-vocabulary detection.
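To make the caption-pairing idea in the abstract concrete, here is a minimal sketch (not the speaker's actual implementation) of how one might find, for each image-caption pair, a second caption from a different image that mentions one of the same objects. The field names (`image_id`, `caption`, `objects`) and the helper `pair_captions` are hypothetical, assuming each caption comes annotated with the set of object names it mentions.

```python
# Minimal sketch of double-caption pairing, assuming a dataset of dicts with
# hypothetical keys 'image_id', 'caption', and 'objects' (the set of object
# names the caption mentions). Not the speaker's actual implementation.
from collections import defaultdict
import random

def pair_captions(dataset):
    # Index caption entries by each object they mention.
    by_object = defaultdict(list)
    for i, ex in enumerate(dataset):
        for obj in ex["objects"]:
            by_object[obj].append(i)

    pairs = []
    for i, ex in enumerate(dataset):
        # Candidate second captions: share an object, belong to another image.
        candidates = {
            j
            for obj in ex["objects"]
            for j in by_object[obj]
            if dataset[j]["image_id"] != ex["image_id"]
        }
        if candidates:
            j = random.choice(sorted(candidates))
            # The shared objects are the tokens that would be sparsely
            # grounded in the image; the rest of the second caption is not.
            shared = ex["objects"] & dataset[j]["objects"]
            pairs.append((i, j, shared))
    return pairs
```

Per the abstract, only the tokens corresponding to the shared objects would be grounded in the image during pretraining, which is what makes the second grounding sparse.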
Bio: Giacomo Nebbia is a PhD student in the Intelligent Systems Program. His research interests include AI applied to clinical practice and clinical decision support systems.
RSVP for Zoom Meeting Information: https://pitt.co1.qualtrics.com/jfe/form/SV_cHZ8SndLoF22hLw
Please let us know if you require an accommodation in order to participate in this event. Accommodations may include live captioning, ASL interpreters, and/or captioned media and accessible documents from recorded events. Requesting accommodations at least 5 days in advance is recommended.