Friday, April 8, 2022 12:30pm to 1:00pm
About this Event
Abstract: We tackle the problem of learning object detectors in a noisy environment, which is one of the significant challenges for weakly-supervised learning. We use multimodal learning to help localize objects of interest, but unlike other methods, we treat audio as an auxiliary modality that helps suppress noise in detection from visual regions. First, we use the audio model to generate new "ground-truth" labels for the training set to remove noise between the visual features and noisy supervision. Second, we propose an "indirect path" between audio and class predictions, which combines the link between visual and audio regions with the link between visual features and predictions, improving object classification. Third, we propose a sound-based "attention path" that exploits complementary audio cues to identify important visual regions, boosting object classification and detection performance. Our framework uses contrastive learning that performs region-based audio-visual instance discrimination, incorporating information from both audio and video frames and capturing relationships between audio and visual regions. We show that our methods, which use sound to update noisy ground-truth and to provide an indirect path and an attention path, greatly boost performance on the AudioSet and VGGSound datasets compared to single-modality predictions, even when contrastive learning is used.
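To make the contrastive component concrete, here is a minimal sketch of a region-based audio-visual instance-discrimination loss in the InfoNCE style, where matched visual-region and audio embeddings form positive pairs and all other pairings in the batch serve as negatives. This is an illustrative approximation, not the speakers' exact formulation; the function name `audio_visual_infonce` and the temperature value are assumptions for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def audio_visual_infonce(visual, audio, temperature=0.1):
    """InfoNCE-style contrastive loss over paired region embeddings.

    visual: (N, D) visual-region embeddings.
    audio:  (N, D) audio embeddings; row i is the positive match for visual row i,
            and every other row in the batch acts as a negative.
    """
    v = l2_normalize(visual)
    a = l2_normalize(audio)
    logits = v @ a.T / temperature                    # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives lie on the diagonal
```

Under this objective, correctly matched audio-visual pairs receive a low loss, while shuffled (mismatched) pairs receive a high one, which is the signal that lets the model associate sounding regions with their audio.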
Bio: Cagri Gungor is a PhD student in the Intelligent Systems Program. His research interests include computer vision, machine learning, object detection, multimodality, and weak supervision.
RSVP for the Zoom meeting information: https://pitt.co1.qualtrics.com/jfe/form/SV_7WLh7jAwMyaBuGW
Please let us know if you require an accommodation in order to participate in this event. Accommodations may include live captioning, ASL interpreters, and/or captioned media and accessible documents from recorded events. Requesting accommodations at least 5 days in advance is recommended.