Friday, April 11, 2025 12:30pm to 2:00pm
About this Event
210 South Bouquet Street, Pittsburgh, PA 15260
https://www.cs.pitt.edu/news/april-11-colloquium-towards-trustworthy-and-reliable-multimodal-ai
Abstract: Multimodal AI systems are transforming how users interact with digital content. However, deploying these models in high-stakes settings requires ensuring both the trustworthiness of the information they process and generate and the reliability of their learning and inference mechanisms. In this talk, I present approaches to addressing these challenges. First, I introduce methods for ensuring trustworthy information in multimodal settings. Making fine-grained predictions about cross-modal inconsistencies is difficult due to the lack of labeled data and the complex reasoning required to assess how specific claims relate to other modalities, documents, and background knowledge. I first present a method for detecting machine-generated misinformation by identifying subtle inconsistencies between text and images. I then introduce a fine-grained reasoning task that predicts how textual claims relate to visual evidence. Finally, I describe a method for fine-grained cross-document, cross-media logical entailment, which decomposes claims into sub-claims and evaluates each against diverse sources. These techniques enable models to verify assertions against multimodal evidence for many tasks, including improving trustworthiness, mitigating hallucinations, and performing fact-checking and disinformation detection.
Next, I address the reliability of multimodal models under adversarial conditions. Large-scale vision-language models trained on web-scale data remain vulnerable to adversarial poisoning. I first present a defense mechanism that protects against attacks on contrastive training by using fine-grained knowledge alignment to guide model attention. I then characterize the risks posed by the unique attack surface of the newest class of multimodal large language models through novel universal adversarial perturbations. Finally, I touch on our recent work on model immunization and new types of adversarial attacks on multimodal large language models.
Bio: I am an Assistant Professor in the Department of Computer Science at Virginia Tech. My research is at the intersection of computer vision, natural language processing, and multimedia. I am interested in many problems requiring reasoning across multimodal data, including cross-modal retrieval, information extraction, and knowledge representation. I am associated with the Sanghani Center for Artificial Intelligence and Data Analytics. Prior to joining Virginia Tech, I was a postdoctoral researcher at Columbia University working with Professor Shih-Fu Chang.
Website: https://people.cs.vt.edu/chris/
Please let us know if you require an accommodation to participate in this event. Accommodations may include live captioning, ASL interpreters, and/or captioned media and accessible documents from recorded events. Requesting accommodations at least 5 days in advance is recommended.