Course Description

Multimodal Language Understanding aims to use information from different sources, such as text, speech, images, and gestures, to enhance language processing tasks. Since we naturally use multiple forms of communication in our daily interactions, enabling machines to do the same improves their understanding of human communication. For example, sentiment analysis can be improved by incorporating tone of voice or facial expressions alongside text. In this class, we will explore techniques for modeling multiple modalities, identify tasks that benefit from multimodal input, and discuss the challenges that arise when handling multiple modalities.

Prerequisites

This course will involve reading, writing, and discussion and is intended for students from Computer Science, Linguistics, and related areas. Background knowledge in AI is required, including having taken introductory courses in AI, ML, or NLP.

Feel free to email [email protected] if you have any questions.

Time and Location

Building C7 3, Seminar room 1.14, Mondays 8:30-10:00. The first class is on 20th April.

Sign up

  1. Please apply via the SIC seminar website if possible (mainly for CS students).
  2. LST students, please send an email indicating your interest to [email protected].

Course Format

Summer 2026 Schedule

| Tentative date | Topics/Agenda of Discussion | Tentative Papers |
| --- | --- | --- |
| 20/04 | Introduction Class | |
| 27/04 | PREPARATION (No seminar) | Send your top three topic preferences via email by 24/04; this counts as final registration. There are 16 slots in total. Applicants via the SIC seminar system should also send this. You will be notified via email of your assigned topic and papers by 27/04. |
| 04/05-11/05 | PREPARATION (No seminar) | Mandatory feedback meetings for presenters (until 08/06) to be scheduled during this week |
| 18/05 | The Bag-of-Words Problem in Vision | 1. When and why vision-language models behave like bags-of-words, and what to do about it? 2. CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally |
| 25/05 | WHIT MONDAY | |
| 01/06 | Perception vs Understanding Gap | 1. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality 2. SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality |
| 08/06 | Perception vs Understanding Gap | 1. Relational Visual Similarity 2. ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs |
| 15/06 | Poor grounding causes reasoning failures | 1. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity; Compositional Chain-of-Thought Prompting for Large Multimodal Models 2. Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? |
| 22/06 | Models guess and take shortcuts | 1. MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models? 2. Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts |
| 29/06 | Intervening | 1. Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention 2. From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning |
| 06/07 | No seminar on this day | |
| 13/07 | Fixing the Root: Representation Learning | 1. VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language 2. Concept Bottleneck Model |

Grading breakdown

4 Credits
• 10% - Attendance and participation: attendance at all talks and active participation in class
• 20% - Weekly Questions and Moderation: send questions via email; moderators receive the submitted questions, collate them, and chair the discussion by bringing the main questions to the presenter
• 70% - Paper Presentation

Paper Presentation: Lead the discussion on the assigned paper.

7 Credits
• 10% - Attendance and participation (as above)
• 20% - Weekly Questions and Moderation: send via email (as above)
• 30% - Paper Presentation (as above)
• 40% - Hands-on Implementation and Writeup (6-8 pages)

Hands-on Implementation and Writeup: