Course Description

Multimodal Language Understanding aims to use information from different sources, such as text, speech, images, and gestures, to enhance language processing tasks. Since we naturally combine multiple forms of communication in our daily interactions, enabling machines to do the same improves their understanding of human communication. For example, sentiment analysis can be improved by incorporating tone of voice or facial expressions alongside text. In this class, we will explore techniques for modeling multiple modalities, identify tasks that benefit from multimodal input, and discuss the challenges that arise when handling multiple modalities.

Prerequisites

This course involves reading, writing, and discussion and is intended for students from Computer Science, Linguistics, and related areas. Background knowledge in AI is required; students should have taken an introductory course in AI, ML, or NLP.

Feel free to email [email protected] if you have any questions.

Course Format

1st meeting: Introduction + Paper assignment for further meetings

2nd meeting: Paper discussion, organisational discussion

Further meetings: Discussion of two papers presented by students (20 min presentation + 10-15 min discussion)

List of papers (still in progress):

Language + Gestures (~10 papers)

  1. Gesture Synthesis
    1. MotionGPT: Human Motion as a Foreign Language
    2. ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
    3. Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
  2. Language Modelling
    1. Spontaneous gestures encoded by hand positions can improve language models: An Information-Theoretic motivated study
    2. Towards Understanding the Relation between Gestures and Language
    3. Using Language-Aligned Gesture Embeddings for Understanding Gestures Accompanying Math Terms

Language + Speech (~8 papers)

  1. Textually Pretrained Speech Language Models
  2. DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
  3. Toward Joint Language Modeling for Speech Units and Text

Language + Image (~6 papers)

  1. Learning Transferable Visual Models From Natural Language Supervision
  2. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
  3. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks