Course Description

Multimodal Language Understanding aims to use information from different sources, such as text, speech, images, and gestures, to enhance language processing tasks. Since we naturally use multiple forms of communication in our daily interactions, enabling machines to do the same improves their understanding of human communication. For example, sentiment analysis can be improved by incorporating tone of voice or facial expressions alongside text. In this class, we will explore techniques for modeling multiple modalities, identify tasks that benefit from multimodal input, and discuss the challenges that arise when handling multiple modalities.

Prerequisites

This course will involve reading, writing, and discussion and is intended for students from Computer Science, Linguistics, and related areas. Background knowledge in AI is required, including having taken introductory courses in AI, ML, or NLP.

Feel free to email [email protected] if you have any questions.

Time and Location

Building C7 3, Seminar room 1.14, Mondays 8:30-10:00. The first class is on 20th April.

Sign up

  1. Please apply via the SIC seminar website if possible (mainly for CS students).
  2. LST students, please send an email indicating your interest to [email protected].

Course Format

Summer 2026 Schedule

| Tentative date | Topics/Agenda of Discussion | Tentative Papers |
| --- | --- | --- |
| 20/04 | Introduction Class | |
| 27/04 | PREPARATION (No seminar) | Send your top three topic preferences via email by 24/04; this counts as final registration. There are 16 slots in total. Applicants via the SIC seminar system should also send this. You will be notified via email of your assigned topic and papers by 27/04. |
| 04/05-11/05 | PREPARATION (No seminar) | Mandatory feedback meetings for presenters (until 08/06) to be scheduled during this week |
| 18/05 | The Bag-of-Words Problem in Vision | 1. When and why vision-language models behave like bags-of-words, and what to do about it? 2. CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally |
| 25/05 | WHIT MONDAY | |
| 01/06 | Perception vs Understanding Gap | 1. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality 2. SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality |
| 08/06 | Perception vs Understanding Gap | 1. Relational Visual Similarity 2. ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs |
| 15/06 | Poor grounding causes reasoning failures | 1. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity; Compositional Chain-of-Thought Prompting for Large Multimodal Models 2. Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? |
| 22/06 | Models guess and take shortcuts | 1. MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models? 2. Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts |
| 29/06 | Intervening | 1. Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention 2. From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning |
| 06/07 | No seminar on this day | |
| 13/07 | Fixing the Root: Representation Learning | 1. VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language 2. Concept Bottleneck Model |

Grading breakdown

4 Credits
• 10% - Attendance and participation: attendance at all talks and active participation in class
• 20% - Weekly Questions and Moderation: send questions via email; moderators receive the submitted questions, collate them, and chair the discussion by bringing the main questions to the presenter
• 70% - Paper Presentation

Paper Presentation: Lead the discussion on the assigned paper.

7 Credits
• 10% - Attendance and participation (as above)
• 20% - Weekly Questions and Moderation: send via email (as above)
• 30% - Paper Presentation (as above)
• 40% - Hands-on Implementation and Writeup (6-8 pages)

Hands-on Implementation and Writeup: