Welcome to the ELLIS Multimodal Learning Systems Workshop on:

Multimodal Foundation Models 

Organizers: Cees Snoek  (University of Amsterdam) & Nicu Sebe (University of Trento)

January 17-19, 2024
Mathematisches Forschungsinstitut Oberwolfach (MFO) - Oberwolfach-Walke - Germany
Attendance by invitation only.


Multimodal foundation models are a revolutionary class of AI models with impressive abilities to generate content (text, images, sound, video, protein structures, and more) through interactive prompting, in a seemingly creative manner. These foundation models are typically autoregressive, self-supervised, transformer-based models pre-trained on large volumes of data, usually collected from the web. They already form the basis of state-of-the-art systems in computer vision and natural language processing across a wide range of tasks and have shown impressive few-shot learning abilities. The perceived intelligence and adaptability of models such as ChatGPT, Stable Diffusion, Gemini, and GPT-4 impress, but their aptitude for producing inaccurate, misleading, or false information (and presenting it confidently and convincingly) makes them unsuitable for tasks of importance and poses serious societal concerns. In this workshop we present recent advances in multimodal foundation models from academia and industry and discuss their impact and implications moving forward.


The workshop is hosted at the Mathematisches Forschungsinstitut Oberwolfach (MFO). Accommodation and meals are provided at Hotel Hirschen, which is within walking distance of the MFO.

Mathematisches Forschungsinstitut Oberwolfach (MFO)
Schwarzwaldstraße 9-11
77709 Oberwolfach-Walke, Germany
Phone: +49 (0) 7834 979-0
Email: admin@mfo.de

Hotel Hirschen
Schwarzwaldstraße 2-3
77709 Oberwolfach-Walke, Germany
Phone: +49 (0) 7834 837-0
Email: info@hotel-hirschen-oberwolfach.de


Wednesday January 17

15:00 - 19:00 Arrival of attendees, time for socializing and discussion
19:00 - 22:00 Opening dinner at Hotel Hirschen

Thursday January 18

08:00 - 09:30 Breakfast at Hotel Hirschen
09:30 - 11:00  Morning Session I: Foundation Models (Chair: Yiannis Kompatsiaris)
FoMO without FOMO by Karteek Alahari (Inria)    
Towards 3D Human Foundation Models by Cristian Sminchisescu (Google)
What multimodal foundation models cannot perceive by Cees Snoek (University of Amsterdam)
Are Foundation Models the tool for Social Embodied AI? by Xavier Alameda-Pineda (Inria)
11:00 - 11:30  Coffee break
11:30 - 12:30  Morning Session II: Vision & Language (Chair: Dima Damen)
Vocabulary-free Image Classification by Elisa Ricci (University of Trento)
Coreference resolution in narrated images by Hakan Bilen (The University of Edinburgh)
Vision-Language Self-Supervised Learning by Shaogang Gong (Queen Mary, University of London)
12:30 - 14:00 Lunch break at Hotel Hirschen
14:00 - 15:30  Afternoon Session I: Generative AI  (Chair: Karteek Alahari)
Images & text: alignment, generation and compression by Jakob Verbeek (Meta)
Measuring the Quality of Generative Neural Networks - An Unsolved Problem by Juergen Gall (University of Bonn)
Controllable generation for Analysis and Synthesis by Ioannis Patras (Queen Mary, University of London)
Improving Fairness using Vision-Language Driven Image Augmentation by Nicu Sebe (University of Trento)
15:30 - 16:00  Coffee break
16:00 - 17:00  Afternoon Session II: Multimodality (Chair: Xavier Alameda-Pineda)
Multi-modality in Egocentric Vision - Contradictory and complementary signals by Dima Damen (University of Bristol)
Multimodal LLMs for Document Understanding by Dimosthenis Karatzas (Universitat Autònoma de Barcelona)
Large Multimodal Models for Media and Journalism by Yiannis Kompatsiaris (Information Technologies Institute, CERTH)
17:00 - 18:30  Discussion Session (Chairs: Cees Snoek & Nicu Sebe)
19:00 - 22:00 Dinner at Hotel Hirschen

Friday January 19

08:00 - 10:00 Breakfast at Hotel Hirschen
10:00 onwards Departure