Are you interested in performing high-impact artificial intelligence research on embodied foundation models that will enable an autonomous robot to operate in an open world? 

Progress in multimodal foundation models has been astonishing in the past few years and allows us to equip robots with world knowledge of scenes, objects, and human activities. Robots should then be able to perceive and act upon the sensed world, yet current solutions require data diversity, task circumstances, and the label vocabulary to all be pre-defined, stationary, and controlled. As soon as these ‘closed world’ deep learning assumptions are broken, perceptual understanding suffers, often catastrophically. Hence, robots equipped with state-of-the-art multimodal perceptual skills will experience great difficulty generalizing to perception tasks in an open world, where sensory and semantic conditions differ considerably from those perceived during training.

Our key research question is: How can we enable multimodal perception for robots that is robust to sensory and semantic shifts between training and operating conditions?

You will carry out research and development in the areas of embodied foundation models, deep machine learning, and computer vision. Topics of interest are test-time generalization, embodied grounding, data scarcity, and uncertainty modeling. The research is embedded in the VIS lab group at the University of Amsterdam, and you will actively collaborate within the OpenBots lab, which comprises a team of five PhD students: two at the University of Amsterdam (this vacancy) and three at Delft University of Technology (focusing on planning and control). You will work three days a week at the University of Amsterdam; the other two days you will work with the other four PhD students at TNO and the Royal Netherlands Marechaussee, where a physical lab environment with several land robots is available. The project is carried out with supervisors from the Video and Image Sense Lab (Amsterdam) and the Cognitive Robotics Group (Delft). Students at the UvA will be supervised by prof. dr. Cees Snoek and dr. ir. Gertjan Burghouts (TNO).

For more details on the vacancies, please check: https://vacatures.uva.nl/UvA/job/Two-PhD-Positions-on-Embodied-Foundation-Models/797275802/

Foundation models, and their origin, analysis, and development, have typically been associated with the US and Big Tech. Yet a critical share of important insights and novel approaches comes from Europe, both within academia and industry. Part of this winter school’s goal is to highlight these fresh perspectives, give students an in-depth look into how Europe is shaping its own research agenda with unique directions, and bring the community together. The winter school will take place at the University of Amsterdam. For the full program see: https://amsterdam-fomo.github.io.

The NeurIPS 2023 camera-ready paper Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery by Sarah Rastegar, Hazel Doughty, Cees G M Snoek is now available. In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum-length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality.
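
As a toy illustration of the notion of hierarchical category codes, the sketch below assigns each instance a short binary code, where shorter code prefixes correspond to coarser groupings. It only illustrates the idea of code-based categories; the random projections are an assumption for illustration and the paper's optimization-based self-coding objective is not reproduced here.

```python
# Toy illustration of hierarchical category codes: each instance gets a binary code,
# and shorter code prefixes correspond to coarser groupings. This only illustrates the
# notion of code-based categories, not the paper's self-coding objective.
import torch

def binary_codes(features, code_length=4, seed=0):
    # Random projection directions stand in for learned code dimensions (assumption).
    torch.manual_seed(seed)
    directions = torch.randn(features.shape[1], code_length)
    return (features @ directions > 0).int()               # (num_instances, code_length)

features = torch.randn(16, 128)                             # embeddings of test instances
codes = binary_codes(features)
coarse = [''.join(map(str, c[:2].tolist())) for c in codes]  # 2-bit prefix: coarse grouping
fine = [''.join(map(str, c.tolist())) for c in codes]        # full code: fine-grained grouping
```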

The NeurIPS 2023 camera-ready paper Learning Unseen Modality Interaction by Yunhua Zhang, Hazel Doughty, Cees G M Snoek is now available. Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality’s prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.
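
For readers who want a concrete feel for the core idea, here is a minimal sketch of modality-specific projections into a shared space followed by summation over whichever modalities happen to be available at inference. The layer sizes, modality names, and classifier head are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (dims, modality names, and classifier are illustrative assumptions).
import torch
import torch.nn as nn

class SharedSpaceFusion(nn.Module):
    """Project each modality into a common space and sum whatever is available."""
    def __init__(self, modality_dims, shared_dim=512, num_classes=10):
        super().__init__()
        # One projection head per modality (e.g. 'rgb', 'audio').
        self.projections = nn.ModuleDict({
            name: nn.Linear(dim, shared_dim) for name, dim in modality_dims.items()
        })
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, features):
        # features: dict of modality name -> (batch, dim) tensors;
        # any subset of the training modalities may be present at inference.
        projected = [self.projections[name](x) for name, x in features.items()]
        fused = torch.stack(projected, dim=0).sum(dim=0)    # simple summation across modalities
        return self.classifier(fused)

model = SharedSpaceFusion({'rgb': 2048, 'audio': 128})
logits = model({'rgb': torch.randn(4, 2048)})               # unseen combination: rgb only
```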

The NeurIPS 2023 camera-ready paper ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion by Yingjun Du, Zehao Xiao, Shengcai Liao, Cees G M Snoek is now available. Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling the overfitted prototypes for individual tasks to be obtained accurately. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. In addition, to expedite training and enhance ProtoDiff’s performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.
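
As a rough illustration of the overall flow, the toy sketch below computes vanilla prototypes as support-set means and then iteratively refines a noise-initialized prototype conditioned on them. The denoiser is a placeholder MLP and all dimensions are assumptions; the paper's task-guided diffusion model and residual prototype learning are not reproduced here.

```python
# Toy sketch of prototype-based few-shot classification with an iterative,
# noise-to-prototype refinement step; the denoiser is a placeholder MLP,
# not ProtoDiff's task-guided diffusion model.
import torch
import torch.nn as nn

def vanilla_prototypes(support_emb, support_lbl, num_classes):
    # Average support embeddings per class (the fragile estimate the paper improves on).
    return torch.stack([support_emb[support_lbl == c].mean(dim=0) for c in range(num_classes)])

class Denoiser(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, noisy_proto, condition):
        return self.net(torch.cat([noisy_proto, condition], dim=-1))

dim, num_classes = 64, 5
support_emb = torch.randn(25, dim)               # 5-way 5-shot support embeddings
support_lbl = torch.arange(num_classes).repeat_interleave(5)
condition = vanilla_prototypes(support_emb, support_lbl, num_classes)

denoiser = Denoiser(dim)
proto = torch.randn(num_classes, dim)            # start from random noise
for _ in range(10):                              # iterative refinement, conditioned on support
    proto = proto + denoiser(proto, condition)

query = torch.randn(8, dim)
logits = -torch.cdist(query, proto)              # nearest-prototype classification
```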

The ICCV 2023 paper Bayesian Prompt Learning for Image-Language Model Generalization by Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G M Snoek, Georgios Tzimiropoulos, Brais Martinez is now available. Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as a prior distribution, which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization to unseen prompts, even across different datasets and domains.
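
The sketch below illustrates the general recipe of treating learnable prompt tokens as a Gaussian variational posterior, sampled with the reparameterization trick and regularized by a KL term towards a standard normal prior. Token count, dimensions, the choice of prior, and the KL weight are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch: prompt tokens as a variational distribution (reparameterization
# trick + KL to a standard normal prior). Token count and dimensions are illustrative.
import torch
import torch.nn as nn

class VariationalPrompt(nn.Module):
    def __init__(self, num_tokens=4, dim=512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_tokens, dim))
        self.logvar = nn.Parameter(torch.zeros(num_tokens, dim))

    def sample(self):
        std = torch.exp(0.5 * self.logvar)
        return self.mu + std * torch.randn_like(std)    # reparameterized sample

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, I) ): the regularizer on the prompt space.
        return 0.5 * (self.mu.pow(2) + self.logvar.exp() - 1.0 - self.logvar).sum()

prompt = VariationalPrompt()
context = prompt.sample()               # prepend to class-name embeddings for the text encoder
task_loss = torch.tensor(0.0)           # placeholder for the downstream classification loss
loss = task_loss + 1e-3 * prompt.kl()   # variational objective: data term + KL regularizer
```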

The ICCV 2023 paper Detecting Objects with Graph Priors and Graph Refinement by Aritra Bhowmik, Martin R Oswald, Yu Wang, Nora Baka, Cees G M Snoek is now available. The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy-based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster R-CNN, as well as state-of-the-art methods modeling object interrelationships.
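
To give a flavor of how a co-occurrence prior can refine initial class predictions, here is a toy sketch that estimates the prior from image-level label sets and uses it to boost contextually plausible classes. The energy-based graph refinement of the paper is not shown, and all names and weights are assumptions.

```python
# Toy sketch: refine per-box class scores with a co-occurrence prior estimated from
# training annotations. The paper's energy-based graph refinement is not reproduced.
import torch

def cooccurrence_prior(image_labels, num_classes):
    # image_labels: list of sets of class ids present in each training image.
    cooc = torch.zeros(num_classes, num_classes)
    for labels in image_labels:
        for a in labels:
            for b in labels:
                if a != b:
                    cooc[a, b] += 1.0
    return cooc / cooc.sum(dim=1, keepdim=True).clamp(min=1.0)  # row-normalized prior

def refine_scores(box_logits, cooc, alpha=0.5):
    # box_logits: (num_boxes, num_classes) initial detector class logits.
    probs = box_logits.softmax(dim=-1)
    context = probs.mean(dim=0) @ cooc           # classes likely to co-occur in this image
    return box_logits + alpha * context          # boost contextually plausible classes

cooc = cooccurrence_prior([{0, 2}, {0, 1}, {1, 2}, {0, 2}], num_classes=3)
refined = refine_scores(torch.randn(5, 3), cooc)
```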

The ICCV 2023 paper Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization by Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek is now available. We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories, which we refer to as tubelets, to videos. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions.
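
The toy sketch below shows the basic idea of pasting the same synthetic motion trajectory, a tubelet, into two different videos so that the resulting pair shares local motion but not appearance. Shapes, the trajectory, and the patch source are illustrative assumptions, not the paper's pipeline.

```python
# Toy sketch: paste the same synthetic motion trajectory (a "tubelet") into two
# different videos, so the pair shares motion but not appearance.
import torch

def add_tubelet(video, patch, trajectory):
    # video: (T, C, H, W); patch: (C, h, w); trajectory: one (y, x) position per frame.
    out = video.clone()
    h, w = patch.shape[1:]
    for t, (y, x) in enumerate(trajectory):
        out[t, :, y:y + h, x:x + w] = patch      # overlay the patch at its position in frame t
    return out

T, C, H, W = 8, 3, 112, 112
patch = torch.rand(C, 16, 16)
trajectory = [(4 * t, 6 * t) for t in range(T)]  # e.g. a simple diagonal motion

clip_a = add_tubelet(torch.rand(T, C, H, W), patch, trajectory)
clip_b = add_tubelet(torch.rand(T, C, H, W), patch, trajectory)
# (clip_a, clip_b) form a positive pair: identical local motion, different appearance.
```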

The ICCV 2023 paper Self-Ordering Point Clouds by Pengwan Yang, Cees G M Snoek, and Yuki M Asano is now available. In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories.
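
As a simple illustration of point-wise ordering by a learned score, the sketch below scores each point with a small MLP and sorts the cloud by score. The hard argsort stands in for the paper's differentiable scoring-sorting, and the hierarchical contrastive self-supervision is not reproduced; all sizes are assumptions.

```python
# Toy sketch: order points in a cloud by a learned per-point score. The hard argsort is
# for illustration only; the paper's differentiable scoring-sorting and hierarchical
# contrastive self-supervision are not reproduced.
import torch
import torch.nn as nn

class PointScorer(nn.Module):
    def __init__(self, in_dim=3, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, points):
        # points: (num_points, 3) -> one scalar score per point.
        return self.mlp(points).squeeze(-1)

scorer = PointScorer()
cloud = torch.rand(1024, 3)
scores = scorer(cloud)
order = torch.argsort(scores, descending=True)    # point-wise ordering
subset = cloud[order[:128]]                       # representative 128-point subset
```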