Are you interested in performing high-impact interdisciplinary research in Artificial Intelligence and its alignment with humans and society? The University of Amsterdam has recently started a flagship project on Human-Aligned Video AI (HAVA). The HAVA Lab will address fundamental questions about what defines human alignment with video AI, how to make this computable, and what determines its societal acceptance. 

Video AI holds the promise to explore what is unreachable, monitor what is imperceptible, and protect what is most valuable. New species have become identifiable in our deep oceans, the visually impaired benefit from automated speech transcriptions of visual scenery, and elderly caregivers may be supported with an extra pair of eyes, to name just three of many application examples. This is no longer wishful thinking. Broad uptake of video AI for science, for business, and for wellbeing awaits on the horizon, thanks to a decade of phenomenal progress in deep learning.

However, the same video AI is also responsible for self-driving cars crashing into pedestrians, deep fakes that make us believe misinformation, and mass-surveillance systems that monitor our behaviour. The research community’s over-concentration on recognition accuracy has come at the expense of the human alignment needed for societal acceptance. The HAVA Lab is an interdisciplinary lab that will study how to make the much-needed digital transformation towards human-aligned video AI.

The HAVA Lab will host 7 PhD candidates working together with researchers from all 7 faculties of the university, from video AI and its alignment with human cognition, ethics, and law, to its embedding in medical domains, public safety, and business. The lab has 9 supervisors in total, spanning all 7 faculties of the university for maximum interdisciplinarity. Depending on the specific topic, the PhD students also have a strong link to the working environment and faculty of their respective supervisors. The HAVA Lab has been given a unique central location at the library, an ideal hub for interdisciplinary collaborations. The PI of the lab is prof. dr. Cees Snoek.

Five of the seven PhD positions have been filled; we are looking to fill the remaining two, with the following interdisciplinary focus:

  • One PhD Position on human-aligned video-AI for public safety, which will be supervised by prof. dr. Marie Rosenkrantz Lindegaard and prof. dr. Cees Snoek.
  • One PhD Position on human-aligned video-AI for surgical skills, which will be supervised by prof. dr. Marlies Schijven and prof. dr. Cees Snoek.

For more details on the vacancies, please see: https://vacatures.uva.nl/UvA/job/Two-PhD-Positions-on-Human-aligned-Video-AI/794802002/

Foundation models, their origin, analysis, and development have typically been associated with the US and Big Tech. Yet, a critical share of important insights and novel approaches comes from Europe, both within academia and industry. Part of this winter school’s goal is to highlight these fresh perspectives and give students an in-depth look into how Europe is guiding its own research agenda with unique directions and bringing the community together. The winter school will take place at the University of Amsterdam. For the full program see: https://amsterdam-fomo.github.io.

The NeurIPS 2023 camera-ready paper Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery by Sarah Rastegar, Hazel Doughty, and Cees G M Snoek is now available. In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality.
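To give a feel for the category-code idea, here is a minimal, purely illustrative sketch (not the paper’s method): instances receive short binary codes, and the code length controls how coarse or fine the discovered grouping is. The random-projection coding, feature dimensions, and data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))        # stand-in for test-time embeddings

# one fixed set of random hyperplanes; taking code prefixes yields a nested hierarchy
planes = rng.normal(size=(128, 8))
codes = (features @ planes > 0).astype(int)    # 8-bit binary code per instance

for n_bits in (2, 4, 8):                       # longer codes -> finer-grained categories
    groups = {tuple(c) for c in codes[:, :n_bits]}
    print(f"{n_bits}-bit codes -> {len(groups)} discovered groups")
```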

The NeurIPS 2023 camera-ready paper Learning Unseen Modality Interaction by Yunhua Zhang, Hazel Doughty, and Cees G M Snoek is now available. Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality’s prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.
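As a rough illustration of the core mechanism described above, the sketch below projects each modality into a shared space and fuses by summation over whichever modalities are present; it is not the authors’ implementation, and the module names, feature dimensions, and task head are assumptions.

```python
import torch
import torch.nn as nn

class SharedSpaceFusion(nn.Module):
    def __init__(self, modality_dims, shared_dim=256, num_classes=10):
        super().__init__()
        # one projection per modality into the common space
        self.project = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in modality_dims.items()}
        )
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, inputs):
        # inputs: dict of modality name -> feature tensor; any subset may be present
        fused = sum(self.project[name](feat) for name, feat in inputs.items())
        return self.classifier(fused)

model = SharedSpaceFusion({"video": 1024, "audio": 128, "flow": 512})
# train on e.g. video+audio, then test on an unseen combination such as audio+flow
logits = model({"audio": torch.randn(4, 128), "flow": torch.randn(4, 512)})
```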

The NeurIPS 2023 camera-ready paper ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion by Yingjun Du, Zehao Xiao, Shengcai Liao, and Cees G M Snoek is now available. Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling the overfitted prototypes for individual tasks to be obtained accurately. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff’s performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.
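For readers unfamiliar with the prototype baseline, the sketch below shows the vanilla prototypical-network step that ProtoDiff improves upon: a prototype is simply the average of support embeddings, and queries are assigned to the nearest prototype. The embedding dimension and episode sizes are toy assumptions; the task-guided diffusion itself is not shown.

```python
import torch

def prototypical_predict(support, support_labels, query, num_classes):
    """support: [N, D] embeddings, support_labels: [N], query: [M, D]."""
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(num_classes)]
    )                                              # [num_classes, D] class averages
    distances = torch.cdist(query, prototypes)     # [M, num_classes] pairwise distances
    return (-distances).argmax(dim=1)              # assign each query to nearest prototype

support = torch.randn(25, 64)                      # toy 5-way 5-shot episode
labels = torch.arange(5).repeat_interleave(5)      # [0,0,0,0,0,1,1,...,4]
query = torch.randn(15, 64)
print(prototypical_predict(support, labels, query, num_classes=5))
```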

The ICCV 2023 paper Bayesian Prompt Learning for Image-Language Model Generalization by Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G M Snoek, Georgios Tzimiropoulos, and Brais Martinez is now available. Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution, which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization to unseen prompts, even across different datasets and domains.
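To make the variational view concrete, here is a minimal sketch, assuming a placeholder prompt length and embedding dimension and a frozen encoder elsewhere in the pipeline: the prompt tokens are modeled as a Gaussian, sampled with the reparameterization trick, and regularized with a KL term towards a standard-normal prior. It illustrates the general idea, not the paper’s exact formulation.

```python
import torch
import torch.nn as nn

class VariationalPrompt(nn.Module):
    def __init__(self, prompt_len=4, dim=512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(prompt_len, dim))       # posterior mean
        self.log_var = nn.Parameter(torch.zeros(prompt_len, dim))  # posterior log-variance

    def forward(self):
        std = torch.exp(0.5 * self.log_var)
        prompt = self.mu + std * torch.randn_like(std)     # reparameterized sample
        # KL divergence between N(mu, sigma^2) and the standard-normal prior
        kl = 0.5 * (self.mu.pow(2) + self.log_var.exp() - 1.0 - self.log_var).sum()
        return prompt, kl

prompt_dist = VariationalPrompt()
prompt_tokens, kl_term = prompt_dist()
# total_loss = task_loss(frozen_encoder(prompt_tokens, ...), labels) + beta * kl_term
```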

The ICCV 2023 paper Detecting Objects with Graph Priors and Graph Refinement by Aritra Bhowmik, Martin R Oswald, Yu Wang, Nora Baka, Cees G M Snoek is now available. The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships.
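The following toy sketch (not the paper’s energy-based refinement) illustrates how a co-occurrence prior can be derived from annotation statistics and used to softly adjust per-class scores; the annotation list, class count, and mixing weights are made up for illustration.

```python
import numpy as np

NUM_CLASSES = 4
# toy annotations: the class indices present in each training image (assumed data)
images = [[0, 1], [0, 1, 2], [1, 2], [0, 3]]

cooc = np.zeros((NUM_CLASSES, NUM_CLASSES))
for labels in images:
    for a in labels:
        for b in labels:
            if a != b:
                cooc[a, b] += 1
prior = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)   # row-normalized co-occurrence

scores = np.array([0.9, 0.2, 0.1, 0.05])            # initial per-class confidences for one box
refined = 0.8 * scores + 0.2 * (prior.T @ scores)   # boost classes supported by co-occurrence
print(refined)
```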

The ICCV 2023 paper Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization by Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek is now available. We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos which we refer to as tubelets. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions.
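To illustrate the tubelet idea in isolation, here is a toy sketch, with random arrays standing in for video clips and a bare-bones paste operation instead of the scaling and rotation transformations: the same synthetic patch trajectory is added to two different clips, giving a positive pair that shares motion but not appearance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C, P = 16, 112, 112, 3, 16               # frames, height, width, channels, patch size

def add_tubelet(video, patch, start=(10, 10), velocity=(4, 2)):
    """Overlay `patch` on each frame, moving it by `velocity` pixels per frame."""
    out = video.copy()
    y, x = start
    for t in range(video.shape[0]):
        out[t, y:y + P, x:x + P] = patch          # simple paste; real pipelines blend/transform
        y = min(y + velocity[0], H - P)
        x = min(x + velocity[1], W - P)
    return out

clip_a, clip_b = rng.random((T, H, W, C)), rng.random((T, H, W, C))
patch = rng.random((P, P, C))
# clips with different appearance but identical tubelet motion form a positive pair
pos_a, pos_b = add_tubelet(clip_a, patch), add_tubelet(clip_b, patch)
```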

The ICCV 2023 paper Self-Ordering Point Clouds by Pengwan Yang, Cees G M Snoek, and Yuki M Asano is now available. In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories.
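As a rough intuition for point-wise ordering, the sketch below scores every point with a small network and ranks points by score, so the top-k prefix forms a representative subset; it uses a plain (non-differentiable) argsort for clarity, whereas the paper contributes a differentiable scoring-sorting strategy. The scorer architecture and sizes are assumptions.

```python
import torch
import torch.nn as nn

# tiny per-point scoring network operating on raw xyz coordinates
scorer = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

points = torch.randn(1024, 3)                      # one toy point cloud
scores = scorer(points).squeeze(-1)                # [1024] importance score per point
order = torch.argsort(scores, descending=True)     # point-wise ordering (hard, for clarity)
subset = points[order[:128]]                       # representative subset: top-128 points
```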