2024
| Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: SIGMA: Sinkhorn-Guided Masked Video Modeling. In: ECCV, 2024. @inproceedings{SalehiECCV2024,
title = {SIGMA: Sinkhorn-Guided Masked Video Modeling},
author = {Mohammadreza Salehi and Michael Dorkenwald and Fida Mohammad Thoker and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://quva-lab.github.io/SIGMA/
https://arxiv.org/abs/2407.15447},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
booktitle = {ECCV},
abstract = {Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods. |
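The optimal-transport step described in the SIGMA abstract is, at its core, the Sinkhorn-Knopp normalization popularized by SwAV. Below is a minimal PyTorch sketch of how such balanced soft cluster assignments can be computed for a batch of space-time tube features; the function name, prototype count, and iteration budget are illustrative assumptions, not taken from the released code.

```python
import torch

@torch.no_grad()
def sinkhorn_assignments(scores, n_iters=3, eps=0.05):
    """Balanced soft cluster assignments via Sinkhorn-Knopp.

    scores: (B, K) similarities between B tube features and K prototypes.
    Returns (B, K) assignments whose prototype usage is roughly uniform
    across the batch, enforcing the high-entropy constraint in the abstract.
    """
    Q = torch.exp(scores / eps).T          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)    # rows: spread mass evenly over prototypes
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)    # columns: each sample is a distribution
        Q /= B
    return (Q * B).T                       # (B, K), rows sum to 1

# illustrative usage: projection-network features against learnable prototypes
feats = torch.nn.functional.normalize(torch.randn(256, 768), dim=1)
prototypes = torch.nn.functional.normalize(torch.randn(300, 768), dim=1)
targets = sinkhorn_assignments(feats @ prototypes.T)
```

The resulting rows of `targets` can then serve as the soft cluster-assignment targets for the symmetric prediction task mentioned in the abstract.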
| Sarah Rastegar, Mohammadreza Salehi, Yuki M Asano, Hazel Doughty, Cees G M Snoek: SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery. In: ECCV, 2024. @inproceedings{RastegarECCV2024,
title = {SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery},
author = {Sarah Rastegar and Mohammadreza Salehi and Yuki M Asano and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2408.14371
https://github.com/SarahRastegar/SelEx},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
booktitle = {ECCV},
abstract = {In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called `self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide `soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as `hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called `self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide `soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as `hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets. |
| Luc Sträter, Mohammadreza Salehi, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features. In: ECCV, 2024. @inproceedings{StraterECCV2024,
title = {GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features},
author = {Luc Sträter and Mohammadreza Salehi and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2407.12427},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
booktitle = {ECCV},
abstract = {In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training set, like unseen objects in self-driving cars. In contrast, industrial anomalies are subtle defects that preserve semantic meaning, such as cracks in airplane components. In this paper, we present GeneralAD, an anomaly detection framework designed to operate in semantic, near-distribution, and industrial settings with minimal per-task adjustments. In our approach, we capitalize on the inherent design of Vision Transformers, which are trained on image patches, thereby ensuring that the last hidden states retain a patch-based structure. We propose a novel self-supervised anomaly generation module that employs straightforward operations like noise addition and shuffling to patch features to construct pseudo-abnormal samples. These features are fed to an attention-based discriminator, which is trained to score every patch in the image. With this, our method can both accurately identify anomalies at the image level and also generate interpretable anomaly maps. We extensively evaluated our approach on ten datasets, achieving state-of-the-art results in six and on-par performance in the remaining for both localization and detection tasks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training set, like unseen objects in self-driving cars. In contrast, industrial anomalies are subtle defects that preserve semantic meaning, such as cracks in airplane components. In this paper, we present GeneralAD, an anomaly detection framework designed to operate in semantic, near-distribution, and industrial settings with minimal per-task adjustments. In our approach, we capitalize on the inherent design of Vision Transformers, which are trained on image patches, thereby ensuring that the last hidden states retain a patch-based structure. We propose a novel self-supervised anomaly generation module that employs straightforward operations like noise addition and shuffling to patch features to construct pseudo-abnormal samples. These features are fed to an attention-based discriminator, which is trained to score every patch in the image. With this, our method can both accurately identify anomalies at the image level and also generate interpretable anomaly maps. We extensively evaluated our approach on ten datasets, achieving state-of-the-art results in six and on-par performance in the remaining for both localization and detection tasks. |
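The self-supervised anomaly generation module is described as applying simple distortions, noise addition and shuffling, to patch features. A minimal sketch of that idea is given below; the split between the two distortions, the noise scale, and the fraction of patches touched are assumptions for illustration only.

```python
import torch

def make_pseudo_anomalies(patch_feats, noise_std=0.25, distort_frac=0.25):
    """Create pseudo-abnormal samples from normal ViT patch features.

    patch_feats: (B, N, D) last hidden states of a frozen ViT.
    Returns distorted features and a per-patch label (1 = distorted).
    """
    B, N, D = patch_feats.shape
    distorted = patch_feats.clone()
    labels = torch.zeros(B, N)

    n_distort = max(1, int(distort_frac * N))
    for b in range(B):
        idx = torch.randperm(N)[:n_distort]
        half = n_distort // 2
        # half of the selected patches receive additive Gaussian noise ...
        distorted[b, idx[:half]] += noise_std * torch.randn(half, D)
        # ... the remainder are shuffled among themselves
        perm = idx[half:][torch.randperm(n_distort - half)]
        distorted[b, idx[half:]] = patch_feats[b, perm]
        labels[b, idx] = 1.0
    return distorted, labels

feats, patch_labels = make_pseudo_anomalies(torch.randn(4, 196, 768))
```

The distorted features and per-patch labels would then train an attention-based discriminator that scores every patch, which is what yields the interpretable anomaly maps.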
| Sameer Ambekar, Zehao Xiao, Jiayi Shen, Xiantong Zhen, Cees G M Snoek: Probabilistic Test-Time Generalization by Variational Neighbor-Labeling. In: CoLLAs, 2024. @inproceedings{AmberkarColla2024,
title = {Probabilistic Test-Time Generalization by Variational Neighbor-Labeling},
author = {Sameer Ambekar and Zehao Xiao and Jiayi Shen and Xiantong Zhen and Cees G M Snoek},
url = {https://arxiv.org/abs/2307.04033},
year = {2024},
date = {2024-07-29},
urldate = {2023-07-15},
booktitle = {CoLLAs},
abstract = {This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains. We follow the strict separation of source training and target testing, but exploit the value of the unlabeled target data itself during inference. We make three contributions. First, we propose probabilistic pseudo-labeling of target samples to generalize the source-trained model to the target domain at test time. We formulate the generalization at test time as a variational inference problem, by modeling pseudo labels as distributions, to consider the uncertainty during generalization and alleviate the misleading signal of inaccurate pseudo labels. Second, we learn variational neighbor labels that incorporate the information of neighboring target samples to generate more robust pseudo labels. Third, to learn the ability to incorporate more representative target information and generate more precise and robust variational neighbor labels, we introduce a meta-generalization stage during training to simulate the generalization procedure. Experiments on seven widely-used datasets demonstrate the benefits, abilities, and effectiveness of our proposal.},
howpublished = {arXiv:2307.04033},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains. We follow the strict separation of source training and target testing, but exploit the value of the unlabeled target data itself during inference. We make three contributions. First, we propose probabilistic pseudo-labeling of target samples to generalize the source-trained model to the target domain at test time. We formulate the generalization at test time as a variational inference problem, by modeling pseudo labels as distributions, to consider the uncertainty during generalization and alleviate the misleading signal of inaccurate pseudo labels. Second, we learn variational neighbor labels that incorporate the information of neighboring target samples to generate more robust pseudo labels. Third, to learn the ability to incorporate more representative target information and generate more precise and robust variational neighbor labels, we introduce a meta-generalization stage during training to simulate the generalization procedure. Experiments on seven widely-used datasets demonstrate the benefits, abilities, and effectiveness of our proposal. |
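As a rough illustration of the neighbor-labeling idea, and not the variational formulation or the meta-generalization stage of the paper itself, the sketch below softens each target sample's pseudo label by averaging the predictions of its nearest neighbors in feature space; all names and the choice of cosine similarity are assumptions.

```python
import torch
import torch.nn.functional as F

def neighbor_pseudo_labels(features, logits, k=5, temperature=0.1):
    """Soften each target sample's pseudo label with its k nearest neighbors.

    features: (N, D) target-batch embeddings, logits: (N, C) classifier outputs.
    Returns (N, C) pseudo-label distributions averaged over each neighborhood,
    a deterministic stand-in for the variational neighbor labels of the paper.
    """
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.T                           # cosine similarities
    idx = sim.topk(k + 1, dim=1).indices[:, 1:]     # drop the sample itself
    probs = F.softmax(logits / temperature, dim=1)
    return probs[idx].mean(dim=1)                   # (N, k, C) -> (N, C)

pseudo = neighbor_pseudo_labels(torch.randn(32, 128), torch.randn(32, 10))
```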
| Zenglin Shi, Pascal Mettes, Cees G M Snoek: Focus for Free in Density-Based Counting. In: International Journal of Computer Vision, vol. 132, iss. 7, pp. 2600-2617, 2024. @article{ShiIJCV2024,
title = {Focus for Free in Density-Based Counting},
author = {Zenglin Shi and Pascal Mettes and Cees G M Snoek},
url = {https://doi.org/10.1007/s11263-024-01990-3
https://arxiv.org/abs/2306.05129},
year = {2024},
date = {2024-07-01},
urldate = {2024-01-01},
journal = {International Journal of Computer Vision},
volume = {132},
issue = {7},
pages = {2600-2617},
abstract = {This work considers supervised learning to count from images and their corresponding point annotations. Where density-based counting methods typically use the point annotations only to create Gaussian-density maps, which act as the supervision signal, the starting point of this work is that point annotations have counting potential beyond density map generation. We introduce two methods that repurpose the available point annotations to enhance counting performance. The first is a counting-specific augmentation that leverages point annotations to simulate occluded objects in both input and density images to enhance the network's robustness to occlusions. The second method, foreground distillation, generates foreground masks from the point annotations, from which we train an auxiliary network on images with blacked-out backgrounds. By doing so, it learns to extract foreground counting knowledge without interference from the background. These methods can be seamlessly integrated with existing counting advances and are adaptable to different loss functions. We demonstrate complementary effects of the approaches, allowing us to achieve robust counting results even in challenging scenarios such as background clutter, occlusion, and varying crowd densities. Our proposed approach achieves strong counting results on multiple datasets, including ShanghaiTech Part_A and Part_B, UCF_QNRF, JHU-Crowd++, and NWPU-Crowd.},
howpublished = {arXiv:2306.05129},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This work considers supervised learning to count from images and their corresponding point annotations. Where density-based counting methods typically use the point annotations only to create Gaussian-density maps, which act as the supervision signal, the starting point of this work is that point annotations have counting potential beyond density map generation. We introduce two methods that repurpose the available point annotations to enhance counting performance. The first is a counting-specific augmentation that leverages point annotations to simulate occluded objects in both input and density images to enhance the network's robustness to occlusions. The second method, foreground distillation, generates foreground masks from the point annotations, from which we train an auxiliary network on images with blacked-out backgrounds. By doing so, it learns to extract foreground counting knowledge without interference from the background. These methods can be seamlessly integrated with existing counting advances and are adaptable to different loss functions. We demonstrate complementary effects of the approaches, allowing us to achieve robust counting results even in challenging scenarios such as background clutter, occlusion, and varying crowd densities. Our proposed approach achieves strong counting results on multiple datasets, including ShanghaiTech Part_A and Part_B, UCF_QNRF, JHU-Crowd++, and NWPU-Crowd. |
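For context, the Gaussian density maps that density-based counting methods regress to are typically built from the point annotations as below; this is standard practice rather than the paper's specific contribution, and the kernel width is an arbitrary example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_density(points, height, width, sigma=4.0):
    """Convert (x, y) point annotations into a Gaussian density map
    whose integral approximates the object count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            density[yi, xi] += 1.0
    return gaussian_filter(density, sigma=sigma, mode="constant")

pts = [(30.2, 40.7), (100.5, 80.1), (101.0, 82.3)]
dmap = points_to_density(pts, height=256, width=256)
print(dmap.sum())   # ~3.0, the number of annotated points
```

Because the map integrates to the count, the same point annotations can be repurposed further, which is the starting point for the occlusion augmentation and foreground masks proposed in the paper.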
| Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Low-Resource Vision Challenges for Foundation Models. In: CVPR, 2024, (Best paper FGVC2024 workshop.). @inproceedings{ZhangCVPR2024,
title = {Low-Resource Vision Challenges for Foundation Models},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2401.04716
https://xiaobai1217.github.io/Low-Resource-Vision/
https://uvaauas.figshare.com/articles/dataset/Low-Resource_Image_Transfer_Evaluation_Benchmark/25577145},
year = {2024},
date = {2024-06-17},
urldate = {2024-06-17},
booktitle = {CVPR},
abstract = {Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for machine learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we strive to address this gap and explore the challenges of low-resource image tasks with vision foundation models. Thus, we first collect a benchmark of genuinely low-resource image data, covering historic maps, circuit diagrams, and mechanical drawings. These low-resource settings all share the three challenges of data scarcity, fine-grained differences, and the distribution shift from natural images to the specialized domain of interest. While existing foundation models have shown impressive generalizability, we find they cannot transfer well to our low-resource tasks. To begin to tackle the challenges of low-resource vision, we introduce one simple baseline per challenge. Specifically, we propose to i) enlarge the data space by generative models, ii) adopt the best sub-kernels to encode local regions for fine-grained difference discovery and iii) learn attention for specialized domains. Experiments on the three low-resource data sources in our benchmark demonstrate our proposals already provide a better baseline than common transfer learning, data augmentation, and fine-grained methods. This highlights the unique characteristics and challenges of low-resource vision for foundation models that warrant further investigation.},
howpublished = {arXiv:2401.04716},
note = {Best paper FGVC2024 workshop.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for machine learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we strive to address this gap and explore the challenges of low-resource image tasks with vision foundation models. Thus, we first collect a benchmark of genuinely low-resource image data, covering historic maps, circuit diagrams, and mechanical drawings. These low-resource settings all share the three challenges of data scarcity, fine-grained differences, and the distribution shift from natural images to the specialized domain of interest. While existing foundation models have shown impressive generalizability, we find they cannot transfer well to our low-resource tasks. To begin to tackle the challenges of low-resource vision, we introduce one simple baseline per challenge. Specifically, we propose to i) enlarge the data space by generative models, ii) adopt the best sub-kernels to encode local regions for fine-grained difference discovery and iii) learn attention for specialized domains. Experiments on the three low-resource data sources in our benchmark demonstrate our proposals already provide a better baseline than common transfer learning, data augmentation, and fine-grained methods. This highlights the unique characteristics and challenges of low-resource vision for foundation models that warrant further investigation. |
| Michael Dorkenwald, Nimrod Barazani, Cees G M Snoek, Yuki M Asano: PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs. In: CVPR, 2024. @inproceedings{DorkenwaldCVPR2024,
title = {PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs},
author = {Michael Dorkenwald and Nimrod Barazani and Cees G M Snoek and Yuki M Asano},
url = {https://quva-lab.github.io/PIN/
https://arxiv.org/abs/2402.08657},
year = {2024},
date = {2024-06-17},
urldate = {2024-02-13},
booktitle = {CVPR},
abstract = {Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.},
howpublished = {arXiv:2402.08657},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons. |
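The core mechanism, a small learnable tensor added to the frozen VLM's patch embeddings while all pretrained weights stay fixed, can be sketched as follows; the module name, shapes, and optimizer settings are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class PositionalInsert(nn.Module):
    """Input-agnostic learnable spatial prompt added to frozen patch embeddings."""
    def __init__(self, num_patches, dim):
        super().__init__()
        self.insert = nn.Parameter(torch.zeros(1, num_patches, dim))
        nn.init.trunc_normal_(self.insert, std=0.02)

    def forward(self, patch_embeds):          # (B, N, D) from a frozen vision encoder
        return patch_embeds + self.insert     # only self.insert receives gradients

# illustrative training setup: freeze the VLM, optimize the PIN parameters only
pin = PositionalInsert(num_patches=196, dim=1024)
optimizer = torch.optim.AdamW(pin.parameters(), lr=1e-3)
```

Only the insert receives gradients from the next-token prediction loss, which is what keeps the approach free of new output heads or detection supervision.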
| Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, Cees G M Snoek: Any-Shift Prompting for Generalization over Distributions. In: CVPR, 2024. @inproceedings{XiaoCVPR2024,
title = {Any-Shift Prompting for Generalization over Distributions},
author = {Zehao Xiao and Jiayi Shen and Mohammad Mahdi Derakhshani and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2402.10099},
year = {2024},
date = {2024-06-17},
urldate = {2024-02-15},
booktitle = {CVPR},
abstract = {Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts.},
howpublished = {arXiv:2402.10099},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts. |
| Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R Oswald, Cees G M Snoek, Xinlei Chen: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels. arXiv:2406.09415, 2024. @unpublished{NguyenArxiv2024,
title = {An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels},
author = {Duy-Kien Nguyen and Mahmoud Assran and Unnat Jain and Martin R Oswald and Cees G M Snoek and Xinlei Chen},
url = {https://arxiv.org/abs/2406.09415},
year = {2024},
date = {2024-06-13},
abstract = {This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.},
howpublished = {arXiv:2406.09415},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision. |
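A minimal sketch of the pixels-as-tokens setup, assuming a plain Transformer encoder and a classification head; layer sizes are arbitrary, and the one-token-per-pixel sequence grows quadratically with resolution, which is exactly the computational caveat raised in the abstract.

```python
import torch
import torch.nn as nn

class PixelTransformer(nn.Module):
    """Vanilla Transformer over individual pixels instead of 16x16 patches."""
    def __init__(self, in_chans=3, dim=192, depth=6, heads=3, num_classes=10):
        super().__init__()
        self.embed = nn.Linear(in_chans, dim)           # one token per pixel
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                          # (B, C, H, W)
        B, C, H, W = images.shape
        tokens = images.flatten(2).transpose(1, 2)      # (B, H*W, C)
        x = self.encoder(self.embed(tokens))            # (B, H*W, dim)
        return self.head(x.mean(dim=1))                 # mean-pool then classify

logits = PixelTransformer()(torch.randn(2, 3, 32, 32))  # 1024 pixel tokens per image
```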
| Sarah Rastegar, Hazel Doughty, Cees G M Snoek: Background No More: Action Recognition Across Domains by Causal Interventions. In: Computer Vision and Image Understanding, vol. 242, 2024. @article{RastegarCVIU2024,
title = {Background No More: Action Recognition Across Domains by Causal Interventions},
author = {Sarah Rastegar and Hazel Doughty and Cees G M Snoek},
url = {https://doi.org/10.1016/j.cviu.2024.103975},
year = {2024},
date = {2024-05-01},
urldate = {2024-01-01},
journal = {Computer Vision and Image Understanding},
volume = {242},
abstract = {We aim to recognize actions under an appearance distribution-shift between a source training-domain and target test-domain. To enable such video domain generalization, our key idea is to intervene on the action to remove the confounding effect of the domain-background on the class label using causal inference. Towards this, we propose to learn a causally debiased model on a source domain that intervenes on the action through three possible $Do$-operators which separate the action and background. To better align the source and target distributions we also introduce a test-time action intervention. Experiments on two challenging video domain generalization benchmarks reveal that causal inference is a promising tool for action recognition as it already achieves state-of-the-art results on Kinetics2Mimetics, the benchmark with the largest domain shift.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
We aim to recognize actions under an appearance distribution-shift between a source training-domain and target test-domain. To enable such video domain generalization, our key idea is to intervene on the action to remove the confounding effect of the domain-background on the class label using causal inference. Towards this, we propose to learn a causally debiased model on a source domain that intervenes on the action through three possible $Do$-operators which separate the action and background. To better align the source and target distributions we also introduce a test-time action intervention. Experiments on two challenging video domain generalization benchmarks reveal that causal inference is a promising tool for action recognition as it already achieves state-of-the-art results on Kinetics2Mimetics, the benchmark with the largest domain shift. |
| Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R Oswald, Alexander Kirillov, Cees G M Snoek, Xinlei Chen: R-MAE: Regions Meet Masked Autoencoders. In: ICLR, 2024. @inproceedings{NguyenICLR2024,
title = {R-MAE: Regions Meet Masked Autoencoders},
author = {Duy-Kien Nguyen and Vaibhav Aggarwal and Yanghao Li and Martin R Oswald and Alexander Kirillov and Cees G M Snoek and Xinlei Chen},
url = {https://arxiv.org/abs/2306.05411
https://github.com/facebookresearch/r-mae},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
booktitle = {ICLR},
abstract = {In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. |
| Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J Burghouts, Efstratios Gavves, Cees G M Snoek, David W Zhang: Graph Neural Networks for Learning Equivariant Representations of Neural Networks. In: ICLR, 2024, (Oral presentation). @inproceedings{KofinasICLR2024,
title = {Graph Neural Networks for Learning Equivariant Representations of Neural Networks},
author = {Miltiadis Kofinas and Boris Knyazev and Yan Zhang and Yunlu Chen and Gertjan J Burghouts and Efstratios Gavves and Cees G M Snoek and David W Zhang},
url = {https://github.com/mkofinas/neural-graphs
https://arxiv.org/abs/2403.12143},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
booktitle = {ICLR},
abstract = {Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. |
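The central representation, turning a network's parameters into a graph whose nodes are neurons (carrying biases) and whose edges are weights, can be sketched for a simple MLP as follows; the exact feature layout in the paper and its released code may differ, so treat this purely as an illustration of the permutation-symmetric encoding.

```python
import torch
import torch.nn as nn

def mlp_to_graph(mlp):
    """Turn an nn.Sequential of Linear layers into node and edge features.

    Nodes are the neurons of every layer (input neurons get a zero bias);
    edges are the weights connecting consecutive layers.
    """
    linears = [m for m in mlp if isinstance(m, nn.Linear)]
    sizes = [linears[0].in_features] + [l.out_features for l in linears]
    offsets = torch.tensor([0] + sizes).cumsum(0)

    node_feats = [torch.zeros(sizes[0], 1)]             # input neurons: no bias
    edge_index, edge_feats = [], []
    for i, layer in enumerate(linears):
        node_feats.append(layer.bias.detach().unsqueeze(1))
        w = layer.weight.detach()                        # (out, in)
        src = offsets[i] + torch.arange(sizes[i]).repeat(sizes[i + 1])
        dst = offsets[i + 1] + torch.arange(sizes[i + 1]).repeat_interleave(sizes[i])
        edge_index.append(torch.stack([src, dst]))
        edge_feats.append(w.flatten().unsqueeze(1))      # row-major order matches dst
    return torch.cat(node_feats), torch.cat(edge_index, dim=1), torch.cat(edge_feats)

nodes, edges, weights = mlp_to_graph(nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)))
```

Permuting the hidden neurons of the MLP permutes the graph's nodes and edges consistently, which is exactly the symmetry a GNN or transformer over this graph preserves.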
| Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G M Snoek: Training-Free Semantic Segmentation via LLM-Supervision. arXiv:2404.00701, 2024. @unpublished{SunArxiv2024,
title = {Training-Free Semantic Segmentation via LLM-Supervision},
author = {Wenfang Sun and Yingjun Du and Gaowen Liu and Ramana Kompella and Cees G M Snoek},
url = {https://arxiv.org/abs/2404.00701},
year = {2024},
date = {2024-04-01},
abstract = {Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptors. This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM) that does not require extra training. Our method starts from an LLM, like GPT-3, to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, resulting in diverse segmentation results tailored to each subclass's unique characteristics. Additionally, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method outperforms traditional text-supervised semantic segmentation methods by a marked margin.},
howpublished = {arXiv:2404.00701},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptors. This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM) that does not require extra training. Our method starts from an LLM, like GPT-3, to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, resulting in diverse segmentation results tailored to each subclass's unique characteristics. Additionally, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method outperforms traditional text-supervised semantic segmentation methods by a marked margin. |
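The assembly step that merges per-subclass segmentation maps can be sketched with a pixel-wise maximum, one plausible reading of "merges the segmentation maps from the various subclass descriptors"; the subclass names and the choice of max-pooling are assumptions, and the text-supervised segmentation backend is left abstract.

```python
import torch

def merge_subclass_maps(subclass_maps):
    """Merge per-subclass score maps into a single map for the parent class.

    subclass_maps: dict mapping a subclass name (e.g. LLM-generated
    'sports car' / 'pickup truck' for the class 'car') to an (H, W) score map.
    A pixel is assigned to the parent class if any subclass responds strongly.
    """
    stacked = torch.stack(list(subclass_maps.values()))   # (S, H, W)
    return stacked.max(dim=0).values                      # pixel-wise maximum

maps = {"sports car": torch.rand(64, 64), "pickup truck": torch.rand(64, 64)}
car_map = merge_subclass_maps(maps)
```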
| Vincent Tao Hu, Di Wu, Yuki M Asano, Pascal Mettes, Basura Fernando, Björn Ommer, Cees G M Snoek: Flow Matching for Conditional Text Generation in a Few Sampling Steps. In: EACL, 2024. @inproceedings{HuEACL2024,
title = {Flow Matching for Conditional Text Generation in a Few Sampling Steps},
author = {Vincent Tao Hu and Di Wu and Yuki M Asano and Pascal Mettes and Basura Fernando and Björn Ommer and Cees G M Snoek},
url = {https://aclanthology.org/2024.eacl-short.33.pdf},
year = {2024},
date = {2024-03-27},
urldate = {2024-03-27},
booktitle = {EACL},
abstract = {Diffusion models are a promising tool for high-quality text generation. However, current models face multiple drawbacks including slow sampling, noise schedule sensitivity, and misalignment between the training and sampling stages. In this paper, we introduce FlowSeq, which bypasses all current drawbacks by leveraging flow matching for conditional text generation. FlowSeq can generate text in a few steps by training with a novel anchor loss, alleviating the need for expensive hyperparameter optimization of the noise schedule prevalent in diffusion models. We extensively evaluate our proposed method and show competitive performance in tasks such as question generation, open-domain dialogue, and paraphrasing.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Diffusion models are a promising tool for high-quality text generation. However, current models face multiple drawbacks including slow sampling, noise schedule sensitivity, and misalignment between the training and sampling stages. In this paper, we introduce FlowSeq, which bypasses all current drawbacks by leveraging flow matching for conditional text generation. FlowSeq can generate text in a few steps by training with a novel anchor loss, alleviating the need for expensive hyperparameter optimization of the noise schedule prevalent in diffusion models. We extensively evaluate our proposed method and show competitive performance in tasks such as question generation, open-domain dialogue, and paraphrasing. |
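The flow-matching training objective that FlowSeq builds on can be written generically as below, regressing a constant target velocity along the straight path between noise and data; this generic form, with illustrative names, is not the paper's text-specific architecture or its anchor loss.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching on the straight (constant-velocity) path.

    x1: (B, ...) data embeddings; cond: conditioning passed to the model.
    The model predicts the velocity v(x_t, t, cond); the target is x1 - x0.
    """
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    v_target = x1 - x0                               # constant velocity
    v_pred = model(xt, t.flatten(), cond)
    return ((v_pred - v_target) ** 2).mean()

# illustrative usage with a stand-in velocity network
net = lambda xt, t, cond: xt - cond                  # placeholder for a real Transformer
loss = flow_matching_loss(net, x1=torch.randn(8, 32, 64), cond=torch.zeros(8, 32, 64))
```

Sampling then integrates the learned velocity field with a handful of Euler steps, which is where the few-step generation comes from.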
| Vincent Tao Hu, David W Zhang, Mang Tang, Pascal Mettes, Deli Zhao, Cees G M Snoek: Latent Space Editing in Transformer-Based Flow Matching. In: AAAI Conference on Artificial Intelligence, 2024. @inproceedings{HuAAAI2024,
title = {Latent Space Editing in Transformer-Based Flow Matching},
author = {Vincent Tao Hu and David W Zhang and Mang Tang and Pascal Mettes and Deli Zhao and Cees G M Snoek},
url = {https://arxiv.org/abs/2312.10825},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
booktitle = {AAAI Conference on Artificial Intelligence},
abstract = {This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix. |
| Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight. In: International Journal of Computer Vision, 2024, (Pending major revision.). @article{ZhangArxive2023,
title = {Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2212.02053},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {International Journal of Computer Vision},
abstract = {This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards the lower color contrast at test-time. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes easy to obtain unlabeled and task-irrelevant dark videos to improve an activity recognizer in low light. As the lower color contrast results in visual information loss, we further propose to incorporate the complementary activity information within audio, which is invariant to illumination. Since the usefulness of audio and visual features differs depending on the amount of illumination, we introduce our `darkness-adaptive' audio-visual recognizer. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate our proposals are superior to image enhancement, domain adaptation and alternative audio-visual fusion methods, and can even improve robustness to local darkness caused by occlusions. },
note = {Pending major revision.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards the lower color contrast at test-time. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes easy to obtain unlabeled and task-irrelevant dark videos to improve an activity recognizer in low light. As the lower color contrast results in visual information loss, we further propose to incorporate the complementary activity information within audio, which is invariant to illumination. Since the usefulness of audio and visual features differs depending on the amount of illumination, we introduce our `darkness-adaptive' audio-visual recognizer. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate our proposals are superior to image enhancement, domain adaptation and alternative audio-visual fusion methods, and can even improve robustness to local darkness caused by occlusions. |
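Hypothetically, a darkness-adaptive fusion of the two modalities could be as simple as an illumination-dependent gate over the visual and audio predictions, as sketched below; the paper's actual recognizer is learned and considerably more elaborate, so this only conveys the intuition, and all names are assumptions.

```python
import torch

def darkness_adaptive_fusion(visual_logits, audio_logits, brightness):
    """Illustrative gate: rely more on audio as the estimated illumination drops.

    brightness: (B,) values in [0, 1], where 1 is a well-lit frame and 0 is dark.
    """
    w = brightness.clamp(0, 1).unsqueeze(1)           # (B, 1) weight on the visual stream
    return w * visual_logits + (1 - w) * audio_logits

fused = darkness_adaptive_fusion(torch.randn(4, 10), torch.randn(4, 10), torch.rand(4))
```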
| Yingjun Du, Haoliang Sun, Xiantong Zhen, Jun Xu, Yilong Yin, Ling Shao, Cees G M Snoek: MetaKernel: Learning Variational Random Features with Limited Labels. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, iss. 3, pp. 1464-1478, 2024. @article{DuPAMI24,
title = {MetaKernel: Learning Variational Random Features with Limited Labels},
author = {Yingjun Du and Haoliang Sun and Xiantong Zhen and Jun Xu and Yilong Yin and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2105.03781},
doi = {https://doi.org/10.1109/TPAMI.2022.3154930},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {46},
issue = {3},
pages = {1464-1478},
abstract = {Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. The crux of few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. In this paper, we propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel. Specifically, we propose learning variational random features in a data-driven manner to obtain task-specific kernels by leveraging the shared knowledge provided by related tasks in a meta-learning setting. We treat the random feature basis as the latent variable, which is estimated by variational inference. The shared knowledge from related tasks is incorporated into a context inference of the posterior, which we achieve via a long short-term memory module. To establish more expressive kernels, we deploy conditional normalizing flows based on coupling layers to achieve a richer posterior distribution over random Fourier bases. The resultant kernels are more informative and discriminative, which further improves the few-shot learning. To evaluate our method, we conduct extensive experiments on both few-shot image classification and regression tasks. A thorough ablation study demonstrates the effectiveness of each introduced component in our method. The benchmark results on fourteen datasets demonstrate MetaKernel consistently delivers at least comparable and often better performance than state-of-the-art alternatives.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. The crux of few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. In this paper, we propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel. Specifically, we propose learning variational random features in a data-driven manner to obtain task-specific kernels by leveraging the shared knowledge provided by related tasks in a meta-learning setting. We treat the random feature basis as the latent variable, which is estimated by variational inference. The shared knowledge from related tasks is incorporated into a context inference of the posterior, which we achieve via a long short-term memory module. To establish more expressive kernels, we deploy conditional normalizing flows based on coupling layers to achieve a richer posterior distribution over random Fourier bases. The resultant kernels are more informative and discriminative, which further improves the few-shot learning. To evaluate our method, we conduct extensive experiments on both few-shot image classification and regression tasks. A thorough ablation study demonstrates the effectiveness of each introduced component in our method. The benchmark results on fourteen datasets demonstrate MetaKernel consistently delivers at least comparable and often better performance than state-of-the-art alternatives. |
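For background, the random Fourier feature construction that MetaKernel makes variational and task-specific approximates an RBF kernel with random bases, as in the sketch below; the learned, flow-enriched posterior over bases described in the abstract is not reproduced here, and the fixed Gaussian bases are an assumption of this illustration.

```python
import math
import torch

def random_fourier_features(x, num_bases=256, lengthscale=1.0, seed=0):
    """Approximate an RBF kernel: k(x, y) ~= phi(x) @ phi(y).T."""
    g = torch.Generator().manual_seed(seed)
    d = x.shape[-1]
    omega = torch.randn(d, num_bases, generator=g) / lengthscale   # spectral samples
    b = 2 * math.pi * torch.rand(num_bases, generator=g)           # random phases
    return (2.0 / num_bases) ** 0.5 * torch.cos(x @ omega + b)

x = torch.randn(5, 16)
phi = random_fourier_features(x)
approx_kernel = phi @ phi.T   # ~= exp(-||xi - xj||^2 / 2) for lengthscale 1
```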
2023
| Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M Asano, Cees G M Snoek, Björn Ommer: Guided Diffusion from Self-Supervised Diffusion Features. arXiv:2312.08825, 2023. @unpublished{HuArxive2023,
title = {Guided Diffusion from Self-Supervised Diffusion Features},
author = {Vincent Tao Hu and Yunlu Chen and Mathilde Caron and Yuki M Asano and Cees G M Snoek and Björn Ommer},
url = {https://browse.arxiv.org/abs/2312.08825},
year = {2023},
date = {2023-12-14},
urldate = {2023-12-14},
abstract = {Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from the diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant velocity path of ODE to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baseline comparisons in large-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released.},
howpublished = {arXiv:2312.08825},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from the diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant velocity path of ODE to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baseline comparisons in large-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released. |
| Vincent Tao Hu, Wenzhe Yin, Pingchuan Ma, Yunlu Chen, Basura Fernando, Yuki M Asano, Efstratios Gavves, Pascal Mettes, Björn Ommer, Cees G M Snoek: Motion Flow Matching for Human Motion Synthesis and Editing. arXiv:2312.08895, 2023. @unpublished{HuArxive2023b,
title = {Motion Flow Matching for Human Motion Synthesis and Editing},
author = {Vincent Tao Hu and Wenzhe Yin and Pingchuan Ma and Yunlu Chen and Basura Fernando and Yuki M Asano and Efstratios Gavves and Pascal Mettes and Björn Ommer and Cees G M Snoek},
url = {https://browse.arxiv.org/abs/2312.08895},
year = {2023},
date = {2023-12-14},
urldate = {2023-12-14},
abstract = {Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT structure demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose Motion Flow Matching, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. What is more, we tailor a straightforward motion editing paradigm named sampling trajectory rewriting, leveraging the ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing. Our code will be released.},
howpublished = {arXiv:2312.08895},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT structure demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose Motion Flow Matching, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. What is more, we tailor a straightforward motion editing paradigm named sampling trajectory rewriting that leverages ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing. Our code will be released. |
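The ten-step sampling mentioned above reduces to integrating a learned velocity field with a coarse ODE solver. A minimal Euler sampler under that assumption is sketched below; velocity_model is a stand-in for whatever network predicts the flow, and this is an illustration rather than the released implementation.

import torch

@torch.no_grad()
def euler_sample(velocity_model, shape, n_steps=10, device="cpu"):
    # Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 (motion sample).
    x = torch.randn(shape, device=device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * velocity_model(x, t)   # one Euler step along the learned field
    return x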
| Aritra Bhowmik, Martin R Oswald, Pascal Mettes, Cees G M Snoek: Revisiting Proposal-based Object Detection. arXiv:2311.18512, 2023. @unpublished{BhowmikArxive2023,
title = {Revisiting Proposal-based Object Detection},
author = {Aritra Bhowmik and Martin R Oswald and Pascal Mettes and Cees G M Snoek},
url = {https://arxiv.org/abs/2311.18512},
year = {2023},
date = {2023-11-30},
urldate = {2023-11-30},
abstract = {This paper revisits the pipeline for detecting objects in images with proposals. For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes. The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem where we regress to the area of intersection between proposal and ground truth. In this way, each proposal only specifies which part contains the object, avoiding a blind inpainting problem where proposals need to be regressed beyond their visual scope. In turn, we replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of a proposal group surrounding an object. Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method. We show that our approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping.},
howpublished = {arXiv:2311.18512},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
This paper revisits the pipeline for detecting objects in images with proposals. For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes. The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem where we regress to the area of intersection between proposal and ground truth. In this way, each proposal only specifies which part contains the object, avoiding a blind inpainting problem where proposals need to be regressed beyond their visual scope. In turn, we replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of a proposal group surrounding an object. Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method. We show that our approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping. |
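Both ingredients described above, regressing to the proposal-ground-truth intersection and then taking the union over a proposal group, come down to simple box arithmetic. A plain-Python sketch with boxes as (x1, y1, x2, y2) tuples, for illustration only:

def intersection_box(proposal, gt):
    # Regression target: the overlap region between proposal and ground-truth box.
    x1, y1 = max(proposal[0], gt[0]), max(proposal[1], gt[1])
    x2, y2 = min(proposal[2], gt[2]), min(proposal[3], gt[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def union_box(boxes):
    # Final prediction: union over the regressed intersections of one proposal group.
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))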
| Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G M Snoek, Victor Rühle: Unlocking Spatial Comprehension in Text-to-Image Diffusion Models. arXiv:2311.17937, 2023. @unpublished{DerakhshaniArxive2023b,
title = {Unlocking Spatial Comprehension in Text-to-Image Diffusion Models},
author = {Mohammad Mahdi Derakhshani and Menglin Xia and Harkirat Behl and Cees G M Snoek and Victor Rühle},
url = {https://arxiv.org/abs/2311.17937},
year = {2023},
date = {2023-11-28},
urldate = {2023-11-28},
abstract = {We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as 'An image of a gray cat on the left of an orange dog', and generates corresponding images. This is especially important in order to provide more control to the user. CompFuser overcomes the limitation of existing text-to-image diffusion models by decoding the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters.},
howpublished = {arXiv:2311.17937},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as 'An image of a gray cat on the left of an orange dog', and generates corresponding images. This is especially important in order to provide more control to the user. CompFuser overcomes the limitation of existing text-to-image diffusion models by decoding the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters. |
| Duy-Kien Nguyen, Martin R Oswald, Cees G M Snoek: SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation. arXiv:2310.05920, 2023. @unpublished{NguyenArxive2023,
title = {SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation},
author = {Duy-Kien Nguyen and Martin R Oswald and Cees G M Snoek},
url = {https://arxiv.org/abs/2310.05920},
year = {2023},
date = {2023-10-09},
urldate = {2023-10-09},
abstract = {The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and/or pyramid design remain a key factor for their empirical success. In this paper, we show that this reliance on either feature pyramids or a hierarchical backbone is unnecessary, and that a transformer-based detector with scale-aware attention enables the plain detector 'SimPLR', whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is plain and simple, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales much better with bigger capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation as well as panoptic segmentation. Code will be released.},
howpublished = {arXiv:2310.05920},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and/or pyramid design remain a key factor for their empirical success. In this paper, we show that this reliance on either feature pyramids or a hierarchical backbone is unnecessary, and that a transformer-based detector with scale-aware attention enables the plain detector 'SimPLR', whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is plain and simple, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales much better with bigger capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation as well as panoptic segmentation. Code will be released. |
| Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G M Snoek, Marcel Worring, Yuki M Asano: Small Visual Language Models can also be Open-Ended Few-Shot Learners. arXiv:2310.00500, 2023. @unpublished{DerakhshaniArxive2023,
title = {Small Visual Language Models can also be Open-Ended Few-Shot Learners},
author = {Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Cees G M Snoek and Marcel Worring and Yuki M Asano},
url = {https://arxiv.org/abs/2310.00500},
year = {2023},
date = {2023-09-30},
urldate = {2023-09-30},
abstract = {We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks. Specifically, our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct the `self-context', a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image for which the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research in open-ended few-shot learning that otherwise requires access to large or proprietary models.},
howpublished = {arXiv:2310.00500},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks. Specifically, our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct the `self-context', a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image for which the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research in open-ended few-shot learning that otherwise requires access to large or proprietary models. |
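The self-context construction described above can be approximated with off-the-shelf clustering: group frozen image features, attach a semantically unrelated pseudo-name to each cluster, and interleave support image/pseudo-caption pairs with a query from the same cluster. The sketch below follows that recipe using scikit-learn KMeans and made-up nonsense words; build_episode, NONSENSE and the parameter names are illustrative assumptions, not the released code.

import random
import numpy as np
from sklearn.cluster import KMeans

NONSENSE = ["dax", "blicket", "wug", "toma", "fep", "zav"]

def build_episode(features, n_clusters=50, shots=4, seed=0):
    # features: (N, D) array of frozen image embeddings
    rng = random.Random(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
    names = {c: rng.choice(NONSENSE) for c in range(n_clusters)}      # unrelated pseudo-names
    cluster = rng.randrange(n_clusters)
    members = np.flatnonzero(labels == cluster)
    support = [(int(i), names[cluster]) for i in members[:shots]]     # (image index, pseudo-caption)
    query = int(members[min(shots, len(members) - 1)])                # model should caption this with names[cluster]
    return support, query, names[cluster]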
| Yingjun Du, Zehao Xiao, Shengcai Liao, Cees G M Snoek: ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion. In: NeurIPS, 2023. @inproceedings{DuNeurips2023,
title = {ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion},
author = {Yingjun Du and Zehao Xiao and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.14770},
year = {2023},
date = {2023-09-23},
urldate = {2023-06-27},
booktitle = {NeurIPS},
abstract = {Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling accurately obtaining the overfitted prototypes for individual tasks. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.},
howpublished = {arXiv:2306.14770},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling accurately obtaining the overfitted prototypes for individual tasks. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff. |
| Sarah Rastegar, Hazel Doughty, Cees G M Snoek: Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery. In: NeurIPS, 2023. @inproceedings{RastegarNeurips2023,
title = {Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery},
author = {Sarah Rastegar and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2310.19776
https://github.com/SarahRastegar/InfoSieve},
year = {2023},
date = {2023-09-23},
urldate = {2023-09-23},
booktitle = {NeurIPS},
abstract = {In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. |
| Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Learning Unseen Modality Interaction. In: NeurIPS, 2023. @inproceedings{ZhangNeurips2023,
title = {Learning Unseen Modality Interaction},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.12795
https://xiaobai1217.github.io/Unseen-Modality-Interaction/},
year = {2023},
date = {2023-09-22},
urldate = {2023-09-22},
booktitle = {NeurIPS},
abstract = {Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.},
howpublished = {arXiv:2306.12795},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. |
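The core mechanism above, projecting every modality into one shared space and accumulating whichever modalities are present by summation, fits in a few lines. Below is a minimal PyTorch sketch with assumed feature dimensions; the paper's full model additionally uses pseudo-supervision, which is omitted here.

import torch
import torch.nn as nn

class UnseenModalityFusion(nn.Module):
    def __init__(self, dims, shared=512, n_classes=10):
        # dims: dict mapping modality name -> input feature dimension, e.g. {"rgb": 2048, "audio": 512}
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared) for m, d in dims.items()})
        self.head = nn.Linear(shared, n_classes)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, dim) features; any subset of modalities works
        fused = sum(self.proj[m](x) for m, x in inputs.items())
        return self.head(fused)

At inference the same module accepts any subset of the training modalities, for example model({"rgb": rgb_features}).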
| Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G M Snoek, Georgios Tzimiropoulos, Brais Martinez: Bayesian Prompt Learning for Image-Language Model Generalization. In: ICCV, 2023. @inproceedings{DerakhshaniICCV2023,
title = {Bayesian Prompt Learning for Image-Language Model Generalization},
author = {Mohammad Mahdi Derakhshani and Enrique Sanchez and Adrian Bulat and Victor Guilherme Turrisi da Costa and Cees G M Snoek and Georgios Tzimiropoulos and Brais Martinez},
url = {https://arxiv.org/abs/2210.02390},
year = {2023},
date = {2023-07-14},
urldate = {2023-03-14},
booktitle = {ICCV},
abstract = {Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains.},
howpublished = {arXiv:2210.02390},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. |
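Treating the prompt as a latent variable boils down to learning a distribution over prompt tokens, sampling with the reparameterization trick, and adding a KL penalty to the task loss. The following unconditional sketch illustrates that idea under assumed shapes; the paper also covers image-conditional prompts, and the class and parameter names here are placeholders.

import torch
import torch.nn as nn

class VariationalPrompt(nn.Module):
    def __init__(self, n_tokens=4, dim=512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_tokens, dim))
        self.log_var = nn.Parameter(torch.full((n_tokens, dim), -4.0))

    def forward(self):
        std = torch.exp(0.5 * self.log_var)
        prompt = self.mu + std * torch.randn_like(std)   # reparameterized prompt sample
        kl = 0.5 * (self.mu.pow(2) + self.log_var.exp() - self.log_var - 1).sum()
        return prompt, kl                                # add a weighted kl term to the task loss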
| Aritra Bhowmik, Martin R Oswald, Yu Wang, Nora Baka, Cees G M Snoek: Detecting Objects with Graph Priors and Graph Refinement. In: ICCV, 2023. @inproceedings{BhowmikICCV2023,
title = {Detecting Objects with Graph Priors and Graph Refinement},
author = {Aritra Bhowmik and Martin R Oswald and Yu Wang and Nora Baka and Cees G M Snoek},
url = {https://arxiv.org/abs/2212.12395},
year = {2023},
date = {2023-07-14},
urldate = {2022-12-23},
booktitle = {ICCV},
abstract = {The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships.},
howpublished = {arXiv:2212.12395},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships. |
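The graph prior itself is essentially class co-occurrence statistics estimated from training annotations. Below is a much-simplified sketch of building such a prior and using it to nudge per-proposal class scores; the blending heuristic in refine_scores is illustrative, and the paper's energy-based modeling of the object-relation joint distribution is not shown.

import numpy as np

def cooccurrence_prior(label_sets, n_classes):
    # label_sets: iterable of sets of class ids present in each training image
    counts = np.zeros((n_classes, n_classes))
    for labels in label_sets:
        for i in labels:
            for j in labels:
                counts[i, j] += 1
    return counts / np.maximum(counts.diagonal()[:, None], 1)   # row i: how often class j co-occurs with i

def refine_scores(class_probs, prior, alpha=0.3):
    # class_probs: (num_proposals, n_classes); up-weight classes that co-occur with
    # the classes the detector is already confident about in this image
    context = class_probs.mean(axis=0) @ prior
    refined = class_probs * (1.0 - alpha + alpha * context[None, :])
    return refined / refined.sum(axis=1, keepdims=True)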
| Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek: Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization. In: ICCV, 2023. @inproceedings{ThokerICCV2023,
title = {Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization},
author = {Fida Mohammad Thoker and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2303.11003},
year = {2023},
date = {2023-07-14},
urldate = {2023-03-20},
booktitle = {ICCV},
abstract = {We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories, which we refer to as tubelets, to videos. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions.},
howpublished = {arXiv:2303.11003},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories, which we refer to as tubelets, to videos. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions. |
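A tubelet can be simulated by pasting a small patch onto consecutive frames along a synthetic trajectory. The bare-bones sketch below shows linear motion only; the scaling and rotation transformations mentioned in the abstract are omitted, and the function and argument names are illustrative rather than the authors' code.

import numpy as np

def paste_tubelet(frames, patch, start_xy, velocity):
    # frames: (T, H, W, C) array; patch: (h, w, C) array; velocity: (vx, vy) pixels per frame
    out = frames.copy()
    T, H, W = out.shape[:3]
    h, w = patch.shape[:2]
    x, y = start_xy
    for t in range(T):
        yi, xi = int(round(y)), int(round(x))
        y2, x2 = min(yi + h, H), min(xi + w, W)
        if 0 <= yi < y2 and 0 <= xi < x2:          # clip the paste region to the frame
            out[t, yi:y2, xi:x2] = patch[: y2 - yi, : x2 - xi]
        x, y = x + velocity[0], y + velocity[1]
    return out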
| Pengwan Yang, Cees G M Snoek, Yuki M Asano: Self-Ordering Point Clouds. In: ICCV, 2023, (Oral presentation). @inproceedings{YangICCV2023,
title = {Self-Ordering Point Clouds},
author = {Pengwan Yang and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2304.00961},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories.},
howpublished = {arXiv:2304.00961},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories. |
| Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees G M Snoek: Order-preserving Consistency Regularization for Domain Adaptation and Generalization. In: ICCV, 2023. @inproceedings{JingICCV2023,
title = {Order-preserving Consistency Regularization for Domain Adaptation and Generalization},
author = {Mengmeng Jing and Xiantong Zhen and Jingjing Li and Cees G M Snoek},
url = {https://arxiv.org/abs/2309.13258},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lighting, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization is commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization forces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lighting, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization is commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization forces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks. |
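An order-preserving constraint can be written as a pairwise ranking penalty: whenever one view ranks class i above class j, the other view is penalized for reversing that order. The sketch below is one generic way to write such a surrogate, not necessarily the exact loss used in the paper.

import torch

def order_preserving_loss(logits_a, logits_b, margin=0.0):
    p_a = logits_a.softmax(dim=-1)
    p_b = logits_b.softmax(dim=-1)
    diff_a = p_a.unsqueeze(-1) - p_a.unsqueeze(-2)   # (B, C, C) pairwise gaps in view A
    diff_b = p_b.unsqueeze(-1) - p_b.unsqueeze(-2)
    # penalize view B whenever its pairwise class ordering disagrees with view A's
    return torch.relu(margin - torch.sign(diff_a).detach() * diff_b).mean()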
| Mohammadreza Salehi, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations. In: ICCV, 2023. @inproceedings{SalehiICCV2023,
title = {Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations},
author = {Mohammadreza Salehi and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2308.11796
https://github.com/SMSD75/Timetuning},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves the representation quality not only for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves the representation quality not only for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. |
| Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G M Snoek, Marcel Worring: Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models. In: MICCAI, 2023, (Oral presentation). @inproceedings{SonsbeekMICCAI2023,
title = {Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models},
author = {Tom van Sonsbeek and Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Cees G M Snoek and Marcel Worring},
url = {https://arxiv.org/abs/2303.05977},
year = {2023},
date = {2023-06-24},
urldate = {2023-06-24},
booktitle = {MICCAI},
abstract = {Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely, Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely, Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient. |
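The visual-to-token mapping amounts to projecting a frozen image feature into a handful of "visual tokens" that are prepended to the question embeddings before they enter the language model. A minimal PyTorch sketch with assumed dimensions follows; VisualPrefixMapper and its parameters are illustrative names, not the authors' implementation.

import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    def __init__(self, visual_dim=768, lm_dim=1024, n_prefix=8):
        super().__init__()
        self.n_prefix = n_prefix
        self.lm_dim = lm_dim
        self.mapper = nn.Sequential(nn.Linear(visual_dim, lm_dim * n_prefix), nn.Tanh())

    def forward(self, image_feat, question_emb):
        # image_feat: (B, visual_dim); question_emb: (B, L, lm_dim) token embeddings of the question
        prefix = self.mapper(image_feat).view(-1, self.n_prefix, self.lm_dim)
        return torch.cat([prefix, question_emb], dim=1)   # pass to the language model as input embeddings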
| Tao Hu, William Thong, Pascal Mettes, Cees G M Snoek: Query by Activity Video in the Wild. In: ICIP, 2023. @inproceedings{HuICIP2023,
title = {Query by Activity Video in the Wild},
author = {Tao Hu and William Thong and Pascal Mettes and Cees G M Snoek},
year = {2023},
date = {2023-06-21},
urldate = {2023-06-21},
booktitle = {ICIP},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
|
| Haoliang Sun, Xiankai Lu, Haochen Wang, Yilong Yin, Xiantong Zhen, Cees G M Snoek, Ling Shao: Attentional Prototype Inference for Few-Shot Segmentation. In: Pattern Recognition, vol. 142, 2023. @article{SunPR23,
title = {Attentional Prototype Inference for Few-Shot Segmentation},
author = {Haoliang Sun and Xiankai Lu and Haochen Wang and Yilong Yin and Xiantong Zhen and Cees G M Snoek and Ling Shao},
url = {https://arxiv.org/abs/2105.06668},
year = {2023},
date = {2023-05-29},
urldate = {2021-04-30},
journal = {Pattern Recognition},
volume = {142},
abstract = {This paper aims to address few-shot segmentation. While existing prototype-based methods have achieved considerable success, they suffer from uncertainty and ambiguity caused by limited labeled examples. In this work, we propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation. We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution. The probabilistic modeling of the prototype enhances the model’s generalization ability by handling the inherent uncertainty caused by limited data and intra-class variations of objects. To further enhance the model, we introduce a local latent variable to represent the attention map of each query image, which enables the model to attend to foreground objects while suppressing the background. The optimization of the proposed model is formulated as a variational Bayesian inference problem, which is established by amortized inference networks. We conduct extensive experiments on four benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art prototype-based methods. We also provide comprehensive analyses and ablation studies to gain insight into the effectiveness of our method for few-shot segmentation.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This paper aims to address few-shot segmentation. While existing prototype-based methods have achieved considerable success, they suffer from uncertainty and ambiguity caused by limited labeled examples. In this work, we propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation. We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution. The probabilistic modeling of the prototype enhances the model’s generalization ability by handling the inherent uncertainty caused by limited data and intra-class variations of objects. To further enhance the model, we introduce a local latent variable to represent the attention map of each query image, which enables the model to attend to foreground objects while suppressing the background. The optimization of the proposed model is formulated as a variational Bayesian inference problem, which is established by amortized inference networks. We conduct extensive experiments on four benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art prototype-based methods. We also provide comprehensive analyses and ablation studies to gain insight into the effectiveness of our method for few-shot segmentation. |
| Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G M Snoek: EMO: Episodic Memory Optimization for Few-Shot Meta-Learning. In: CoLLAs, 2023, (Oral presentation, top 12 papers.). @inproceedings{DuColla2023,
title = {EMO: Episodic Memory Optimization for Few-Shot Meta-Learning},
author = {Yingjun Du and Jiayi Shen and Xiantong Zhen and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.05189},
year = {2023},
date = {2023-05-16},
urldate = {2023-05-16},
booktitle = {CoLLAs},
abstract = {For few-shot meta-learning, gradient descent optimization is challenging due to the limited number of training samples per task. Inspired by the human ability to recall past learning experiences from the brain's memory, we propose an episodic memory optimization for meta learning, which we call EMO, that retains the gradient history of past experienced tasks in external memory. It enables few-shot learning in a memory-augmented way by leveraging the meta-learning setting and learns to retain and recall the learning process of past training tasks for gradient descent optimization. By doing so, EMO nudges the parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative. Additionally, we prove theoretically that our algorithm converges for smooth, strongly convex objectives. EMO is generic, flexible, and model-agnostic, making it a simple plug-and-play optimizer seamlessly embedded into existing optimization-based meta-learning approaches. Empirically, EMO scales well with most of the few-shot classification benchmarks, and our experiments show that the optimization-based meta-learning method enjoys accelerated convergence and improved performance with EMO. },
note = {Oral presentation, top 12 papers.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
For few-shot meta-learning, gradient descent optimization is challenging due to the limited number of training samples per task. Inspired by the human ability to recall past learning experiences from the brain's memory, we propose an episodic memory optimization for meta learning, which we call EMO, that retains the gradient history of past experienced tasks in external memory. It enables few-shot learning in a memory-augmented way by leveraging the meta-learning setting and learns to retain and recall the learning process of past training tasks for gradient descent optimization. By doing so, EMO nudges the parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative. Additionally, we prove theoretically that our algorithm converges for smooth, strongly convex objectives. EMO is generic, flexible, and model-agnostic, making it a simple plug-and-play optimizer seamlessly embedded into existing optimization-based meta-learning approaches. Empirically, EMO scales well with most of the few-shot classification benchmarks, and our experiments show that the optimization-based meta-learning method enjoys accelerated convergence and improved performance with EMO. |
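The external gradient memory can be pictured as a small buffer of past task gradients whose aggregate is blended with the current gradient before each parameter update. The sketch below is illustrative only and does not reproduce the paper's exact update rule or its convergence guarantees; class and argument names are assumptions.

import torch

class EpisodicGradientMemory:
    def __init__(self, capacity=32, mix=0.5):
        self.buffer, self.capacity, self.mix = [], capacity, mix

    def step(self, params, lr=0.01):
        # call after loss.backward(); params is a list of tensors with .grad populated
        current = [p.grad.detach().clone() for p in params]
        grads = current
        if self.buffer:
            memory = [torch.stack(past).mean(dim=0) for past in zip(*self.buffer)]
            grads = [(1 - self.mix) * g + self.mix * m for g, m in zip(current, memory)]
        self.buffer = (self.buffer + [current])[-self.capacity:]   # retain the most recent task gradients
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * g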
| Wenfang Sun, Yingjun Du, Xiantong Zhen, Fan Wang, Ling Wang, Cees G M Snoek: MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks. In: ICML, 2023. @inproceedings{SunICML2023,
title = {MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks},
author = {Wenfang Sun and Yingjun Du and Xiantong Zhen and Fan Wang and Ling Wang and Cees G M Snoek},
url = {https://arxiv.org/abs/2305.10309
https://github.com/lmsdss/MetaModulation},
year = {2023},
date = {2023-04-25},
urldate = {2023-04-25},
booktitle = {ICML},
abstract = {Meta-learning algorithms are able to learn a new task using previously learned knowledge, but they often require a large number of meta-training tasks which may not be readily available. To address this issue, we propose a method for few-shot learning with fewer tasks, which we call MetaModulation. The key idea is to use a neural network to increase the density of the meta-training tasks by modulating batch normalization parameters during meta-training. Additionally, we modify parameters at various network levels, rather than just a single layer, to increase task diversity. To account for the uncertainty caused by the limited training tasks, we propose a variational MetaModulation where the modulation parameters are treated as latent variables. We also introduce learning variational feature hierarchies by the variational MetaModulation, which modulates features at all layers and can consider task uncertainty and generate more diverse tasks. The ablation studies illustrate the advantages of utilizing a learnable task modulation at different levels and demonstrate the benefit of incorporating probabilistic variants in few-task meta-learning. Our MetaModulation and its variational variants consistently outperform state-of-the-art alternatives on four few-task meta-learning benchmarks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Meta-learning algorithms are able to learn a new task using previously learned knowledge, but they often require a large number of meta-training tasks which may not be readily available. To address this issue, we propose a method for few-shot learning with fewer tasks, which we call MetaModulation. The key idea is to use a neural network to increase the density of the meta-training tasks by modulating batch normalization parameters during meta-training. Additionally, we modify parameters at various network levels, rather than just a single layer, to increase task diversity. To account for the uncertainty caused by the limited training tasks, we propose a variational MetaModulation where the modulation parameters are treated as latent variables. We also introduce learning variational feature hierarchies by the variational MetaModulation, which modulates features at all layers and can consider task uncertainty and generate more diverse tasks. The ablation studies illustrate the advantages of utilizing a learnable task modulation at different levels and demonstrate the benefit of incorporating probabilistic variants in few-task meta-learning. Our MetaModulation and its variational variants consistently outperform state-of-the-art alternatives on four few-task meta-learning benchmarks. |
| Yan Zhang, David W Zhang, Simon Lacoste-Julien, Gertjan J Burghouts, Cees G M Snoek: Unlocking Slot Attention by Changing Optimal Transport Costs. In: ICML, 2023. @inproceedings{ZhangICML2023,
title = {Unlocking Slot Attention by Changing Optimal Transport Costs},
author = {Yan Zhang and David W Zhang and Simon Lacoste-Julien and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2301.13197
https://github.com/davzha/MESH},
year = {2023},
date = {2023-04-24},
urldate = {2023-04-24},
booktitle = {ICML},
abstract = {Slot attention is a powerful method for object-centric modeling in images and videos. However, its set-equivariance limits its ability to handle videos with a dynamic number of objects because it cannot break ties. To overcome this limitation, we first establish a connection between slot attention and optimal transport. Based on this new perspective we propose MESH (Minimize Entropy of Sinkhorn): a cross-attention module that combines the tiebreaking properties of unregularized optimal transport with the speed of regularized optimal transport. We evaluate slot attention using MESH on multiple object-centric learning benchmarks and find significant improvements over slot attention in every setting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Slot attention is a powerful method for object-centric modeling in images and videos. However, its set-equivariance limits its ability to handle videos with a dynamic number of objects because it cannot break ties. To overcome this limitation, we first establish a connection between slot attention and optimal transport. Based on this new perspective we propose MESH (Minimize Entropy of Sinkhorn): a cross-attention module that combines the tiebreaking properties of unregularized optimal transport with the speed of regularized optimal transport. We evaluate slot attention using MESH on multiple object-centric learning benchmarks and find significant improvements over slot attention in every setting. |
| Shuo Chen, Yingjun Du, Pascal Mettes, Cees G M Snoek: Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation. In: ICMR, 2023, (Oral presentation.). @inproceedings{ChenICMR2023,
title = {Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation},
author = {Shuo Chen and Yingjun Du and Pascal Mettes and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.10122},
year = {2023},
date = {2023-04-03},
urldate = {2023-04-03},
booktitle = {ICMR},
abstract = {This paper investigates the problem of scene graph generation in videos, where the goal is to capture semantic relations between subjects and objects in the form of subject, predicate, object triplets. Recognizing the predicate between subject and object pairs is imbalanced and multi-label in nature, ranging from ubiquitous interactions such as spatial relationships e.g. in_front_of to rare interactions such as twisting. In popular benchmarks such as Action Genome and VidOR, the imbalance ratio between most and least frequent predicates is 3218 and 3408, respectively, far higher even than benchmarks specifically designed to address long-tailed recognition. Due to these long-tailed distributions and label co-occurrences, recent state-of-the-art methods rely heavily on the most often occurring predicate classes, ignoring predicate classes in the long tail. In this paper, we analyze the limitations of current approaches for scene graph generation in videos and find a one-to-one correspondence between predicate frequency and recall performance. To make the step towards unbiased scene graph generation in videos, we introduce a multi-label meta-learning framework to deal with the biased predicate distribution. Our meta-learning framework learns a meta-weight network for each training sample over all possible label losses. We evaluate our approach on the Action Genome and VidOR benchmarks by building on two current state-of-the-art methods for each benchmark. The experiments confirm that our multi-label meta-weight network improves the performance for predicates in the long tail without hampering performance for head classes, resulting in better overall performance and favorable generalizability.},
note = {Oral presentation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper investigates the problem of scene graph generation in videos, where the goal is to capture semantic relations between subjects and objects in the form of subject, predicate, object triplets. Recognizing the predicate between subject and object pairs is imbalanced and multi-label in nature, ranging from ubiquitous interactions such as spatial relationships e.g. in_front_of to rare interactions such as twisting. In popular benchmarks such as Action Genome and VidOR, the imbalance ratio between most and least frequent predicates is 3218 and 3408, respectively, far higher even than benchmarks specifically designed to address long-tailed recognition. Due to these long-tailed distributions and label co-occurrences, recent state-of-the-art methods rely heavily on the most often occurring predicate classes, ignoring predicate classes in the long tail. In this paper, we analyze the limitations of current approaches for scene graph generation in videos and find a one-to-one correspondence between predicate frequency and recall performance. To make the step towards unbiased scene graph generation in videos, we introduce a multi-label meta-learning framework to deal with the biased predicate distribution. Our meta-learning framework learns a meta-weight network for each training sample over all possible label losses. We evaluate our approach on the Action Genome and VidOR benchmarks by building on two current state-of-the-art methods for each benchmark. The experiments confirm that our multi-label meta-weight network improves the performance for predicates in the long tail without hampering performance for head classes, resulting in better overall performance and favorable generalizability. |
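A meta-weight network in this setting is a tiny MLP that maps each label's loss value to a weight used to re-balance a multi-label objective. The sketch below shows only the forward re-weighting; in the actual method the weight network is trained with a bi-level meta objective, which is omitted here, and the names are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaWeightNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, logits, targets):
        # logits, targets: (batch, n_predicates) for multi-label predicate classification
        per_label = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        weights = self.net(per_label.detach().unsqueeze(-1)).squeeze(-1)   # one weight per sample and label
        return (weights * per_label).mean()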
| Vincent Tao Hu, David W Zhang, Yuki M Asano, Gertjan J Burghouts, Cees G M Snoek: Self-Guided Diffusion Models. In: CVPR, 2023. @inproceedings{HuCVPR2023,
title = {Self-Guided Diffusion Models},
author = {Vincent Tao Hu and David W Zhang and Yuki M Asano and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2210.06462
http://taohu.me/sgdm/},
year = {2023},
date = {2023-02-28},
urldate = {2023-02-28},
booktitle = {CVPR},
abstract = {Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large number of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale.},
howpublished = {arXiv:2210.06462},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large number of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale. |
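An illustrative sketch of how self-derived pseudo-labels could steer sampling in a classifier-free-guidance style. Here `model` is an assumed denoiser with the signature `model(x_t, t, cond=...)`, and `pseudo_label` would come from a self-supervised feature extractor plus clustering rather than human annotation; the signature and guidance form are assumptions, not the released code.
```python
import torch

@torch.no_grad()
def self_guided_eps(model, x_t, t, pseudo_label, guidance_scale: float = 3.0):
    eps_uncond = model(x_t, t, cond=None)          # unconditional noise prediction
    eps_cond = model(x_t, t, cond=pseudo_label)    # pseudo-label-conditioned prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```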
| Piyush Bagad, Makarand Tapaswi, Cees G M Snoek: Test of Time: Instilling Video-Language Models with a Sense of Time. In: CVPR, 2023. @inproceedings{BagadCVPR2023,
title = {Test of Time: Instilling Video-Language Models with a Sense of Time},
author = {Piyush Bagad and Makarand Tapaswi and Cees G M Snoek},
url = {https://arxiv.org/abs/2301.02074
https://bpiyush.github.io/testoftime-website/
https://github.com/bpiyush/TestOfTime},
year = {2023},
date = {2023-02-28},
urldate = {2023-02-28},
booktitle = {CVPR},
abstract = {Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.},
howpublished = {arXiv:2301.02074},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch. |
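A minimal sketch of a before/after time-order objective of the kind discussed above, assuming video and caption embeddings are already produced by a video-language model; the exact loss form, temperature, and the omission of the symmetric text-to-video direction are assumptions.
```python
import torch
import torch.nn.functional as F

def time_order_loss(video_emb, text_correct, text_swapped, temperature: float = 0.07):
    """Prefer the caption with the correct before/after order over the order-swapped one.
    All inputs: (batch, dim)."""
    v = F.normalize(video_emb, dim=-1)
    pos = (v * F.normalize(text_correct, dim=-1)).sum(-1) / temperature
    neg = (v * F.normalize(text_swapped, dim=-1)).sum(-1) / temperature
    logits = torch.stack([pos, neg], dim=-1)                                 # (batch, 2)
    labels = torch.zeros(v.size(0), dtype=torch.long, device=v.device)       # index 0 = correct order
    return F.cross_entropy(logits, labels)
```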
| Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G M Snoek: SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail. In: CVPR, 2023. @inproceedings{DuCVPR2023,
title = {SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail},
author = {Yingjun Du and Jiayi Shen and Xiantong Zhen and Cees G M Snoek},
url = {https://arxiv.org/abs/2304.00101},
year = {2023},
date = {2023-02-28},
urldate = {2023-02-28},
booktitle = {CVPR},
abstract = {Modern image classifiers perform well on populated classes, while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Modern image classifiers perform well on populated classes, while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions. |
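A compact sketch of rectifying image features with super-class prototypes via similarity-weighted message passing, as outlined above; the shapes, residual mixing step, and temperature are assumptions for illustration.
```python
import torch
import torch.nn.functional as F

def rectify_with_superclasses(features, prototypes, temperature: float = 0.1):
    """features: (batch, d); prototypes: (num_super_classes, d)."""
    sim = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).T
    attn = F.softmax(sim / temperature, dim=-1)   # attend to the most relevant super-classes
    message = attn @ prototypes                   # aggregate prototype information
    return features + message                     # rectified, more discriminative representation
```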
| Hossein Mirzaei, Mohammadreza Salehi, Sajjad Shahabi, Efstratios Gavves, Cees G M Snoek, Mohammad Sabokrou, Mohammad Hossein Rohban: Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection. In: ICLR, 2023. @inproceedings{MirzaeiICLR2023,
title = {Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection},
author = {Hossein Mirzaei and Mohammadreza Salehi and Sajjad Shahabi and Efstratios Gavves and Cees G M Snoek and Mohammad Sabokrou and Mohammad Hossein Rohban},
url = {https://arxiv.org/abs/2205.14297},
year = {2023},
date = {2023-01-21},
urldate = {2023-01-21},
booktitle = {ICLR},
abstract = {We aim for image-based novelty detection. Despite considerable progress, existing models either fail or face a dramatic drop under the so-called "near-distribution" setting, where the differences between normal and anomalous samples are subtle. We first demonstrate that existing methods experience up to a 20% decrease in performance in the near-distribution setting. Next, we propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data. Our model is then fine-tuned to distinguish such data from the normal samples. We provide a quantitative as well as qualitative evaluation of this strategy, and compare the results with a variety of GAN-based models. Effectiveness of our method for both the near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control. This reveals that our method considerably improves over existing models, and consistently decreases the gap between the near-distribution and standard novelty detection performance. The code repository is available at https://github.com/rohban-lab/FITYMI.},
howpublished = {arXiv:2205.14297},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We aim for image-based novelty detection. Despite considerable progress, existing models either fail or face a dramatic drop under the so-called "near-distribution" setting, where the differences between normal and anomalous samples are subtle. We first demonstrate that existing methods experience up to a 20% decrease in performance in the near-distribution setting. Next, we propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data. Our model is then fine-tuned to distinguish such data from the normal samples. We provide a quantitative as well as qualitative evaluation of this strategy, and compare the results with a variety of GAN-based models. Effectiveness of our method for both the near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control. This reveals that our method considerably improves over existing models, and consistently decreases the gap between the near-distribution and standard novelty detection performance. The code repository is available at https://github.com/rohban-lab/FITYMI. |
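A minimal sketch of the fine-tuning stage described above, assuming the synthetic near-distribution anomalies have already been sampled from a pretrained score-based generator; the binary-classification setup and training-loop details are assumptions, not the released implementation.
```python
import torch
import torch.nn.functional as F

def finetune_step(classifier, optimizer, normal_batch, synth_anomaly_batch):
    x = torch.cat([normal_batch, synth_anomaly_batch], dim=0)
    y = torch.cat([
        torch.zeros(len(normal_batch), dtype=torch.long, device=x.device),        # 0 = normal
        torch.ones(len(synth_anomaly_batch), dtype=torch.long, device=x.device),  # 1 = near-distribution anomaly
    ])
    loss = F.cross_entropy(classifier(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```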
| Zehao Xiao, Xiantong Zhen, Shengcai Liao, Cees G M Snoek: Energy-Based Test Sample Adaptation for Domain Generalization. In: ICLR, 2023. @inproceedings{XiaoICLR2023,
title = {Energy-Based Test Sample Adaptation for Domain Generalization},
author = {Zehao Xiao and Xiantong Zhen and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2302.11215
https://github.com/zzzx1224/EBTSA-ICLR2023},
year = {2023},
date = {2023-01-21},
urldate = {2023-01-21},
booktitle = {ICLR},
abstract = {In this paper, we propose energy-based sample adaptation at test time for domain generalization. Where previous works adapt their models to target domains, we adapt the unseen target samples to source-trained models. To this end, we design a discriminative energy-based model, which is trained on source domains to jointly model the conditional distribution for classification and data distribution for sample adaptation. The model is optimized to simultaneously learn a classifier and an energy function. To adapt target samples to source distributions, we iteratively update the samples by energy minimization with stochastic gradient Langevin dynamics. Moreover, to preserve the categorical information in the sample during adaptation, we introduce a categorical latent variable into the energy-based model. The latent variable is learned from the original sample before adaptation by variational inference and fixed as a condition to guide the sample update. Experiments on six benchmarks for classification of images and microblog threads demonstrate the effectiveness of our proposal. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we propose energy-based sample adaptation at test time for domain generalization. Where previous works adapt their models to target domains, we adapt the unseen target samples to source-trained models. To this end, we design a discriminative energy-based model, which is trained on source domains to jointly model the conditional distribution for classification and data distribution for sample adaptation. The model is optimized to simultaneously learn a classifier and an energy function. To adapt target samples to source distributions, we iteratively update the samples by energy minimization with stochastic gradient Langevin dynamics. Moreover, to preserve the categorical information in the sample during adaptation, we introduce a categorical latent variable into the energy-based model. The latent variable is learned from the original sample before adaptation by variational inference and fixed as a condition to guide the sample update. Experiments on six benchmarks for classification of images and microblog threads demonstrate the effectiveness of our proposal. |
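A minimal sketch of adapting a target sample by energy minimization with stochastic gradient Langevin dynamics, as described above. Here `energy_fn(x, z)` stands in for a trained energy function conditioned on the categorical latent `z` inferred from the original sample; the step count, step size, and update form are illustrative assumptions.
```python
import torch

def adapt_sample(x, z, energy_fn, steps: int = 20, step_size: float = 1e-2):
    x = x.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(x, z).sum()
        grad, = torch.autograd.grad(energy, x)
        noise = torch.randn_like(x) * (2 * step_size) ** 0.5   # Langevin noise term
        x = (x - step_size * grad + noise).detach().requires_grad_(True)
    return x.detach()
```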
| Wim Bernasco, Evelien Hoeben, Dennis Koelma, Lasse Suonperä Liebst, Josephine Thomas, Joska Appelman, Cees Snoek, Marie Rosenkrantz Lindegaard: Promise Into Practice: Application of Computer Vision in Empirical Research on Social Distancing. In: Sociological Methods and Research, vol. 52, iss. 3, pp. 1239–1287, 2023. @article{BernascoSMR2023,
title = {Promise Into Practice: Application of Computer Vision in Empirical Research on Social Distancing},
author = {Wim Bernasco and Evelien Hoeben and Dennis Koelma and Lasse Suonperä Liebst and Josephine Thomas and Joska Appelman and Cees Snoek and Marie Rosenkrantz Lindegaard},
url = {https://osf.io/ex9fy/},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {Sociological Methods and Research},
volume = {52},
issue = {3},
pages = {1239–1287},
abstract = {Social scientists increasingly use video data, but large-scale analysis of its content is often constrained by scarce manual coding resources. Upscaling may be possible with the application of automated coding procedures, which are being developed in the field of computer vision. Here, we introduce computer vision to social scientists, review the state-of-the-art in relevant subfields, and provide a working example of how computer vision can be applied in empirical sociological work. Our application involves defining a ground truth by human coders, developing an algorithm for automated coding, testing the performance of the algorithm against the ground truth, and running the algorithm on a large-scale dataset of CCTV images. The working example concerns monitoring social distancing behavior in public space over more than a year of the COVID-19 pandemic. Finally, we discuss prospects for the use of computer vision in empirical social science research and address technical and ethical limitations.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Social scientists increasingly use video data, but large-scale analysis of its content is often constrained by scarce manual coding resources. Upscaling may be possible with the application of automated coding procedures, which are being developed in the field of computer vision. Here, we introduce computer vision to social scientists, review the state-of-the-art in relevant subfields, and provide a working example of how computer vision can be applied in empirical sociological work. Our application involves defining a ground truth by human coders, developing an algorithm for automated coding, testing the performance of the algorithm against the ground truth, and running the algorithm on a large-scale dataset of CCTV images. The working example concerns monitoring social distancing behavior in public space over more than a year of the COVID-19 pandemic. Finally, we discuss prospects for the use of computer vision in empirical social science research and address technical and ethical limitations. |
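For illustration only, a small sketch of the kind of measurement such a pipeline could produce, assuming person detections have already been projected to ground-plane coordinates in metres; the 1.5 m threshold and the projection step are assumptions, not details from the paper.
```python
import numpy as np

def count_violations(positions_m: np.ndarray, threshold_m: float = 1.5) -> int:
    """positions_m: (num_people, 2) array of (x, y) positions in metres."""
    n = len(positions_m)
    if n < 2:
        return 0
    diffs = positions_m[:, None, :] - positions_m[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(n, k=1)                 # count each pair once
    return int((dists[iu] < threshold_m).sum())
```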
2022
|
| David W Zhang, Gertjan J Burghouts, Cees G M Snoek: Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets. In: LoG, 2022. @inproceedings{ZhangLOG2022,
title = {Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets},
author = {David W Zhang and Gertjan J Burghouts and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/zhang-hypergraphs-log2022.pdf
https://github.com/davzha/recurrently_predicting_hypergraphs},
year = {2022},
date = {2022-12-09},
urldate = {2022-12-09},
booktitle = {LoG},
abstract = {This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the positive edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the positive edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets. |
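A sketch of the constant-memory iterative refinement idea: only the final refinement step receives gradients, while earlier steps run without tracking them. `refine(entities, incidence)` is an assumed update module returning a refined incidence prediction; this illustrates the training trick, not the authors' code.
```python
import torch

def refine_hypergraph(refine, entities, incidence_init, num_iters: int = 8):
    incidence = incidence_init
    for _ in range(num_iters - 1):
        with torch.no_grad():                    # these iterations are skipped in the backward pass
            incidence = refine(entities, incidence)
    return refine(entities, incidence.detach())  # only the last iteration is backpropagated
```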
| Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees G. M. Snoek: Variational Model Perturbation for Source-Free Domain Adaptation. In: NeurIPS, 2022. @inproceedings{JingNeurIPS2022,
title = {Variational Model Perturbation for Source-Free Domain Adaptation},
author = {Mengmeng Jing and Xiantong Zhen and Jingjing Li and Cees G. M. Snoek},
url = {https://github.com/mmjing/Variational_Model_Perturbation
https://arxiv.org/abs/2210.10378},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS},
abstract = {We aim for source-free domain adaptation, where the task is to deploy a model pre-trained on source domains to target domains. The challenges stem from the distribution shift from the source to the target domain, coupled with the unavailability of any source data and labeled target data for optimization. Rather than fine-tuning the model by updating the parameters, we propose to perturb the source model to achieve adaptation to target domains. We introduce perturbations into the model parameters by variational Bayesian inference in a probabilistic framework. By doing so, we can effectively adapt the model to the target domain while largely preserving the discriminative ability. Importantly, we demonstrate the theoretical connection to learning Bayesian neural networks, which proves the generalizability of the perturbed model to target domains. To enable more efficient optimization, we further employ a parameter sharing strategy, which substantially reduces the learnable parameters compared to a fully Bayesian neural network. Our model perturbation provides a new probabilistic way for domain adaptation which enables efficient adaptation to target domains while maximally preserving knowledge in source models. Experiments on several source-free benchmarks under three different evaluation settings verify the effectiveness of the proposed variational model perturbation for source-free domain adaptation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We aim for source-free domain adaptation, where the task is to deploy a model pre-trained on source domains to target domains. The challenges stem from the distribution shift from the source to the target domain, coupled with the unavailability of any source data and labeled target data for optimization. Rather than fine-tuning the model by updating the parameters, we propose to perturb the source model to achieve adaptation to target domains. We introduce perturbations into the model parameters by variational Bayesian inference in a probabilistic framework. By doing so, we can effectively adapt the model to the target domain while largely preserving the discriminative ability. Importantly, we demonstrate the theoretical connection to learning Bayesian neural networks, which proves the generalizability of the perturbed model to target domains. To enable more efficient optimization, we further employ a parameter sharing strategy, which substantially reduces the learnable parameters compared to a fully Bayesian neural network. Our model perturbation provides a new probabilistic way for domain adaptation which enables efficient adaptation to target domains while maximally preserving knowledge in source models. Experiments on several source-free benchmarks under three different evaluation settings verify the effectiveness of the proposed variational model perturbation for source-free domain adaptation. |
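A hypothetical sketch of the perturbation idea: frozen source weights are perturbed with Gaussian noise whose scale is a single learnable, layer-shared parameter (reparameterization trick). The module structure, initial scale, and the omission of the KL term tying the perturbation to a prior are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbedLinear(nn.Module):
    def __init__(self, source_linear: nn.Linear):
        super().__init__()
        self.register_buffer("weight", source_linear.weight.detach().clone())  # frozen source weights
        bias = source_linear.bias.detach().clone() if source_linear.bias is not None else None
        self.register_buffer("bias", bias)
        self.log_sigma = nn.Parameter(torch.tensor(-4.0))  # one shared perturbation scale per layer

    def forward(self, x):
        eps = torch.randn_like(self.weight)
        w = self.weight + self.log_sigma.exp() * eps        # sampled perturbed weights
        return F.linear(x, w, self.bias)
```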
| Jiayi Shen, Zehao Xiao, Xiantong Zhen, Cees G. M. Snoek, Marcel Worring: Association Graph Learning for Multi-Task Classification with Category Shifts. In: NeurIPS, 2022. @inproceedings{ShenNeurIPS2022,
title = {Association Graph Learning for Multi-Task Classification with Category Shifts},
author = {Jiayi Shen and Zehao Xiao and Xiantong Zhen and Cees G. M. Snoek and Marcel Worring},
url = {https://arxiv.org/abs/2210.04637
https://github.com/autumn9999/MTC-with-Category-Shifts.git},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS},
abstract = {In this paper, we focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously. In particular, we tackle a new setting, which is more realistic than currently addressed in the literature, where categories shift from training to test data. Hence, individual tasks do not contain complete training data for the categories in the test set. To generalize to such test data, it is crucial for individual tasks to leverage knowledge from related tasks. To this end, we propose learning an association graph to transfer knowledge among tasks for missing classes. We construct the association graph with nodes representing tasks, classes and instances, and encode the relationships among the nodes in the edges to guide their mutual knowledge transfer. By message passing on the association graph, our model enhances the categorical information of each instance, making it more discriminative. To avoid spurious correlations between task and class nodes in the graph, we introduce an assignment entropy maximization that encourages each class node to balance its edge weights. This enables all tasks to fully utilize the categorical information from related tasks. An extensive evaluation on three general benchmarks and a medical dataset for skin lesion classification reveals that our method consistently performs better than representative baselines.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously. In particular, we tackle a new setting, which is more realistic than currently addressed in the literature, where categories shift from training to test data. Hence, individual tasks do not contain complete training data for the categories in the test set. To generalize to such test data, it is crucial for individual tasks to leverage knowledge from related tasks. To this end, we propose learning an association graph to transfer knowledge among tasks for missing classes. We construct the association graph with nodes representing tasks, classes and instances, and encode the relationships among the nodes in the edges to guide their mutual knowledge transfer. By message passing on the association graph, our model enhances the categorical information of each instance, making it more discriminative. To avoid spurious correlations between task and class nodes in the graph, we introduce an assignment entropy maximization that encourages each class node to balance its edge weights. This enables all tasks to fully utilize the categorical information from related tasks. An extensive evaluation on three general benchmarks and a medical dataset for skin lesion classification reveals that our method consistently performs better than representative baselines. |
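A small sketch of the assignment-entropy regularizer mentioned above: it pushes each class node to spread its edge weights across tasks instead of collapsing onto a single task. The logit parameterization of the edges is an assumption for illustration.
```python
import torch

def assignment_entropy_reg(edge_logits: torch.Tensor) -> torch.Tensor:
    """edge_logits: (num_classes, num_tasks) unnormalized class-to-task edge weights."""
    p = torch.softmax(edge_logits, dim=-1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1)  # per-class entropy over tasks
    return -entropy.mean()                             # add to the loss: minimizing this maximizes entropy
```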
| Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, Cees G M Snoek: How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?. In: ECCV, 2022. @inproceedings{ThokerECCV2022,
title = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
author = {Fida Mohammad Thoker and Hazel Doughty and Piyush Bagad and Cees G M Snoek},
url = {https://arxiv.org/abs/2203.14221
https://bpiyush.github.io/SEVERE-website/
https://github.com/fmthoker/SEVERE-BENCHMARK},
year = {2022},
date = {2022-10-24},
urldate = {2022-10-24},
booktitle = {ECCV},
abstract = {Despite the recent success of video self-supervised learning, there is much still to be understood about their generalization capability. In this paper, we investigate how sensitive video self-supervised learning is to the currently used benchmark convention and whether methods generalize beyond the canonical evaluation setting. We do this across four different factors of sensitivity: domain, samples, actions and task. Our comprehensive set of over 500 experiments, which encompasses 7 video datasets, 9 self-supervised methods and 6 video understanding tasks, reveals that current benchmarks in video self-supervised learning are not a good indicator of generalization along these sensitivity factors. Further, we find that self-supervised methods considerably lag behind vanilla supervised pre-training, especially when domain shift is large and the number of available downstream samples is low. From our analysis we distill the SEVERE-benchmark, a subset of our experiments, and discuss its implication for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Despite the recent success of video self-supervised learning, there is much still to be understood about their generalization capability. In this paper, we investigate how sensitive video self-supervised learning is to the currently used benchmark convention and whether methods generalize beyond the canonical evaluation setting. We do this across four different factors of sensitivity: domain, samples, actions and task. Our comprehensive set of over 500 experiments, which encompasses 7 video datasets, 9 self-supervised methods and 6 video understanding tasks, reveals that current benchmarks in video self-supervised learning are not a good indicator of generalization along these sensitivity factors. Further, we find that self-supervised methods considerably lag behind vanilla supervised pre-training, especially when domain shift is large and the number of available downstream samples is low. From our analysis we distill the SEVERE-benchmark, a subset of our experiments, and discuss its implication for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods. |
| Pengwan Yang, Yuki M Asano, Pascal Mettes, Cees G M Snoek: Less than Few: Self-Shot Video Instance Segmentation. In: ECCV, 2022. @inproceedings{YangECCV22,
title = {Less than Few: Self-Shot Video Instance Segmentation},
author = {Pengwan Yang and Yuki M Asano and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/yang-selfshot-eccv2022.pdf
https://github.com/PengWan-Yang/self-shot},
year = {2022},
date = {2022-10-24},
urldate = {2022-10-24},
booktitle = {ECCV},
abstract = {The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle to provide a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel level across the spatial and temporal domains. We provide strong baseline performances that utilize a novel transformer-based model and show that self-shot learning can even surpass few-shot and can be positively combined for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive with oracle support in some settings, scales to large unlabelled video collections, and can be combined in a semi-supervised setting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle to provide a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel level across the spatial and temporal domains. We provide strong baseline performances that utilize a novel transformer-based model and show that self-shot learning can even surpass few-shot and can be positively combined for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive with oracle support in some settings, scales to large unlabelled video collections, and can be combined in a semi-supervised setting. |
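A minimal sketch of the retrieval step implied above: given a query embedding, select the top-k nearest unlabelled videos in a self-supervised embedding space as support. The cosine-similarity choice, the value of k, and the names are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def retrieve_support(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """query_emb: (d,); gallery_embs: (num_videos, d). Returns indices and similarities."""
    sims = F.normalize(gallery_embs, dim=-1) @ F.normalize(query_emb, dim=0)
    topk = torch.topk(sims, k=min(k, gallery_embs.size(0)))
    return topk.indices, topk.values
```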