2025
Aritra Bhowmik, Pascal Mettes, Martin R Oswald, Cees G M Snoek: Union-over-Intersections: Object Detection beyond Winner-Takes-All. In: ICLR, 2025. @inproceedings{BhowmikICLR2025,
title = {Union-over-Intersections: Object Detection beyond Winner-Takes-All},
author = {Aritra Bhowmik and Pascal Mettes and Martin R Oswald and Cees G M Snoek},
url = {https://arxiv.org/abs/2311.18512},
year = {2025},
date = {2025-04-24},
urldate = {2024-12-16},
booktitle = {ICLR},
abstract = {This paper revisits the problem of predicting box locations in object detection architectures. Typically, each box proposal or box query aims to directly maximize the intersection-over-union score with the ground truth, followed by a winner-takes-all non-maximum suppression where only the highest scoring box in each region is retained. We observe that both steps are sub-optimal: the first involves regressing proposals to the entire ground truth, which is a difficult task even with large receptive fields, and the second neglects valuable information from boxes other than the top candidate. Instead of regressing proposals to the whole ground truth, we propose a simpler approach: regress only to the area of intersection between the proposal and the ground truth. This avoids the need for proposals to extrapolate beyond their visual scope, improving localization accuracy. Rather than adopting a winner-takes-all strategy, we take the union over the regressed intersections of all boxes in a region to generate the final box outputs. Our plug-and-play method integrates seamlessly into proposal-based, grid-based, and query-based detection architectures with minimal modifications, consistently improving object localization and instance segmentation. We demonstrate its broad applicability and versatility across various detection and segmentation tasks.},
howpublished = {arXiv:2311.18512},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
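A toy sketch of the box arithmetic described in the abstract above: each proposal regresses only to its overlap with the ground truth, and the final output is the union (enclosing box) of those regressed intersections. Axis-aligned [x1, y1, x2, y2] boxes and the example numbers are assumptions for illustration; this is not the authors' implementation.

    def intersection(box_a, box_b):
        """Intersection box of two axis-aligned boxes, or None if they do not overlap."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        return [x1, y1, x2, y2] if x1 < x2 and y1 < y2 else None

    def union(boxes):
        """Enclosing box of a set of boxes: the 'union over intersections'."""
        return [min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes)]

    # Training target: each proposal regresses only to its overlap with the ground truth.
    ground_truth = [10, 10, 100, 100]
    proposals = [[5, 5, 60, 60], [40, 40, 120, 110]]
    targets = [intersection(p, ground_truth) for p in proposals]

    # Inference: rather than keeping a single winner, combine all regressed
    # intersections assigned to the same object into one output box.
    print(union(targets))  # -> [10, 10, 100, 100]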
Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M Asano, Nanne van Noord, Marcel Worring, Cees G M Snoek: TULIP: Token-length Upgraded CLIP. In: ICLR, 2025. @inproceedings{NajdenkoskaICLR25,
title = {TULIP: Token-length Upgraded CLIP},
author = {Ivona Najdenkoska and Mohammad Mahdi Derakhshani and Yuki M Asano and Nanne van Noord and Marcel Worring and Cees G M Snoek},
url = {https://arxiv.org/abs/2410.10034},
year = {2025},
date = {2025-04-24},
urldate = {2024-10-13},
booktitle = {ICLR},
abstract = {We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performance on tasks requiring longer descriptions. Although recent work has attempted to overcome this limit, their proposed approaches struggle to model token relationships over longer distances and simply extend to a fixed new token length. Instead, we propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models. We do so by improving the architecture with relative position encodings, followed by a training procedure that (i) distills the original CLIP text encoder into an encoder with relative position encodings and (ii) enhances the model for aligning longer captions with images. By effectively encoding captions longer than the default 77 tokens, our model outperforms baselines on cross-modal tasks such as retrieval and text-to-image generation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
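A minimal sketch of the architectural ingredient highlighted in the abstract above: a relative position bias added to attention logits generalizes to any sequence length, unlike CLIP's fixed 77-entry absolute position table. The bias table and toy dimensions below are assumptions for illustration; this is not the TULIP model or its distillation procedure.

    import numpy as np

    def relative_position_bias(seq_len, bias_table, max_distance=32):
        """(seq_len x seq_len) bias indexed by clipped relative distance between tokens."""
        idx = np.arange(seq_len)
        rel = np.clip(idx[None, :] - idx[:, None], -max_distance, max_distance) + max_distance
        return bias_table[rel]  # works for any seq_len, unlike an absolute embedding table

    def attention_scores(q, k, bias_table):
        return q @ k.T / np.sqrt(q.shape[-1]) + relative_position_bias(len(q), bias_table)

    rng = np.random.default_rng(0)
    seq_len, dim = 128, 16                       # longer than CLIP's 77-token limit
    q, k = rng.normal(size=(seq_len, dim)), rng.normal(size=(seq_len, dim))
    bias_table = rng.normal(scale=0.02, size=2 * 32 + 1)
    print(attention_scores(q, k, bias_table).shape)   # (128, 128)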
Christina Sartzetaki, Gemma Roig, Cees G M Snoek, Iris I A Groen: One Hundred Neural Networks and Brains Watching Videos: Lessons from Alignment. In: ICLR, 2025. @inproceedings{SartzetakiICLR2025,
title = {One Hundred Neural Networks and Brains Watching Videos: Lessons from Alignment},
author = {Christina Sartzetaki and Gemma Roig and Cees G M Snoek and Iris I A Groen},
url = {https://www.biorxiv.org/content/10.1101/2024.12.05.626975v1},
year = {2025},
date = {2025-04-24},
urldate = {2024-12-09},
booktitle = {ICLR},
abstract = {What can we learn from comparing video models to human brains, arguably the most efficient and effective video processing systems in existence? Our work takes a step towards answering this question by performing the first large-scale benchmarking of deep video models on representational alignment to the human brain, using publicly available models and a recently released video brain imaging (fMRI) dataset. We disentangle four factors of variation in the models (temporal modeling, classification task, architecture, and training dataset) that affect alignment to the brain, which we measure by conducting Representational Similarity Analysis across multiple brain regions and model layers. We show that temporal modeling is key for alignment to brain regions involved in early visual processing, while a relevant classification task is key for alignment to higher-level regions. Moreover, we identify clear differences between the brain scoring patterns across layers of CNNs and Transformers, and reveal how training dataset biases transfer to alignment with functionally selective brain areas. Additionally, we uncover a negative correlation of computational complexity to brain alignment. Measuring a total of 99 neural networks and 10 human brains watching videos, we aim to forge a path that widens our understanding of temporal and semantic video representations in brains and machines, ideally leading towards more efficient video models and more mechanistic explanations of processing in the human brain.},
howpublished = {bioRxiv 2024.12.05.626975},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
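For readers unfamiliar with the alignment measure used above, Representational Similarity Analysis reduces to correlating two representational dissimilarity matrices computed over the same stimuli. The toy data and dimensions below are invented; this is not the paper's benchmarking pipeline.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def rdm(responses):
        """Condensed representational dissimilarity matrix (1 - Pearson r per video pair)."""
        return pdist(responses, metric="correlation")

    rng = np.random.default_rng(0)
    n_videos = 100
    model_layer = rng.normal(size=(n_videos, 512))    # activations of one model layer
    brain_region = rng.normal(size=(n_videos, 250))   # fMRI responses of one brain region

    alignment, _ = spearmanr(rdm(model_layer), rdm(brain_region))
    print(f"RSA score (Spearman rho): {alignment:.3f}")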
Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G M Snoek, Jan-Jakob Sonke, Efstratios Gavves: CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation. In: ICLR, 2025. @inproceedings{LiuICLR2025,
title = {CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation},
author = {Jie Liu and Pan Zhou and Yingjun Du and Ah-Hwee Tan and Cees G M Snoek and Jan-Jakob Sonke and Efstratios Gavves},
url = {https://arxiv.org/abs/2411.04679},
year = {2025},
date = {2025-04-24},
booktitle = {ICLR},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Cheems Wang, Cees G M Snoek: DynaPrompt: Dynamic Test-Time Prompt Tuning. In: ICLR, 2025. @inproceedings{XiaoICLR2025,
title = {DynaPrompt: Dynamic Test-Time Prompt Tuning},
author = {Zehao Xiao and Shilin Yan and Jack Hong and Jiayin Cai and Xiaolong Jiang and Yao Hu and Jiayi Shen and Cheems Wang and Cees G M Snoek},
year = {2025},
date = {2025-04-24},
booktitle = {ICLR},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Piyush Bagad, Makarand Tapaswi, Cees G M Snoek, Andrew Zisserman: The Sound of Water: Inferring Physical Properties from Pouring Liquids. In: ICASSP, 2025. @inproceedings{BagadICASSP2025,
title = {The Sound of Water: Inferring Physical Properties from Pouring Liquids},
author = {Piyush Bagad and Makarand Tapaswi and Cees G M Snoek and Andrew Zisserman},
url = {https://bpiyush.github.io/pouring-water-website/
https://huggingface.co/spaces/bpiyush/SoundOfWater
https://arxiv.org/abs/2411.11222},
year = {2025},
date = {2025-04-06},
urldate = {2024-11-18},
booktitle = {ICASSP},
abstract = {We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.},
howpublished = {arXiv:2411.11222},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
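The physics referenced in point (i) above can be sketched in a few lines: the air column above the liquid behaves roughly like a pipe closed at the liquid surface and open at the top, so the fundamental frequency rises as the container fills. The idealized cylindrical container below is an assumption for illustration, not the paper's learned pitch detector.

    SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C

    def fundamental_frequency(air_column_length_m):
        """Fundamental of a closed-open pipe: f = c / (4 * L)."""
        return SPEED_OF_SOUND / (4.0 * air_column_length_m)

    def liquid_level_from_pitch(container_height_m, pitch_hz):
        """Invert the relation: estimated level = height - c / (4 * f)."""
        return container_height_m - SPEED_OF_SOUND / (4.0 * pitch_hz)

    height = 0.20  # a 20 cm container
    for level in (0.02, 0.10, 0.18):
        pitch = fundamental_frequency(height - level)
        print(f"level {level:.2f} m -> pitch {pitch:.0f} Hz -> "
              f"recovered level {liquid_level_from_pitch(height, pitch):.2f} m")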
Sameer Ambekar, Zehao Xiao, Xiantong Zhen, Cees G M Snoek: GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts. In: WACV, 2025. @inproceedings{AmbekarWACV25,
title = {GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts},
author = {Sameer Ambekar and Zehao Xiao and Xiantong Zhen and Cees G M Snoek},
year = {2025},
date = {2025-03-01},
urldate = {2025-03-01},
booktitle = {WACV},
abstract = {We consider the problem of test-time domain generalization, where a model is trained on several source domains and adjusted on target domains never seen during training. Different from the common methods that fine-tune the model or adjust the classifier parameters online, we propose to generate multiple layer parameters on the fly during inference by a lightweight meta-learned transformer, which we call GeneralizeFormer. The layer-wise parameters are generated per target batch without fine-tuning or online adjustment. By doing so, our method is more effective in dynamic scenarios with multiple target distributions and also avoids forgetting valuable source distribution characteristics. Moreover, by considering layer-wise gradients, the proposed method adapts itself to various distribution shifts. To reduce the computational and time cost, we fix the convolutional parameters while only generating parameters of the Batch Normalization layers and the linear classifier. Experiments on six widely used domain generalization datasets demonstrate the benefits and abilities of the proposed method to efficiently handle various distribution shifts, generalize in dynamic scenarios, and avoid forgetting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
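A rough sketch of the parameter-generation idea in the abstract above: a lightweight generator maps statistics of the incoming target batch to new affine parameters for a BatchNorm layer, while convolutional weights stay frozen. The plain MLP generator and toy shapes below are stand-ins introduced here; the paper uses a meta-learned transformer, and this is not its implementation.

    import torch
    import torch.nn as nn

    class BNParamGenerator(nn.Module):
        """Maps a per-channel batch summary to new gamma/beta for one BatchNorm layer."""
        def __init__(self, num_channels):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(num_channels, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * num_channels))

        def forward(self, batch_summary):
            gamma, beta = self.net(batch_summary).chunk(2)
            return gamma, beta

    num_channels = 64
    bn = nn.BatchNorm2d(num_channels)
    generator = BNParamGenerator(num_channels)

    x = torch.randn(8, num_channels, 14, 14)   # one unlabeled target batch
    summary = x.mean(dim=(0, 2, 3))            # per-channel mean over the batch
    gamma, beta = generator(summary)
    with torch.no_grad():                      # plug the generated affine parameters in
        bn.weight.copy_(1.0 + gamma)
        bn.bias.copy_(beta)
    print(bn(x).shape)  # torch.Size([8, 64, 14, 14])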
Duy-Kien Nguyen, Martin R Oswald, Cees G M Snoek: SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation. In: Transactions on Machine Learning Research, 2025, (Pending minor revision). @article{NguyenTMLR2025,
title = {SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation},
author = {Duy-Kien Nguyen and Martin R Oswald and Cees G M Snoek},
url = {https://arxiv.org/abs/2310.05920},
year = {2025},
date = {2025-01-20},
urldate = {2023-10-09},
journal = {Transactions on Machine Learning Research},
abstract = {The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and/or pyramid design remain a key factor for their empirical success. In this paper, we show that this reliance on either feature pyramids or an hierarchical backbone is unnecessary and a transformer-based detector with scale-aware attention enables the plain detector `SimPLR' whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is plain and simple, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales much better with bigger capacity (self-supervised) models and more pre-training data, allowing us to report a consistently better accuracy and faster runtime for object detection, instance segmentation as well as panoptic segmentation. Code will be released.},
howpublished = {arXiv:2310.05920},
note = {Pending minor revision},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Huabin Liu, Filip Ilievski, Cees G M Snoek: Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning. arXiv:2501.05069, 2025. @unpublished{LiuArxiv2025,
title = {Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning},
author = {Huabin Liu and Filip Ilievski and Cees G M Snoek},
url = {https://arxiv.org/abs/2501.05069},
year = {2025},
date = {2025-01-09},
urldate = {2025-01-09},
abstract = {This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.},
howpublished = {arXiv:2501.05069},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
2024
Yingjun Du, Wenfang Sun, Cees G M Snoek: IPO: Interpretable Prompt Optimization for Vision-Language Models. In: NeurIPS, 2024. @inproceedings{DuNeurips2024,
title = {IPO: Interpretable Prompt Optimization for Vision-Language Models},
author = {Yingjun Du and Wenfang Sun and Cees G M Snoek},
url = {https://arxiv.org/abs/2410.15397},
year = {2024},
date = {2024-12-09},
urldate = {2024-12-09},
booktitle = {NeurIPS},
abstract = {Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO), that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for the creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
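A schematic of the optimization loop sketched in the abstract above: a language model proposes a new prompt given a history of earlier prompts and their scores. Both ask_llm and evaluate_prompt are hypothetical placeholders introduced here for illustration; this is not the paper's Prompt Optimization Prompt.

    def ask_llm(instruction: str) -> str:
        """Hypothetical placeholder for a call to a large language model."""
        return "A photo of a {}, a type of object."

    def evaluate_prompt(prompt: str) -> float:
        """Hypothetical placeholder: accuracy of a CLIP-style classifier with this template."""
        return 0.0

    def optimize_prompt(num_rounds: int = 10) -> str:
        history = []  # (prompt, score) pairs act as in-context information
        for _ in range(num_rounds):
            context = "\n".join(f"{p!r} -> {s:.3f}" for p, s in history)
            instruction = ("Previous prompts and their accuracies:\n" + context +
                           "\nPropose a better classification prompt template.")
            prompt = ask_llm(instruction)
            history.append((prompt, evaluate_prompt(prompt)))
        return max(history, key=lambda item: item[1])[0]  # best human-readable prompt

    print(optimize_prompt(num_rounds=3))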
Mohammadreza Salehi, Nikolaos Apostolikas, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection. In: ACCV, 2024, (Oral presentation). @inproceedings{SalehiACCV2024,
title = {Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection},
author = {Mohammadreza Salehi and Nikolaos Apostolikas and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://github.com/SMSD75/Redefining_Normal_ACCV24/tree/main
https://arxiv.org/abs/2412.11148},
year = {2024},
date = {2024-12-08},
urldate = {2024-12-08},
booktitle = {ACCV},
abstract = {In the realm of novelty detection, accurately identifying outliers in data without specific class information poses a significant challenge. While current methods excel in single-object scenarios, they struggle with multi-object situations due to their focus on individual objects. Our paper suggests a novel approach: redefining `normal' at the object level in training datasets. Rather than the usual image-level view, we consider the most dominant object in a dataset as the norm, offering a perspective that is more effective for real-world scenarios. Adapting to our object-level definition of `normal', we modify knowledge distillation frameworks, where a student network learns from a pre-trained teacher network. Our first contribution, DeFeND(Dense Feature Fine-tuning on Normal Data), integrates dense feature fine-tuning into the distillation process, allowing the teacher network to focus on object-level features with a self-supervised loss. The second is masked knowledge distillation, where the student network works with partially hidden inputs, honing its ability to deduce and generalize from incomplete data. This approach not only fares well in single-object novelty detection but also considerably surpasses existing methods in multi-object contexts.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Aozhu Chen, Hazel Doughty, Xirong Li, Cees G M Snoek: Beyond Coarse-Grained Matching in Video-Text Retrieval. In: ACCV, 2024, (Oral presentation). @inproceedings{ChenACCV2024,
title = {Beyond Coarse-Grained Matching in Video-Text Retrieval},
author = {Aozhu Chen and Hazel Doughty and Xirong Li and Cees G M Snoek},
url = {https://arxiv.org/abs/2410.12407},
year = {2024},
date = {2024-12-08},
urldate = {2024-12-08},
booktitle = {ACCV},
abstract = {Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model's ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face in distinguishing such subtle variations. To enhance fine-grained understanding, we propose a new baseline that can be easily combined with current methods. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
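A toy illustration of the single-word hard-negative generation described above. The substitution dictionary is invented here for illustration; the paper derives its variations automatically across nouns, verbs, adjectives, adverbs, and prepositions.

    SUBSTITUTIONS = {
        "dog": ["cat", "rabbit"],           # noun swaps
        "running": ["walking", "jumping"],  # verb swaps
        "into": ["past"],                   # preposition swaps
    }

    def hard_negatives(caption):
        """Yield captions differing from the original by exactly one word."""
        words = caption.split()
        for i, word in enumerate(words):
            for alternative in SUBSTITUTIONS.get(word, []):
                yield " ".join(words[:i] + [alternative] + words[i + 1:])

    for negative in hard_negatives("a dog running into the kitchen"):
        print(negative)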
Hazel Doughty, Fida Mohammad Thoker, Cees G M Snoek: LocoMotion: Learning Motion-Focused Video-Language Representations. In: ACCV, 2024, (Oral presentation). @inproceedings{DoughtyACCV2024,
title = {LocoMotion: Learning Motion-Focused Video-Language Representations},
author = {Hazel Doughty and Fida Mohammad Thoker and Cees G M Snoek},
url = {https://hazeldoughty.github.io/Papers/LocoMotion/
https://arxiv.org/abs/2410.12018},
year = {2024},
date = {2024-12-08},
urldate = {2024-12-08},
booktitle = {ACCV},
abstract = {This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We instead propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions. We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions. Furthermore, we propose verb-variation paraphrasing to increase the caption variety and learn the link between primitive motions and high-level verbs. With this, we are able to learn a motion-focused video-language representation. Experiments demonstrate our approach is effective for a variety of downstream tasks, particularly when limited data is available for fine-tuning.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
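A minimal sketch of the pairing described above: a synthetic linear motion is added to a video and its caption is generated directly from the motion parameters, with simple verb variation. The trajectory, caption template, and vocabulary below are assumptions for illustration, not the paper's pipeline.

    import random

    DIRECTIONS = {(1, 0): "right", (-1, 0): "left", (0, -1): "up", (0, 1): "down"}
    VERBS = ["moves", "slides", "drifts"]   # verb-variation paraphrasing

    def synthetic_motion(num_frames=16, speed=5):
        """Per-frame offsets for a pasted object plus a caption derived from the parameters."""
        direction = random.choice(list(DIRECTIONS))
        offsets = [(t * speed * direction[0], t * speed * direction[1])
                   for t in range(num_frames)]
        caption = (f"the object {random.choice(VERBS)} {DIRECTIONS[direction]} "
                   f"at {speed} pixels per frame")
        return offsets, caption

    offsets, caption = synthetic_motion()
    print(caption)
    print(offsets[:4])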
Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G M Snoek: QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain. arXiv:2411.19534, 2024. @unpublished{SunArxiv2024b,
title = {QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain},
author = {Wenfang Sun and Yingjun Du and Gaowen Liu and Cees G M Snoek},
url = {https://arxiv.org/abs/2411.19534},
year = {2024},
date = {2024-11-29},
abstract = {We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.},
howpublished = {arXiv:2411.19534},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight. In: International Journal of Computer Vision, 2024, (In press). @article{ZhangIJCV2024,
title = {Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2212.02053
https://link.springer.com/article/10.1007/s11263-024-02273-7},
year = {2024},
date = {2024-11-06},
urldate = {2024-01-01},
journal = {International Journal of Computer Vision},
abstract = {This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards the lower color contrast at test-time. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes easy to obtain unlabeled and task-irrelevant dark videos to improve an activity recognizer in low light. As the lower color contrast results in visual information loss, we further propose to incorporate the complementary activity information within audio, which is invariant to illumination. Since the usefulness of audio and visual features differs depending on the amount of illumination, we introduce our `darkness-adaptive' audio-visual recognizer. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate our proposals are superior to image enhancement, domain adaptation and alternative audio-visual fusion methods, and can even improve robustness to local darkness caused by occlusions. },
note = {In press},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
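A simplified sketch of illumination-dependent audio-visual fusion in the spirit of the abstract above: the darker the clip, the more weight the illumination-invariant audio stream receives. The linear weighting rule and toy tensors are invented for illustration and are not the paper's darkness-adaptive recognizer.

    import numpy as np

    def estimate_brightness(frames):
        """Mean pixel intensity in [0, 1] as a crude illumination proxy."""
        return float(np.clip(frames.mean(), 0.0, 1.0))

    def fuse(visual_logits, audio_logits, frames):
        w_visual = estimate_brightness(frames)   # trust vision more in bright scenes
        w_audio = 1.0 - w_visual                 # lean on audio as illumination drops
        return w_visual * visual_logits + w_audio * audio_logits

    rng = np.random.default_rng(0)
    frames = rng.uniform(0.0, 0.2, size=(8, 224, 224, 3))  # a dark clip, pixels in [0, 1]
    visual_logits, audio_logits = rng.normal(size=10), rng.normal(size=10)
    print(fuse(visual_logits, audio_logits, frames).round(2))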
Zehao Xiao, Cees G M Snoek: Beyond Model Adaptation at Test Time: A Survey. arXiv:2411.03687, 2024. @unpublished{XiaoArxiv2024,
title = {Beyond Model Adaptation at Test Time: A Survey},
author = {Zehao Xiao and Cees G M Snoek},
url = {https://arxiv.org/abs/2411.03687
https://github.com/zzzx1224/Beyond-model-adaptation-at-test-time-Papers},
year = {2024},
date = {2024-11-06},
urldate = {2024-11-06},
abstract = {Machine learning algorithms have achieved remarkable success across various disciplines, use cases and applications, under the prevailing assumption that training and test samples are drawn from the same distribution. Consequently, these algorithms struggle and become brittle even when samples in the test distribution start to deviate from the ones observed during training. Domain adaptation and domain generalization have been studied extensively as approaches to address distribution shifts across test and train domains, but each has its limitations. Test-time adaptation, a recently emerging learning paradigm, combines the benefits of domain adaptation and domain generalization by training models only on source data and adapting them to target data during test-time inference. In this survey, we provide a comprehensive and systematic review on test-time adaptation, covering more than 400 recent papers. We structure our review by categorizing existing methods into five distinct categories based on what component of the method is adjusted for test-time adaptation: the model, the inference, the normalization, the sample, or the prompt, providing detailed analysis of each. We further discuss the various preparation and adaptation settings for methods within these categories, offering deeper insights into the effective deployment for the evaluation of distribution shifts and their real-world application in understanding images, video and 3D, as well as modalities beyond vision. We close the survey with an outlook on emerging research opportunities for test-time adaptation.},
howpublished = {arXiv:2411.03687},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
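One concrete instance of the paradigm surveyed above, for readers who want the flavor in code: entropy minimization at test time with updates restricted to normalization parameters, in the spirit of Tent (Wang et al., ICLR 2021). A minimal sketch with a toy model, not code from the survey.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(32, 64), nn.BatchNorm1d(64), nn.ReLU(),
                          nn.Linear(64, 10))

    # Adapt only the affine parameters of normalization layers; everything else stays fixed.
    adapt_params = [p for m in model.modules() if isinstance(m, nn.BatchNorm1d)
                    for p in m.parameters()]
    optimizer = torch.optim.SGD(adapt_params, lr=1e-3)

    def entropy(logits):
        probs = logits.softmax(dim=-1)
        return -(probs * probs.log()).sum(dim=-1).mean()

    test_batch = torch.randn(16, 32)   # unlabeled target data arriving at test time
    for _ in range(3):                 # a few adaptation steps on the incoming batch
        loss = entropy(model(test_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(model(test_batch).argmax(dim=-1))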
Yingjun Du, Gaowen Liu, Yuzhang Shang, Yuguang Yao, Ramana Kompella, Cees G M Snoek: Prompt Diffusion Robustifies Any-Modality Prompt Learning. arXiv:2410.20164, 2024. @unpublished{DuArxiv24,
title = {Prompt Diffusion Robustifies Any-Modality Prompt Learning},
author = {Yingjun Du and Gaowen Liu and Yuzhang Shang and Yuguang Yao and Ramana Kompella and Cees G M Snoek},
url = {https://arxiv.org/abs/2410.20164},
year = {2024},
date = {2024-10-26},
urldate = {2024-10-26},
abstract = {Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine the prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain over-fitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained, prompt diffusion. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.},
howpublished = {arXiv:2410.20164},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G M Snoek, Yuki M Asano: TVBench: Redesigning Video-Language Evaluation. arXiv:2410.07752, 2024. @unpublished{CoresArxiv2024,
title = {TVBench: Redesigning Video-Language Evaluation},
author = {Daniel Cores and Michael Dorkenwald and Manuel Mucientes and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2410.07752},
year = {2024},
date = {2024-10-10},
abstract = {Large language models have demonstrated impressive performance when integrated with vision models even enabling video understanding. However, evaluating these video models presents its own unique challenges, for which several benchmarks have been proposed. In this paper, we show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative. As a solution, we propose TVBench, a novel open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluations that it requires a high level of temporal understanding. Surprisingly, we find that most recent state-of-the-art video-language models perform similarly to random performance on TVBench, with only Gemini-Pro and Tarsier clearly surpassing this baseline.},
howpublished = {arXiv:2410.07752},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: SIGMA: Sinkhorn-Guided Masked Video Modeling. In: ECCV, 2024. @inproceedings{SalehiECCV2024,
title = {SIGMA: Sinkhorn-Guided Masked Video Modeling},
author = {Mohammadreza Salehi and Michael Dorkenwald and Fida Mohammad Thoker and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://quva-lab.github.io/SIGMA/
https://arxiv.org/abs/2407.15447},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
booktitle = {ECCV},
abstract = {Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
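A small sketch of the Sinkhorn-Knopp step the abstract above builds on: feature-to-prototype similarities are turned into soft cluster assignments whose usage is balanced across the batch. Toy dimensions and a simplified normalization loop; this is not the SIGMA codebase.

    import torch
    import torch.nn.functional as F

    def sinkhorn(scores, eps=0.05, iters=3):
        """Balanced soft assignments from a (num_tubes x num_clusters) similarity matrix."""
        q = torch.exp(scores / eps)
        q /= q.sum()
        n, k = q.shape
        for _ in range(iters):
            q /= q.sum(dim=0, keepdim=True)   # equalize how often each cluster is used
            q /= k
            q /= q.sum(dim=1, keepdim=True)   # normalize per space-time tube
            q /= n
        return q * n                          # each row sums to 1

    tube_features = F.normalize(torch.randn(256, 128), dim=1)   # space-time tube features
    prototypes = F.normalize(torch.randn(16, 128), dim=1)       # learnable cluster centers
    assignments = sinkhorn(tube_features @ prototypes.T)
    print(assignments.sum(dim=1)[:4])   # ~1 per tube
    print(assignments.sum(dim=0))       # ~256/16 per cluster: evenly used clusters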
Sarah Rastegar, Mohammadreza Salehi, Yuki M Asano, Hazel Doughty, Cees G M Snoek: SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery. In: ECCV, 2024. @inproceedings{RastegarECCV2024,
title = {SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery},
author = {Sarah Rastegar and Mohammadreza Salehi and Yuki M Asano and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2408.14371
https://github.com/SarahRastegar/SelEx},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
booktitle = {ECCV},
abstract = {In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called `self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide `soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as `hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Luc Sträter, Mohammadreza Salehi, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features. In: ECCV, 2024. @inproceedings{StraterECCV2024,
title = {GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features},
author = {Luc Sträter and Mohammadreza Salehi and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2407.12427},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
booktitle = {ECCV},
abstract = {In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training set, like unseen objects in self-driving cars. In contrast, industrial anomalies are subtle defects that preserve semantic meaning, such as cracks in airplane components. In this paper, we present GeneralAD, an anomaly detection framework designed to operate in semantic, near-distribution, and industrial settings with minimal per-task adjustments. In our approach, we capitalize on the inherent design of Vision Transformers, which are trained on image patches, thereby ensuring that the last hidden states retain a patch-based structure. We propose a novel self-supervised anomaly generation module that applies straightforward operations, such as noise addition and shuffling, to patch features to construct pseudo-abnormal samples. These features are fed to an attention-based discriminator, which is trained to score every patch in the image. With this, our method can both accurately identify anomalies at the image level and generate interpretable anomaly maps. We extensively evaluated our approach on ten datasets, achieving state-of-the-art results in six and on-par performance on the remaining datasets for both localization and detection tasks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training set, like unseen objects in self-driving cars. In contrast, industrial anomalies are subtle defects that preserve semantic meaning, such as cracks in airplane components. In this paper, we present GeneralAD, an anomaly detection framework designed to operate in semantic, near-distribution, and industrial settings with minimal per-task adjustments. In our approach, we capitalize on the inherent design of Vision Transformers, which are trained on image patches, thereby ensuring that the last hidden states retain a patch-based structure. We propose a novel self-supervised anomaly generation module that applies straightforward operations, such as noise addition and shuffling, to patch features to construct pseudo-abnormal samples. These features are fed to an attention-based discriminator, which is trained to score every patch in the image. With this, our method can both accurately identify anomalies at the image level and generate interpretable anomaly maps. We extensively evaluated our approach on ten datasets, achieving state-of-the-art results in six and on-par performance on the remaining datasets for both localization and detection tasks. |
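As an illustration of the kind of patch-feature distortion the abstract describes, the snippet below adds noise to, and shuffles, a random subset of ViT patch tokens; shapes, magnitudes, and the returned patch labels are assumptions for the sketch, not the paper's implementation.

```python
import torch

def distort_patch_features(feats: torch.Tensor, noise_std: float = 0.5,
                           corrupt_frac: float = 0.25):
    """Turn normal ViT patch features into a pseudo-abnormal sample.

    feats: [num_patches, dim] last-hidden-state tokens of one image.
    Returns (distorted_feats, patch_labels), where patch_labels marks the
    corrupted patches (1 = pseudo-anomalous, 0 = untouched).
    """
    n, d = feats.shape
    out = feats.clone()
    labels = torch.zeros(n)
    idx = torch.randperm(n)[: max(1, int(corrupt_frac * n))]
    labels[idx] = 1.0
    out[idx] += noise_std * torch.randn(len(idx), d)        # noise addition
    out[idx] = out[idx[torch.randperm(len(idx))]]           # shuffle the corrupted patches
    return out, labels

# toy usage on random "patch features" (14x14 patches, ViT-B width)
distorted, labels = distort_patch_features(torch.randn(196, 768))
```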
| Sameer Ambekar, Zehao Xiao, Jiayi Shen, Xiantong Zhen, Cees G M Snoek: Probabilistic Test-Time Generalization by Variational Neighbor-Labeling. In: CoLLAs, 2024. @inproceedings{AmberkarColla2024,
title = {Probabilistic Test-Time Generalization by Variational Neighbor-Labeling},
author = {Sameer Ambekar and Zehao Xiao and Jiayi Shen and Xiantong Zhen and Cees G M Snoek},
url = {https://arxiv.org/abs/2307.04033},
year = {2024},
date = {2024-07-29},
urldate = {2023-07-15},
booktitle = {CoLLAs},
abstract = {This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains. We follow the strict separation of source training and target testing, but exploit the value of the unlabeled target data itself during inference. We make three contributions. First, we propose probabilistic pseudo-labeling of target samples to generalize the source-trained model to the target domain at test time. We formulate the generalization at test time as a variational inference problem, by modeling pseudo labels as distributions, to consider the uncertainty during generalization and alleviate the misleading signal of inaccurate pseudo labels. Second, we learn variational neighbor labels that incorporate the information of neighboring target samples to generate more robust pseudo labels. Third, to learn the ability to incorporate more representative target information and generate more precise and robust variational neighbor labels, we introduce a meta-generalization stage during training to simulate the generalization procedure. Experiments on seven widely-used datasets demonstrate the benefits, abilities, and effectiveness of our proposal.},
howpublished = {arXiv:2307.04033},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains. We follow the strict separation of source training and target testing, but exploit the value of the unlabeled target data itself during inference. We make three contributions. First, we propose probabilistic pseudo-labeling of target samples to generalize the source-trained model to the target domain at test time. We formulate the generalization at test time as a variational inference problem, by modeling pseudo labels as distributions, to consider the uncertainty during generalization and alleviate the misleading signal of inaccurate pseudo labels. Second, we learn variational neighbor labels that incorporate the information of neighboring target samples to generate more robust pseudo labels. Third, to learn the ability to incorporate more representative target information and generate more precise and robust variational neighbor labels, we introduce a meta-generalization stage during training to simulate the generalization procedure. Experiments on seven widely-used datasets demonstrate the benefits, abilities, and effectiveness of our proposal. |
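A deterministic caricature of neighbor-informed pseudo-labeling is sketched below: each target sample's label distribution is averaged with those of its nearest feature-space neighbors. The paper's variational treatment of pseudo labels as distributions and its meta-generalization stage are not represented; k and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def neighbor_pseudo_labels(feats: torch.Tensor, logits: torch.Tensor,
                           k: int = 5, temperature: float = 1.0) -> torch.Tensor:
    """Soften each target sample's pseudo-label with its k nearest neighbors.

    feats:  [n, d] target-batch features from the source-trained encoder.
    logits: [n, c] classifier logits for the same batch.
    Returns [n, c] pseudo-label distributions averaged over feature-space neighbors.
    """
    probs = F.softmax(logits / temperature, dim=1)
    normed = F.normalize(feats, dim=1)
    sim = normed @ normed.t()                          # cosine similarities
    knn = sim.topk(k + 1, dim=1).indices[:, 1:]        # drop the self match
    return probs[knn].mean(dim=1)

# toy usage on a random unlabeled target batch
soft_labels = neighbor_pseudo_labels(torch.randn(32, 128), torch.randn(32, 10))
```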
| Zenglin Shi, Pascal Mettes, Cees G M Snoek: Focus for Free in Density-Based Counting. In: International Journal of Computer Vision, vol. 132, iss. 7, pp. 2600-2617, 2024. @article{ShiIJCV2024,
title = {Focus for Free in Density-Based Counting},
author = {Zenglin Shi and Pascal Mettes and Cees G M Snoek},
url = {https://doi.org/10.1007/s11263-024-01990-3
https://arxiv.org/abs/2306.05129},
year = {2024},
date = {2024-07-01},
urldate = {2024-01-01},
journal = {International Journal of Computer Vision},
volume = {132},
issue = {7},
pages = {2600-2617},
abstract = {This work considers supervised learning to count from images and their corresponding point annotations. Where density-based counting methods typically use the point annotations only to create Gaussian-density maps, which act as the supervision signal, the starting point of this work is that point annotations have counting potential beyond density map generation. We introduce two methods that repurpose the available point annotations to enhance counting performance. The first is a counting-specific augmentation that leverages point annotations to simulate occluded objects in both input and density images to enhance the network's robustness to occlusions. The second method, foreground distillation, generates foreground masks from the point annotations, from which we train an auxiliary network on images with blacked-out backgrounds. By doing so, it learns to extract foreground counting knowledge without interference from the background. These methods can be seamlessly integrated with existing counting advances and are adaptable to different loss functions. We demonstrate complementary effects of the approaches, allowing us to achieve robust counting results even in challenging scenarios such as background clutter, occlusion, and varying crowd densities. Our proposed approach achieves strong counting results on multiple datasets, including ShanghaiTech Part_A and Part_B, UCF_QNRF, JHU-Crowd++, and NWPU-Crowd.},
howpublished = {arXiv:2306.05129},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This work considers supervised learning to count from images and their corresponding point annotations. Where density-based counting methods typically use the point annotations only to create Gaussian-density maps, which act as the supervision signal, the starting point of this work is that point annotations have counting potential beyond density map generation. We introduce two methods that repurpose the available point annotations to enhance counting performance. The first is a counting-specific augmentation that leverages point annotations to simulate occluded objects in both input and density images to enhance the network's robustness to occlusions. The second method, foreground distillation, generates foreground masks from the point annotations, from which we train an auxiliary network on images with blacked-out backgrounds. By doing so, it learns to extract foreground counting knowledge without interference from the background. These methods can be seamlessly integrated with existing counting advances and are adaptable to different loss functions. We demonstrate complementary effects of the approaches, allowing us to achieve robust counting results even in challenging scenarios such as background clutter, occlusion, and varying crowd densities. Our proposed approach achieves strong counting results on multiple datasets, including ShanghaiTech Part_A and Part_B, UCF_QNRF, JHU-Crowd++, and NWPU-Crowd. |
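For context, the density maps mentioned here are conventionally built by placing a unit of mass at every point annotation and smoothing it with a Gaussian, so the map integrates to the object count. The sketch below shows that standard construction (sigma and resolution are illustrative); it is not the paper's augmentation or distillation method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map_from_points(points, height, width, sigma=4.0):
    """Gaussian density map whose integral (approximately) equals the object count.

    points: iterable of (row, col) annotated object locations.
    """
    density = np.zeros((height, width), dtype=np.float32)
    for r, c in points:
        r, c = int(round(r)), int(round(c))
        if 0 <= r < height and 0 <= c < width:
            density[r, c] += 1.0                       # one unit of mass per annotation
    return gaussian_filter(density, sigma=sigma)       # spread each point into a blob

# toy usage: three annotated objects in a 64x64 image
dm = density_map_from_points([(10, 12), (30, 40), (50, 5)], 64, 64)
print(dm.sum())   # ~3.0; the count is preserved up to boundary effects
```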
| Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Low-Resource Vision Challenges for Foundation Models. In: CVPR, 2024, (Best paper FGVC2024 workshop.). @inproceedings{ZhangCVPR2024,
title = {Low-Resource Vision Challenges for Foundation Models},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2401.04716
https://xiaobai1217.github.io/Low-Resource-Vision/
https://uvaauas.figshare.com/articles/dataset/Low-Resource_Image_Transfer_Evaluation_Benchmark/25577145},
year = {2024},
date = {2024-06-17},
urldate = {2024-06-17},
booktitle = {CVPR},
abstract = {Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for machine learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we strive to address this gap and explore the challenges of low-resource image tasks with vision foundation models. Thus, we first collect a benchmark of genuinely low-resource image data, covering historic maps, circuit diagrams, and mechanical drawings. These low-resource settings all share the three challenges of data scarcity, fine-grained differences, and the distribution shift from natural images to the specialized domain of interest. While existing foundation models have shown impressive generalizability, we find they cannot transfer well to our low-resource tasks. To begin to tackle the challenges of low-resource vision, we introduce one simple baseline per challenge. Specifically, we propose to i) enlarge the data space by generative models, ii) adopt the best sub-kernels to encode local regions for fine-grained difference discovery and iii) learn attention for specialized domains. Experiments on the three low-resource data sources in our benchmark demonstrate our proposals already provide a better baseline than common transfer learning, data augmentation, and fine-grained methods. This highlights the unique characteristics and challenges of low-resource vision for foundation models that warrant further investigation.},
howpublished = {arXiv:2401.04716},
note = {Best paper FGVC2024 workshop.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for machine learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we strive to address this gap and explore the challenges of low-resource image tasks with vision foundation models. Thus, we first collect a benchmark of genuinely low-resource image data, covering historic maps, circuit diagrams, and mechanical drawings. These low-resource settings all share the three challenges of data scarcity, fine-grained differences, and the distribution shift from natural images to the specialized domain of interest. While existing foundation models have shown impressive generalizability, we find they cannot transfer well to our low-resource tasks. To begin to tackle the challenges of low-resource vision, we introduce one simple baseline per challenge. Specifically, we propose to i) enlarge the data space by generative models, ii) adopt the best sub-kernels to encode local regions for fine-grained difference discovery and iii) learn attention for specialized domains. Experiments on the three low-resource data sources in our benchmark demonstrate our proposals already provide a better baseline than common transfer learning, data augmentation, and fine-grained methods. This highlights the unique characteristics and challenges of low-resource vision for foundation models that warrant further investigation. |
| Michael Dorkenwald, Nimrod Barazani, Cees G M Snoek, Yuki M Asano: PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs. In: CVPR, 2024. @inproceedings{DorkenwaldCVPR2024,
title = {PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs},
author = {Michael Dorkenwald and Nimrod Barazani and Cees G M Snoek and Yuki M Asano},
url = {https://quva-lab.github.io/PIN/
https://arxiv.org/abs/2402.08657},
year = {2024},
date = {2024-06-17},
urldate = {2024-02-13},
booktitle = {CVPR},
abstract = {Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.},
howpublished = {arXiv:2402.08657},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons. |
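The core idea of an input-agnostic, learnable spatial prompt can be sketched as a single trainable tensor added to the frozen vision tokens before they reach the frozen language model; the class name, shapes, and initialization below are assumptions, not the released PIN module.

```python
import torch
import torch.nn as nn

class PositionalInsert(nn.Module):
    """Input-agnostic learnable tensor slid onto frozen patch tokens.

    Only this small parameter is trained (with plain next-token prediction);
    the VLM's vision and language weights stay frozen.
    """
    def __init__(self, num_patches: int, dim: int, init_scale: float = 0.02):
        super().__init__()
        self.insert = nn.Parameter(init_scale * torch.randn(1, num_patches, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [batch, num_patches, dim] from the frozen vision encoder
        return patch_tokens + self.insert

# toy usage: frozen features plus the trainable spatial prompt
frozen_tokens = torch.randn(2, 196, 1024)
prompted = PositionalInsert(num_patches=196, dim=1024)(frozen_tokens)
```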
| Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, Cees G M Snoek: Any-Shift Prompting for Generalization over Distributions. In: CVPR, 2024. @inproceedings{XiaoCVPR2024,
title = {Any-Shift Prompting for Generalization over Distributions},
author = {Zehao Xiao and Jiayi Shen and Mohammad Mahdi Derakhshani and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2402.10099},
year = {2024},
date = {2024-06-17},
urldate = {2024-02-15},
booktitle = {CVPR},
abstract = {Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts.},
howpublished = {arXiv:2402.10099},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts. |
| Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R Oswald, Cees G M Snoek, Xinlei Chen: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels. arXiv:2406.09415, 2024. @unpublished{NguyenArxiv2024,
title = {An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels},
author = {Duy-Kien Nguyen and Mahmoud Assran and Unnat Jain and Martin R Oswald and Cees G M Snoek and Xinlei Chen},
url = {https://arxiv.org/abs/2406.09415},
year = {2024},
date = {2024-06-13},
abstract = {This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of locality as an inductive bias in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.},
howpublished = {arXiv:2406.09415},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of locality as an inductive bias in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision. |
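A minimal sketch of the pixels-as-tokens idea: every pixel is linearly embedded into its own token and handed to a vanilla Transformer encoder, with learned position embeddings instead of any 16x16 patching. The dimensions and the tiny encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PixelTokenizer(nn.Module):
    """Map an image to a sequence with one token per pixel (no patching)."""
    def __init__(self, in_channels: int = 3, dim: int = 192, max_pixels: int = 28 * 28):
        super().__init__()
        self.proj = nn.Linear(in_channels, dim)               # each RGB pixel -> one token
        self.pos = nn.Parameter(torch.zeros(1, max_pixels, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, c, h, w = images.shape                              # [batch, channels, height, width]
        tokens = images.flatten(2).transpose(1, 2)             # [batch, h*w, channels]
        return self.proj(tokens) + self.pos[:, : h * w]

# toy usage: 28x28 images give 784 tokens for a vanilla Transformer encoder
seq = PixelTokenizer()(torch.randn(4, 3, 28, 28))
layer = nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(seq)
```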
| Sarah Rastegar, Hazel Doughty, Cees G M Snoek: Background No More: Action Recognition Across Domains by Causal Interventions. In: Computer Vision and Image Understanding, vol. 242, 2024. @article{RastegarCVIU2024,
title = {Background No More: Action Recognition Across Domains by Causal Interventions},
author = {Sarah Rastegar and Hazel Doughty and Cees G M Snoek},
url = {https://doi.org/10.1016/j.cviu.2024.103975},
year = {2024},
date = {2024-05-01},
urldate = {2024-01-01},
journal = {Computer Vision and Image Understanding},
volume = {242},
abstract = {We aim to recognize actions under an appearance distribution-shift between a source training-domain and target test-domain. To enable such video domain generalization, our key idea is to intervene on the action to remove the confounding effect of the domain-background on the class label using causal inference. Towards this, we propose to learn a causally debiased model on a source domain that intervenes on the action through three possible $Do$-operators which separate the action and background. To better align the source and target distributions we also introduce a test-time action intervention. Experiments on two challenging video domain generalization benchmarks reveal that causal inference is a promising tool for action recognition as it already achieves state-of-the-art results on Kinetics2Mimetics, the benchmark with the largest domain shift.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
We aim to recognize actions under an appearance distribution-shift between a source training-domain and target test-domain. To enable such video domain generalization, our key idea is to intervene on the action to remove the confounding effect of the domain-background on the class label using causal inference. Towards this, we propose to learn a causally debiased model on a source domain that intervenes on the action through three possible $Do$-operators which separate the action and background. To better align the source and target distributions we also introduce a test-time action intervention. Experiments on two challenging video domain generalization benchmarks reveal that causal inference is a promising tool for action recognition as it already achieves state-of-the-art results on Kinetics2Mimetics, the benchmark with the largest domain shift. |
| Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R Oswald, Alexander Kirillov, Cees G M Snoek, Xinlei Chen: R-MAE: Regions Meet Masked Autoencoders. In: ICLR, 2024. @inproceedings{NguyenICLR2024,
title = {R-MAE: Regions Meet Masked Autoencoders},
author = {Duy-Kien Nguyen and Vaibhav Aggarwal and Yanghao Li and Martin R Oswald and Alexander Kirillov and Cees G M Snoek and Xinlei Chen},
url = {https://arxiv.org/abs/2306.05411
https://github.com/facebookresearch/r-mae},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
booktitle = {ICLR},
abstract = {In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. |
| Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J Burghouts, Efstratios Gavves, Cees G M Snoek, David W Zhang: Graph Neural Networks for Learning Equivariant Representations of Neural Networks. In: ICLR, 2024, (Oral presentation). @inproceedings{KofinasICLR2024,
title = {Graph Neural Networks for Learning Equivariant Representations of Neural Networks},
author = {Miltiadis Kofinas and Boris Knyazev and Yan Zhang and Yunlu Chen and Gertjan J Burghouts and Efstratios Gavves and Cees G M Snoek and David W Zhang},
url = {https://github.com/mkofinas/neural-graphs
https://arxiv.org/abs/2403.12143},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
booktitle = {ICLR},
abstract = {Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. |
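The parameters-as-graph view can be made concrete for a plain MLP: one node per neuron (with its bias as the node feature) and one weighted edge per connection, so neuron permutations become graph isomorphisms that any permutation-equivariant GNN respects. The conversion below is a small illustrative sketch, not the authors' released code.

```python
import torch
import torch.nn as nn

def mlp_to_graph(mlp: nn.Sequential):
    """Return (node_features, edge_index, edge_weights) for an MLP's computational graph."""
    linears = [m for m in mlp if isinstance(m, nn.Linear)]
    sizes = [linears[0].in_features] + [lin.out_features for lin in linears]
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]      # first node id of each layer

    node_feats = torch.zeros(sum(sizes), 1)                    # biases as node features
    edges, weights = [], []
    for layer_idx, lin in enumerate(linears):
        start = offsets[layer_idx + 1]
        node_feats[start: start + lin.out_features, 0] = lin.bias.detach()
        for j in range(lin.out_features):                      # target neuron
            for i in range(lin.in_features):                   # source neuron
                edges.append((offsets[layer_idx] + i, start + j))
                weights.append(lin.weight.detach()[j, i])
    edge_index = torch.tensor(edges, dtype=torch.long).t()     # [2, num_edges]
    return node_feats, edge_index, torch.stack(weights)

nodes, edge_index, edge_w = mlp_to_graph(nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)))
```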
| Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G M Snoek: Training-Free Semantic Segmentation via LLM-Supervision. arXiv:2404.00701, 2024. @unpublished{SunArxiv2024,
title = {Training-Free Semantic Segmentation via LLM-Supervision},
author = {Wenfang Sun and Yingjun Du and Gaowen Liu and Ramana Kompella and Cees G M Snoek},
url = {https://arxiv.org/abs/2404.00701},
year = {2024},
date = {2024-04-01},
abstract = {Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptors. This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM) that does not require extra training. Our method starts from an LLM, like GPT-3, to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, resulting in diverse segmentation results tailored to each subclass's unique characteristics. Additionally, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method outperforms traditional text-supervised semantic segmentation methods by a marked margin.},
howpublished = {arXiv:2404.00701},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptors. This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM) that does not require extra training. Our method starts from an LLM, like GPT-3, to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, resulting in diverse segmentation results tailored to each subclass's unique characteristics. Additionally, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method outperforms traditional text-supervised semantic segmentation methods by a marked margin. |
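The assembly step described in the abstract can be pictured as merging per-subclass score maps back into their parent class, for instance by a pixelwise maximum; the function and subclass names below are hypothetical, and the LLM prompting and segmentation backbone are outside the snippet.

```python
import numpy as np

def assemble_subclass_maps(subclass_scores, subclass_to_class):
    """Merge per-subclass segmentation score maps into parent-class maps.

    subclass_scores:   {"tabby cat": HxW array, "siamese cat": HxW array, ...}
    subclass_to_class: {"tabby cat": "cat", ...}
    Each parent map is the pixelwise maximum over its subclasses, so evidence
    for any subclass can claim the pixel for the parent class.
    """
    merged = {}
    for sub, score in subclass_scores.items():
        parent = subclass_to_class[sub]
        merged[parent] = score if parent not in merged else np.maximum(merged[parent], score)
    return merged

# toy usage with hypothetical LLM-generated subclasses
h = w = 4
maps = {"tabby cat": np.random.rand(h, w), "siamese cat": np.random.rand(h, w), "poodle": np.random.rand(h, w)}
class_maps = assemble_subclass_maps(maps, {"tabby cat": "cat", "siamese cat": "cat", "poodle": "dog"})
```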
| Vincent Tao Hu, Di Wu, Yuki M Asano, Pascal Mettes, Basura Fernando, Björn Ommer, Cees G M Snoek: Flow Matching for Conditional Text Generation in a Few Sampling Steps. In: EACL, 2024. @inproceedings{HuEACL2024,
title = {Flow Matching for Conditional Text Generation in a Few Sampling Steps},
author = {Vincent Tao Hu and Di Wu and Yuki M Asano and Pascal Mettes and Basura Fernando and Björn Ommer and Cees G M Snoek},
url = {https://aclanthology.org/2024.eacl-short.33.pdf},
year = {2024},
date = {2024-03-27},
urldate = {2024-03-27},
booktitle = {EACL},
abstract = {Diffusion models are a promising tool for high-quality text generation. However, current models face multiple drawbacks including slow sampling, noise schedule sensitivity, and misalignment between the training and sampling stages. In this paper, we introduce FlowSeq, which bypasses all current drawbacks by leveraging flow matching for conditional text generation. FlowSeq can generate text in a few steps by training with a novel anchor loss, alleviating the need for expensive hyperparameter optimization of the noise schedule prevalent in diffusion models. We extensively evaluate our proposed method and show competitive performance in tasks such as question generation, open-domain dialogue, and paraphrasing.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Diffusion models are a promising tool for high-quality text generation. However, current models face multiple drawbacks including slow sampling, noise schedule sensitivity, and misalignment between the training and sampling stages. In this paper, we introduce FlowSeq, which bypasses all current drawbacks by leveraging flow matching for conditional text generation. FlowSeq can generate text in a few steps by training with a novel anchor loss, alleviating the need for expensive hyperparameter optimization of the noise schedule prevalent in diffusion models. We extensively evaluate our proposed method and show competitive performance in tasks such as question generation, open-domain dialogue, and paraphrasing. |
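The flow matching objective behind this line of work can be summarized in a few lines: sample a point on the straight path between noise and data and regress the constant velocity of that path. The sketch below is a generic conditional flow matching training step on continuous embeddings; FlowSeq's text-specific conditioning and anchor loss are not shown.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One conditional flow matching step: regress the velocity x1 - x0 along
    the straight path x_t = (1 - t) * x0 + t * x1.

    x1:   [batch, dim] target embeddings (e.g. of the answer tokens).
    cond: [batch, dim] conditioning embeddings (e.g. of the source sentence).
    """
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.size(0), 1)                  # one time per sample
    xt = (1 - t) * x0 + t * x1
    pred = velocity_net(torch.cat([xt, cond, t], dim=1))
    return ((pred - (x1 - x0)) ** 2).mean()

# toy usage with an illustrative MLP velocity field
dim = 32
net = nn.Sequential(nn.Linear(2 * dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
flow_matching_loss(net, torch.randn(16, dim), torch.randn(16, dim)).backward()
```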
| Vincent Tao Hu, David W Zhang, Mang Tang, Pascal Mettes, Deli Zhao, Cees G M Snoek: Latent Space Editing in Transformer-Based Flow Matching. In: AAAI Conference on Artificial Intelligence, 2024. @inproceedings{HuAAAI2024,
title = {Latent Space Editing in Transformer-Based Flow Matching},
author = {Vincent Tao Hu and David W Zhang and Mang Tang and Pascal Mettes and Deli Zhao and Cees G M Snoek},
url = {https://arxiv.org/abs/2312.10825},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
booktitle = {AAAI Conference on Artificial Intelligence},
abstract = {This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix. |
| Yingjun Du, Haoliang Sun, Xiantong Zhen, Jun Xu, Yilong Yin, Ling Shao, Cees G M Snoek: MetaKernel: Learning Variational Random Features with Limited Labels. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, iss. 3, pp. 1464-1478, 2024. @article{DuPAMI24,
title = {MetaKernel: Learning Variational Random Features with Limited Labels},
author = {Yingjun Du and Haoliang Sun and Xiantong Zhen and Jun Xu and Yilong Yin and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2105.03781},
doi = {https://doi.org/10.1109/TPAMI.2022.3154930},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {46},
issue = {3},
pages = {1464-1478},
abstract = {Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. The crux of few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. In this paper, we propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel. Specifically, we propose learning variational random features in a data-driven manner to obtain task-specific kernels by leveraging the shared knowledge provided by related tasks in a meta-learning setting. We treat the random feature basis as the latent variable, which is estimated by variational inference. The shared knowledge from related tasks is incorporated into a context inference of the posterior, which we achieve via a long-short term memory module. To establish more expressive kernels, we deploy conditional normalizing flows based on coupling layers to achieve a richer posterior distribution over random Fourier bases. The resultant kernels are more informative and discriminative, which further improves few-shot learning. To evaluate our method, we conduct extensive experiments on both few-shot image classification and regression tasks. A thorough ablation study demonstrates the effectiveness of each introduced component in our method. The benchmark results on fourteen datasets demonstrate MetaKernel consistently delivers at least comparable and often better performance than state-of-the-art alternatives.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. The crux of few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. In this paper, we propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel. Specifically, we propose learning variational random features in a data-driven manner to obtain task-specific kernels by leveraging the shared knowledge provided by related tasks in a meta-learning setting. We treat the random feature basis as the latent variable, which is estimated by variational inference. The shared knowledge from related tasks is incorporated into a context inference of the posterior, which we achieve via a long-short term memory module. To establish more expressive kernels, we deploy conditional normalizing flows based on coupling layers to achieve a richer posterior distribution over random Fourier bases. The resultant kernels are more informative and discriminative, which further improves few-shot learning. To evaluate our method, we conduct extensive experiments on both few-shot image classification and regression tasks. A thorough ablation study demonstrates the effectiveness of each introduced component in our method. The benchmark results on fourteen datasets demonstrate MetaKernel consistently delivers at least comparable and often better performance than state-of-the-art alternatives. |
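For reference, the classical random Fourier feature construction that the abstract's variational random features build on is sketched below; here the bases are drawn once and fixed, whereas MetaKernel infers them per task with variational inference. The lengthscale and feature count are illustrative.

```python
import math
import torch

def random_fourier_features(x: torch.Tensor, num_features: int = 256,
                            lengthscale: float = 1.0) -> torch.Tensor:
    """Map x so that z(x) @ z(y).T approximates the RBF kernel
    exp(-||x - y||^2 / (2 * lengthscale^2)).

    x: [n, d] inputs. Returns [n, num_features].
    """
    n, d = x.shape
    w = torch.randn(d, num_features) / lengthscale          # random Fourier bases
    b = 2 * math.pi * torch.rand(num_features)              # random phases
    return math.sqrt(2.0 / num_features) * torch.cos(x @ w + b)

# toy usage: the Gram matrix of the features approximates the RBF kernel
z = random_fourier_features(torch.randn(5, 8))
approx_kernel = z @ z.t()
```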
2023
|
| Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M Asano, Cees G M Snoek, Bjorn Ommer: Guided Diffusion from Self-Supervised Diffusion Features. arXiv:2312.08825, 2023. @unpublished{HuArxive2023,
title = {Guided Diffusion from Self-Supervised Diffusion Features},
author = {Vincent Tao Hu and Yunlu Chen and Mathilde Caron and Yuki M Asano and Cees G M Snoek and Bjorn Ommer},
url = {https://browse.arxiv.org/abs/2312.08825},
year = {2023},
date = {2023-12-14},
urldate = {2023-12-14},
abstract = {Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from the diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant-velocity ODE path to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baseline comparisons on large-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released.},
howpublished = {arXiv:2312.08825},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from the diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant-velocity ODE path to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baseline comparisons on large-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released. |
| Vincent Tao Hu, Wenzhe Yin, Pingchuan Ma, Yunlu Chen, Basura Fernando, Yuki M Asano, Efstratios Gavves, Pascal Mettes, Bjorn Ommer, Cees G. M. Snoek: Motion Flow Matching for Human Motion Synthesis and Editing. arXiv:2312.08895, 2023. @unpublished{HuArxive2023b,
title = {Motion Flow Matching for Human Motion Synthesis and Editing},
author = {Vincent Tao Hu and Wenzhe Yin and Pingchuan Ma and Yunlu Chen and Basura Fernando and Yuki M Asano and Efstratios Gavves and Pascal Mettes and Bjorn Ommer and Cees G. M. Snoek},
url = {https://browse.arxiv.org/abs/2312.08895},
year = {2023},
date = {2023-12-14},
urldate = {2023-12-14},
abstract = {Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT-style architectures demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose \emph{Motion Flow Matching}, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. What is more, we tailor a straightforward motion editing paradigm named \emph{sampling trajectory rewriting} that leverages ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing. Our code will be released.},
howpublished = {arXiv:2312.08895},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT-style architectures demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose \emph{Motion Flow Matching}, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. What is more, we tailor a straightforward motion editing paradigm named \emph{sampling trajectory rewriting} that leverages ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing. Our code will be released. |
| Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G M Snoek, Victor Rühle: Unlocking Spatial Comprehension in Text-to-Image Diffusion Models. arXiv:2311.17937, 2023. @unpublished{DerakhshaniArxive2023b,
title = {Unlocking Spatial Comprehension in Text-to-Image Diffusion Models},
author = {Mohammad Mahdi Derakhshani and Menglin Xia and Harkirat Behl and Cees G M Snoek and Victor Rühle},
url = {https://arxiv.org/abs/2311.17937},
year = {2023},
date = {2023-11-28},
urldate = {2023-11-28},
abstract = {We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as `An image of a gray cat on the left of an orange dog', and the generation of corresponding images. This is especially important for providing more control to the user. CompFuser overcomes the limitation of existing text-to-image diffusion models by decomposing the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters.},
howpublished = {arXiv:2311.17937},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as `An image of a gray cat on the left of an orange dog', and the generation of corresponding images. This is especially important for providing more control to the user. CompFuser overcomes the limitation of existing text-to-image diffusion models by decomposing the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters. |
| Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G M Snoek, Marcel Worring, Yuki M Asano: Small Visual Language Models can also be Open-Ended Few-Shot Learners. arXiv:2310.00500, 2023. @unpublished{DerakhshaniArxive2023,
title = {Small Visual Language Models can also be Open-Ended Few-Shot Learners},
author = {Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Cees G M Snoek and Marcel Worring and Yuki M Asano},
url = {https://arxiv.org/abs/2310.00500},
year = {2023},
date = {2023-09-30},
urldate = {2023-09-30},
abstract = {We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks. Specifically, our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct the `self-context', a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image for which the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research in open-ended few-shot learning that otherwise requires access to large or proprietary models.},
howpublished = {arXiv:2310.00500},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks. Specifically, our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct the `self-context', a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image for which the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research in open-ended few-shot learning that otherwise requires access to large or proprietary models. |
| Yingjun Du, Zehao Xiao, Shengcai Liao, Cees G M Snoek: ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion. In: NeurIPS, 2023. @inproceedings{DuNeurips2023,
title = {ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion},
author = {Yingjun Du and Zehao Xiao and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.14770},
year = {2023},
date = {2023-09-23},
urldate = {2023-06-27},
booktitle = {NeurIPS},
abstract = {Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling the overfitted prototypes for individual tasks to be obtained accurately. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.},
howpublished = {arXiv:2306.14770},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling the overfitted prototypes for individual tasks to be obtained accurately. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff. |
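As background for the abstract, the vanilla prototype that ProtoDiff starts from is simply the class-mean support embedding, with queries classified by their nearest prototype; the sketch below shows only that standard baseline, not the task-guided diffusion that refines it.

```python
import torch

def class_prototypes(support_feats: torch.Tensor, support_labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Vanilla prototypes: the mean support embedding per class, [num_classes, dim]."""
    return torch.stack([support_feats[support_labels == c].mean(dim=0) for c in range(num_classes)])

def classify_by_prototype(query_feats: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Assign each query to the class of its nearest prototype (squared Euclidean distance)."""
    return (-torch.cdist(query_feats, protos) ** 2).argmax(dim=1)

# toy 5-way, 3-shot episode with random embeddings
support = torch.randn(15, 64)
labels = torch.arange(5).repeat_interleave(3)
preds = classify_by_prototype(torch.randn(10, 64), class_prototypes(support, labels, 5))
```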
| Sarah Rastegar, Hazel Doughty, Cees G M Snoek: Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery. In: NeurIPS, 2023. @inproceedings{RastegarNeurips2023,
title = {Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery},
author = {Sarah Rastegar and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2310.19776
https://github.com/SarahRastegar/InfoSieve},
year = {2023},
date = {2023-09-23},
urldate = {2023-09-23},
booktitle = {NeurIPS},
abstract = {In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. |
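As a loose, hedged illustration only (the paper's actual objective, hierarchy modeling, and optimality analysis are not reproduced), one way to assign variable-length binary codes to instances is a small encoder with straight-through binarization and a code-length penalty; all names below are hypothetical.

# Loose illustration of instance-level binary "category codes" with a length
# penalty; in practice this would be combined with a clustering or
# self-supervised objective, which is omitted here.
import torch
import torch.nn as nn


class CodeEncoder(nn.Module):
    def __init__(self, in_dim, code_len=8):
        super().__init__()
        self.head = nn.Linear(in_dim, code_len)

    def forward(self, features):
        probs = torch.sigmoid(self.head(features))        # per-bit probabilities
        bits = (probs > 0.5).float()
        # Straight-through estimator: discrete bits forward, soft gradient back.
        bits = bits + probs - probs.detach()
        return bits, probs


def length_penalty(probs):
    """Encourage short codes by pushing later bits toward zero (uninformative)."""
    weights = torch.arange(1, probs.size(1) + 1, device=probs.device).float()
    return (probs * weights).mean()


encoder = CodeEncoder(in_dim=128)
feats = torch.randn(16, 128)
bits, probs = encoder(feats)
loss = length_penalty(probs)      # add to a clustering / self-supervised loss
print(bits[0], loss.item())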
| Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Learning Unseen Modality Interaction. In: NeurIPS, 2023. @inproceedings{ZhangNeurips2023,
title = {Learning Unseen Modality Interaction},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.12795
https://xiaobai1217.github.io/Unseen-Modality-Interaction/},
year = {2023},
date = {2023-09-22},
urldate = {2023-09-22},
booktitle = {NeurIPS},
abstract = {Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.},
howpublished = {arXiv:2306.12795},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. |
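A minimal PyTorch sketch of the core mechanism described above: each modality is projected into a common space and the available projections are fused by simple summation, so any modality combination can be handled at inference. The modality names and dimensions are assumptions, and the paper's pseudo-supervised reliability weighting is omitted.

# Minimal sketch of common-space projection plus summation fusion, assuming
# hypothetical per-modality feature extractors already exist.
import torch
import torch.nn as nn


class UnseenModalityFusion(nn.Module):
    def __init__(self, modality_dims, common_dim=256, num_classes=10):
        super().__init__()
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(dim, common_dim) for name, dim in modality_dims.items()}
        )
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, features):
        # `features` maps modality name -> (batch, dim); any subset may be present.
        projected = [self.projectors[name](x) for name, x in features.items()]
        fused = torch.stack(projected, dim=0).sum(dim=0)   # simple summation fusion
        return self.classifier(fused)


model = UnseenModalityFusion({"video": 512, "audio": 128, "depth": 64})
# Train with video+audio, test with audio+depth: the fusion is combination-agnostic.
logits = model({"audio": torch.randn(4, 128), "depth": torch.randn(4, 64)})
print(logits.shape)  # torch.Size([4, 10])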
| Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G M Snoek, Georgios Tzimiropoulos, Brais Martinez: Bayesian Prompt Learning for Image-Language Model Generalization. In: ICCV, 2023. @inproceedings{DerakhshaniICCV2023,
title = {Bayesian Prompt Learning for Image-Language Model Generalization},
author = {Mohammad Mahdi Derakhshani and Enrique Sanchez and Adrian Bulat and Victor Guilherme Turrisi da Costa and Cees G M Snoek and Georgios Tzimiropoulos and Brais Martinez},
url = {https://arxiv.org/abs/2210.02390},
year = {2023},
date = {2023-07-14},
urldate = {2023-03-14},
booktitle = {ICCV},
abstract = {Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains.},
howpublished = {arXiv:2210.02390},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. |
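A hedged sketch of the probabilistic prompt idea: the prompt tokens are modeled as a learnable Gaussian, sampled with the reparameterization trick, and regularized with a KL term. The exact variational objective and the image-conditioned variant from the paper are not reproduced; dimensions are illustrative.

# Sketch of treating prompt tokens as a distribution rather than a point estimate.
import torch
import torch.nn as nn


class BayesianPrompt(nn.Module):
    def __init__(self, num_tokens=4, dim=512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_tokens, dim))
        self.log_var = nn.Parameter(torch.zeros(num_tokens, dim))

    def sample(self):
        std = torch.exp(0.5 * self.log_var)
        eps = torch.randn_like(std)
        return self.mu + eps * std          # reparameterized prompt sample

    def kl_to_standard_normal(self):
        return 0.5 * torch.sum(
            torch.exp(self.log_var) + self.mu ** 2 - 1.0 - self.log_var
        )


prompt = BayesianPrompt()
tokens = prompt.sample()                    # prepend to the text encoder input
kl = prompt.kl_to_standard_normal()         # add (suitably scaled) to the task loss
print(tokens.shape, kl.item())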
| Aritra Bhowmik, Martin R Oswald, Yu Wang, Nora Baka, Cees G M Snoek: Detecting Objects with Graph Priors and Graph Refinement. In: ICCV, 2023. @inproceedings{BhowmikICCV2023,
title = {Detecting Objects with Graph Priors and Graph Refinement},
author = {Aritra Bhowmik and Martin R Oswald and Yu Wang and Nora Baka and Cees G M Snoek},
url = {https://arxiv.org/abs/2212.12395},
year = {2023},
date = {2023-07-14},
urldate = {2022-12-23},
booktitle = {ICCV},
abstract = {The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships.},
howpublished = {arXiv:2212.12395},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships. |
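For intuition, a minimal sketch of using a co-occurrence prior to refine initial per-box class scores is given below; the prior is estimated from training-set label statistics and the refinement is a simple propagation over the prior graph. This is only an illustration and leaves out the paper's energy-based modeling and joint bounding box regression.

# Sketch: estimate a class co-occurrence prior and use it to refine box scores.
import torch


def co_occurrence_prior(label_sets, num_classes):
    """Estimate a row-normalized class co-occurrence matrix from image labels."""
    counts = torch.zeros(num_classes, num_classes)
    for labels in label_sets:                       # classes present in one image
        for i in labels:
            for j in labels:
                if i != j:
                    counts[i, j] += 1
    return counts / counts.sum(dim=1, keepdim=True).clamp(min=1)


def refine_scores(box_scores, prior, alpha=0.2):
    """Mix each box's scores with scores propagated through the prior graph."""
    context = box_scores.mean(dim=0, keepdim=True) @ prior   # image-level context
    return (1 - alpha) * box_scores + alpha * context


prior = co_occurrence_prior([[0, 1], [0, 2], [1, 2]], num_classes=3)
scores = torch.softmax(torch.randn(5, 3), dim=1)   # 5 boxes, 3 classes
print(refine_scores(scores, prior))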
| Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek: Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization. In: ICCV, 2023. @inproceedings{ThokerICCV2023,
title = {Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization},
author = {Fida Mohammad Thoker and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2303.11003},
year = {2023},
date = {2023-07-14},
urldate = {2023-03-20},
booktitle = {ICCV},
abstract = {We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos which we refer to as tubelets. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions},
howpublished = {arXiv:2303.11003},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos, which we refer to as tubelets. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions. |
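A hedged sketch of the tubelet augmentation: the same patch is pasted along the same synthetic trajectory into two clips with different appearance, yielding a positive pair that shares only local motion. The scaling and rotation transformations and the contrastive objective itself are omitted; all names are illustrative.

# Sketch: overlay a patch moving along a synthetic trajectory onto two clips.
import torch


def add_tubelet(video, patch, start_xy, velocity):
    """Overlay `patch` on `video` (T, C, H, W) moving linearly over time."""
    T, _, H, W = video.shape
    ph, pw = patch.shape[-2:]
    out = video.clone()
    x, y = start_xy
    for t in range(T):
        xi = int(min(max(x + velocity[0] * t, 0), W - pw))
        yi = int(min(max(y + velocity[1] * t, 0), H - ph))
        out[t, :, yi:yi + ph, xi:xi + pw] = patch
    return out


clip_a = torch.rand(8, 3, 64, 64)
clip_b = torch.rand(8, 3, 64, 64)          # different appearance
patch = torch.rand(3, 12, 12)
# Same trajectory in both clips -> a positive pair with identical local motion.
pos_a = add_tubelet(clip_a, patch, start_xy=(5, 5), velocity=(4, 2))
pos_b = add_tubelet(clip_b, patch, start_xy=(5, 5), velocity=(4, 2))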
| Pengwan Yang, Cees G M Snoek, Yuki M Asano: Self-Ordering Point Clouds. In: ICCV, 2023, (Oral presentation). @inproceedings{YangICCV2023,
title = {Self-Ordering Point Clouds},
author = {Pengwan Yang and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2304.00961},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard to obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs an hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories.},
howpublished = {arXiv:2304.00961},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories. |
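As a rough illustration, the sketch below scores points with a small network and selects the top-k points while keeping gradients flowing through the scores (a straight-through shortcut); the paper's differentiable scoring-sorting strategy and hierarchical contrastive losses are considerably more elaborate.

# Sketch: per-point importance scoring with a gradient-friendly top-k selection.
import torch
import torch.nn as nn


class PointScorer(nn.Module):
    def __init__(self, dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, points):                     # (N, 3)
        return self.net(points).squeeze(-1)        # per-point importance score


def select_topk(points, scores, k):
    """Pick the k highest-scoring points; keep gradients flowing via the scores."""
    weights = torch.softmax(scores, dim=0)
    idx = torch.topk(scores, k).indices
    # Scale by normalized weights so the scorer receives a gradient signal.
    return points[idx] * weights[idx].unsqueeze(-1) / weights[idx].sum()


cloud = torch.randn(1024, 3)
scorer = PointScorer()
subset = select_topk(cloud, scorer(cloud), k=128)
print(subset.shape)                                # torch.Size([128, 3])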
| Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees G M Snoek: Order-preserving Consistency Regularization for Domain Adaptation and Generalization. In: ICCV, 2023. @inproceedings{JingICCV2023,
title = {Order-preserving Consistency Regularization for Domain Adaptation and Generalization},
author = {Mengmeng Jing and Xiantong Zhen and Jingjing Li and Cees G M Snoek},
url = {https://arxiv.org/abs/2309.13258},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lightning, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization are commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization enforces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lighting, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization is commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization forces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. Comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks. |
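A hedged sketch of what an order-preserving consistency loss can look like: pairs of classes whose probability ordering flips between two augmented views are penalized. This is an illustration of the constraint, not the paper's exact formulation.

# Sketch: penalize class pairs whose probability ordering disagrees across views.
import torch
import torch.nn.functional as F


def order_preserving_loss(logits_a, logits_b):
    p_a = F.softmax(logits_a, dim=-1)
    p_b = F.softmax(logits_b, dim=-1)
    # Pairwise differences: diff[i, j, k] = p[i, j] - p[i, k]
    diff_a = p_a.unsqueeze(2) - p_a.unsqueeze(1)
    diff_b = p_b.unsqueeze(2) - p_b.unsqueeze(1)
    # Penalize only where the sign (i.e., the class ordering) disagrees.
    violation = F.relu(-diff_a * diff_b)
    return violation.mean()


view_a = torch.randn(8, 10, requires_grad=True)    # logits for two views of the same images
view_b = view_a + 0.1 * torch.randn(8, 10)
print(order_preserving_loss(view_a, view_b).item())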
| Mohammadreza Salehi, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations. In: ICCV, 2023. @inproceedings{SalehiICCV2023,
title = {Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations},
author = {Mohammadreza Salehi and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2308.11796
https://github.com/SMSD75/Timetuning},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves not only the representation quality for videos-but also images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves not only the representation quality for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. |
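For intuition, a minimal sketch of a temporal-alignment clustering signal follows: dense per-frame features are softly assigned to learnable prototypes and consecutive frames are encouraged to agree on their assignments. For simplicity the sketch compares patches at the same spatial location across frames, whereas the paper aligns features over time; all dimensions are illustrative.

# Sketch: encourage consecutive frames to share soft cluster assignments.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalClusterLoss(nn.Module):
    def __init__(self, dim=384, num_prototypes=32, tau=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.tau = tau

    def forward(self, frame_feats):                       # (T, P, dim) patch features
        feats = F.normalize(frame_feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        assign = F.softmax(feats @ protos.t() / self.tau, dim=-1)   # (T, P, K)
        # Cross-entropy between each frame's assignments and the next frame's.
        log_next = torch.log(assign[1:] + 1e-8)
        return -(assign[:-1] * log_next).sum(-1).mean()


loss_fn = TemporalClusterLoss()
features = torch.randn(8, 196, 384)     # e.g., ViT patch tokens for 8 frames
print(loss_fn(features).item())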
| Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G M Snoek, Marcel Worring: Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models. In: MICCAI, 2023, (Oral presentation). @inproceedings{SonsbeekMICCAI2023,
title = {Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models},
author = {Tom van Sonsbeek and Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Cees G M Snoek and Marcel Worring},
url = {https://arxiv.org/abs/2303.05977},
year = {2023},
date = {2023-06-24},
urldate = {2023-06-24},
booktitle = {MICCAI},
abstract = {Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed-set of curated answers. We focus on open-ended VQA and motivated by the recent advances in language models consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely, Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by the recent advances in language models, consider it a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient. |
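A hedged sketch of the prompting mechanism: a small mapping network turns the extracted visual features into a few prefix embeddings that are concatenated with the embedded question and fed to a frozen language model. The mapper architecture, dimensions, and prefix length below are assumptions.

# Sketch: map visual features to prefix tokens that prompt a frozen language model.
import torch
import torch.nn as nn


class VisualPrefixMapper(nn.Module):
    def __init__(self, visual_dim=512, lm_dim=768, prefix_len=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mapper = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_len), nn.Tanh()
        )

    def forward(self, visual_feats):                      # (batch, visual_dim)
        prefix = self.mapper(visual_feats)
        return prefix.view(-1, self.prefix_len, self.lm_dim)


mapper = VisualPrefixMapper()
image_feats = torch.randn(2, 512)                         # from a frozen vision encoder
question_embeds = torch.randn(2, 20, 768)                 # embedded question tokens
prompt = torch.cat([mapper(image_feats), question_embeds], dim=1)
# `prompt` is fed to the frozen language model as its input embeddings, and only
# the mapper (plus any parameter-efficient adapters) is trained.
print(prompt.shape)                                       # torch.Size([2, 28, 768])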
| Tao Hu, William Thong, Pascal Mettes, Cees G M Snoek: Query by Activity Video in the Wild. In: ICIP, 2023. @inproceedings{HuICIP2023,
title = {Query by Activity Video in the Wild},
author = {Tao Hu and William Thong and Pascal Mettes and Cees G M Snoek},
year = {2023},
date = {2023-06-21},
urldate = {2023-06-21},
booktitle = {ICIP},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
|
| Haoliang Sun, Xiankai Lu, Haochen Wang, Yilong Yin, Xiantong Zhen, Cees G M Snoek, Ling Shao: Attentional Prototype Inference for Few-Shot Segmentation. In: Pattern Recognition, vol. 142, 2023. @article{SunPR23,
title = {Attentional Prototype Inference for Few-Shot Segmentation},
author = {Haoliang Sun and Xiankai Lu and Haochen Wang and Yilong Yin and Xiantong Zhen and Cees G M Snoek and Ling Shao},
url = {https://arxiv.org/abs/2105.06668},
year = {2023},
date = {2023-05-29},
urldate = {2021-04-30},
journal = {Pattern Recognition},
volume = {142},
abstract = {This paper aims to address few-shot segmentation. While existing prototype-based methods have achieved considerable success, they suffer from uncertainty and ambiguity caused by limited labeled examples. In this work, we propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation. We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution. The probabilistic modeling of the prototype enhances the model’s generalization ability by handling the inherent uncertainty caused by limited data and intra-class variations of objects. To further enhance the model, we introduce a local latent variable to represent the attention map of each query image, which enables the model to attend to foreground objects while suppressing the background. The optimization of the proposed model is formulated as a variational Bayesian inference problem, which is established by amortized inference networks. We conduct extensive experiments on four benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art prototype-based methods. We also provide comprehensive analyses and ablation studies to gain insight into the effectiveness of our method for few-shot segmentation.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This paper aims to address few-shot segmentation. While existing prototype-based methods have achieved considerable success, they suffer from uncertainty and ambiguity caused by limited labeled examples. In this work, we propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation. We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution. The probabilistic modeling of the prototype enhances the model’s generalization ability by handling the inherent uncertainty caused by limited data and intra-class variations of objects. To further enhance the model, we introduce a local latent variable to represent the attention map of each query image, which enables the model to attend to foreground objects while suppressing the background. The optimization of the proposed model is formulated as a variational Bayesian inference problem, which is established by amortized inference networks. We conduct extensive experiments on four benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art prototype-based methods. We also provide comprehensive analyses and ablation studies to gain insight into the effectiveness of our method for few-shot segmentation. |
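A minimal sketch of the probabilistic prototype: an amortized network predicts a Gaussian over the foreground prototype from masked support features, a prototype is sampled, and query locations are scored by similarity. The local attention latent variable and the full variational objective from the paper are omitted; all names and dimensions are illustrative.

# Sketch: amortized Gaussian posterior over the foreground prototype.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypePosterior(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)
        self.log_var_head = nn.Linear(dim, dim)

    def forward(self, support_feats, support_masks):       # (S, D, H, W), (S, 1, H, W)
        masked = (support_feats * support_masks).sum(dim=(2, 3))
        area = support_masks.sum(dim=(2, 3)).clamp(min=1)
        pooled = (masked / area).mean(dim=0)                # masked average pooling
        mu, log_var = self.mu_head(pooled), self.log_var_head(pooled)
        proto = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return proto, mu, log_var


posterior = PrototypePosterior()
support_feats = torch.randn(1, 256, 32, 32)
support_mask = torch.randint(0, 2, (1, 1, 32, 32)).float()
query_feats = torch.randn(1, 256, 32, 32)
proto, mu, log_var = posterior(support_feats, support_mask)
logits = F.cosine_similarity(
    query_feats, proto.view(1, 256, 1, 1).expand_as(query_feats), dim=1
)
print(logits.shape)                                         # torch.Size([1, 32, 32])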