2024
Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J Burghouts, Efstratios Gavves, Cees G M Snoek, David W Zhang: Graph Neural Networks for Learning Equivariant Representations of Neural Networks. In: ICLR, 2024, (Oral presentation).
@inproceedings{KofinasICLR2024,
title = {Graph Neural Networks for Learning Equivariant Representations of Neural Networks},
author = {Miltiadis Kofinas and Boris Knyazev and Yan Zhang and Yunlu Chen and Gertjan J Burghouts and Efstratios Gavves and Cees G M Snoek and David W Zhang},
url = {https://github.com/mkofinas/neural-graphs
https://arxiv.org/abs/2403.12143},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
booktitle = {ICLR},
abstract = {Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
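To make the computational-graph representation concrete, here is a minimal, hypothetical sketch of laying out an MLP's parameters as a graph, with biases as node features and weights as edge features; it illustrates the idea only and is not the authors' implementation (see the linked repository for that).

```python
# Sketch: encode an MLP's parameters as a graph whose nodes are neurons
# (bias as node feature) and whose edges are weights (weight as edge
# feature). Permuting hidden neurons then becomes a graph isomorphism.
import torch

def mlp_to_graph(weights, biases):
    """weights: list of [out, in] tensors; biases: list of [out] tensors."""
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    offsets = torch.tensor([0] + sizes[:-1]).cumsum(0)
    # Node features: one bias per neuron (input nodes get a zero bias).
    node_feat = torch.cat([torch.zeros(sizes[0])] + list(biases)).unsqueeze(-1)
    edge_index, edge_feat = [], []
    for l, w in enumerate(weights):
        out_dim, in_dim = w.shape
        src = offsets[l] + torch.arange(in_dim).repeat(out_dim)
        dst = offsets[l + 1] + torch.arange(out_dim).repeat_interleave(in_dim)
        edge_index.append(torch.stack([src, dst]))
        edge_feat.append(w.reshape(-1, 1))
    return node_feat, torch.cat(edge_index, dim=1), torch.cat(edge_feat)

# Example: a 2-16-1 MLP becomes a graph with 19 nodes and 48 edges.
ws = [torch.randn(16, 2), torch.randn(1, 16)]
bs = [torch.randn(16), torch.randn(1)]
nodes, edges, edge_attr = mlp_to_graph(ws, bs)
```

A GNN or transformer operating on this graph preserves exactly the permutation symmetry of the hidden neurons, which is the equivariance the paper targets.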
Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G M Snoek: Training-Free Semantic Segmentation via LLM-Supervision. arXiv:2404.00701, 2024.
@unpublished{SunArxiv2024,
title = {Training-Free Semantic Segmentation via LLM-Supervision},
author = {Wenfang Sun and Yingjun Du and Gaowen Liu and Ramana Kompella and Cees G M Snoek},
url = {https://arxiv.org/abs/2404.00701},
year = {2024},
date = {2024-04-01},
abstract = {Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptors. This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM) that does not require extra training. Our method starts by prompting an LLM, like GPT-3, to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, resulting in diverse segmentation results tailored to each subclass's unique characteristics. Additionally, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method outperforms traditional text-supervised semantic segmentation methods by a marked margin.},
howpublished = {arXiv:2404.00701},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
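As an illustration of the subclass-assembly step, the sketch below merges per-subclass score maps with a pixelwise maximum; the subclass names and the merge rule are assumptions for exposition, not the paper's exact procedure.

```python
# Sketch: merge per-subclass score maps (from any text-supervised segmenter)
# back into a single map for the parent class via a pixelwise maximum.
import numpy as np

def merge_subclass_maps(score_maps):
    """score_maps: dict mapping subclass name -> HxW score array."""
    return np.maximum.reduce(list(score_maps.values()))

# e.g. subclasses an LLM might propose for "dog" (illustrative only)
maps = {name: np.random.rand(64, 64)
        for name in ["golden retriever", "poodle", "bulldog"]}
dog_map = merge_subclass_maps(maps)  # 64x64 merged score map for "dog"
```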
Vincent Tao Hu, Di Wu, Yuki M Asano, Pascal Mettes, Basura Fernando, Björn Ommer, Cees G M Snoek: Flow Matching for Conditional Text Generation in a Few Sampling Steps. In: EACL, 2024.
@inproceedings{HuEACL2024,
title = {Flow Matching for Conditional Text Generation in a Few Sampling Steps},
author = {Vincent Tao Hu and Di Wu and Yuki M Asano and Pascal Mettes and Basura Fernando and Björn Ommer and Cees G M Snoek},
url = {https://aclanthology.org/2024.eacl-short.33.pdf},
year = {2024},
date = {2024-03-27},
urldate = {2024-03-27},
booktitle = {EACL},
abstract = {Diffusion models are a promising tool for high-quality text generation. However, current models face multiple drawbacks including slow sampling, noise schedule sensitivity, and misalignment between the training and sampling stages. In this paper, we introduce FlowSeq, which bypasses all current drawbacks by leveraging flow matching for conditional text generation. FlowSeq can generate text in a few steps by training with a novel anchor loss, alleviating the need for expensive hyperparameter optimization of the noise schedule prevalent in diffusion models. We extensively evaluate our proposed method and show competitive performance in tasks such as question generation, open-domain dialogue, and paraphrasing.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
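For background, a generic conditional flow-matching objective looks as follows; FlowSeq adds its novel anchor loss on top of this recipe, which is not reproduced here, and `v_net` is an assumed callable.

```python
# Minimal conditional flow-matching training step: regress a network onto
# the constant velocity (x1 - x0) along straight paths x_t = (1-t)x0 + t x1.
import torch

def flow_matching_loss(v_net, x1, cond):
    x0 = torch.randn_like(x1)           # noise sample
    t = torch.rand(x1.shape[0], 1)      # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # point on the straight path
    target = x1 - x0                    # constant velocity target
    return ((v_net(xt, t, cond) - target) ** 2).mean()
```

Because the target velocity is constant along each path, a well-trained model can be integrated in very few steps, which is what enables the few-step sampling claimed above.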
Vincent Tao Hu, David W Zhang, Mang Tang, Pascal Mettes, Deli Zhao, Cees G M Snoek: Latent Space Editing in Transformer-Based Flow Matching. In: AAAI Conference on Artificial Intelligence, 2024.
@inproceedings{HuAAAI2024,
title = {Latent Space Editing in Transformer-Based Flow Matching},
author = {Vincent Tao Hu and David W Zhang and Mang Tang and Pascal Mettes and Deli Zhao and Cees G M Snoek},
url = {https://arxiv.org/abs/2312.10825},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
booktitle = {AAAI Conference on Artificial Intelligence},
abstract = {This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but its latent structure and editing ability are as yet unknown. We therefore adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yingjun Du, Haoliang Sun, Xiantong Zhen, Jun Xu, Yilong Yin, Ling Shao, Cees G M Snoek: MetaKernel: Learning Variational Random Features with Limited Labels. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, iss. 3, pp. 1464-1478, 2024.
@article{DuPAMI24,
title = {MetaKernel: Learning Variational Random Features with Limited Labels},
author = {Yingjun Du and Haoliang Sun and Xiantong Zhen and Jun Xu and Yilong Yin and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2105.03781
https://github.com/YDU-uva/MetaKernel},
doi = {10.1109/TPAMI.2022.3154930},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {46},
issue = {3},
pages = {1464-1478},
abstract = {Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. The crux of few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. In this paper, we propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel. Specifically, we propose learning variational random features in a data-driven manner to obtain task-specific kernels by leveraging the shared knowledge provided by related tasks in a meta-learning setting. We treat the random feature basis as the latent variable, which is estimated by variational inference. The shared knowledge from related tasks is incorporated into a context inference of the posterior, which we achieve via a long short-term memory module. To establish more expressive kernels, we deploy conditional normalizing flows based on coupling layers to achieve a richer posterior distribution over random Fourier bases. The resultant kernels are more informative and discriminative, which further improves the few-shot learning. To evaluate our method, we conduct extensive experiments on both few-shot image classification and regression tasks. A thorough ablation study demonstrates the effectiveness of each introduced component in our method. The benchmark results on fourteen datasets demonstrate that MetaKernel consistently delivers at least comparable and often better performance than state-of-the-art alternatives.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
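For context, the fixed (non-meta-learned) random Fourier feature construction that MetaKernel builds on can be sketched in a few lines; MetaKernel's contribution is to infer the bases W per task via variational inference rather than sampling them once, which is not shown here.

```python
# Background sketch: random Fourier features approximate a shift-invariant
# kernel as an inner product, k(x, y) ~ phi(x) . phi(y).
import numpy as np

def rff(X, W, b):
    """X: [n, d] inputs; W: [D, d] bases; b: [D] phases -> [n, D] features."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(256, 3))              # Gaussian bases ~ RBF kernel
b = rng.uniform(0, 2 * np.pi, size=256)
K_approx = rff(X, W, b) @ rff(X, W, b).T   # approximates exp(-||x-y||^2 / 2)
```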
2023
Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M Asano, Cees G M Snoek, Bjorn Ommer: Guided Diffusion from Self-Supervised Diffusion Features. arXiv:2312.08825, 2023.
@unpublished{HuArxive2023,
title = {Guided Diffusion from Self-Supervised Diffusion Features},
author = {Vincent Tao Hu and Yunlu Chen and Mathilde Caron and Yuki M Asano and Cees G M Snoek and Bjorn Ommer},
url = {https://browse.arxiv.org/abs/2312.08825},
year = {2023},
date = {2023-12-14},
urldate = {2023-12-14},
abstract = {Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from the diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant velocity path of ODE to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baselines on large-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released.},
howpublished = {arXiv:2312.08825},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
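The Sinkhorn-Knopp step mentioned above is commonly implemented as an alternating normalization that balances soft cluster assignments; the sketch below follows that standard recipe and may differ in detail from the paper's exact usage.

```python
# Sketch of Sinkhorn-Knopp normalization for balanced soft cluster
# assignments, as popularized in self-supervised clustering methods.
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """scores: [n, K] similarity logits -> [n, K] balanced soft assignments."""
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True)   # equalize cluster marginals
        Q = Q / Q.shape[1]
        Q = Q / Q.sum(dim=1, keepdim=True)   # each sample's mass sums to one
        Q = Q / Q.shape[0]
    return Q * Q.shape[0]                    # rows now sum to ~1
```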
Vincent Tao Hu, Wenzhe Yin, Pingchuan Ma, Yunlu Chen, Basura Fernando, Yuki M Asano, Efstratios Gavves, Pascal Mettes, Bjorn Ommer, Cees G. M. Snoek: Motion Flow Matching for Human Motion Synthesis and Editing. arXiv:2312.08895, 2023.
@unpublished{HuArxive2023b,
title = {Motion Flow Matching for Human Motion Synthesis and Editing},
author = {Vincent Tao Hu and Wenzhe Yin and Pingchuan Ma and Yunlu Chen and Basura Fernando and Yuki M Asano and Efstratios Gavves and Pascal Mettes and Bjorn Ommer and Cees G. M. Snoek},
url = {https://browse.arxiv.org/abs/2312.08895},
year = {2023},
date = {2023-12-14},
urldate = {2023-12-14},
abstract = {Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT-style architectures demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose \emph{Motion Flow Matching}, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. What is more, we tailor a straightforward motion editing paradigm named \emph{sampling trajectory rewriting}, leveraging ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing. Our code will be released.},
howpublished = {arXiv:2312.08895},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
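The few-step sampling enabled by ODE-style models can be illustrated with a plain Euler integrator over a learned velocity field; `v_net` is an assumed callable, and the paper's sampling-trajectory-rewriting edits operate on top of a loop like this.

```python
# Few-step sampling sketch for an ODE-style generative model: integrate the
# learned velocity field with Euler steps from noise (t=0) to data (t=1),
# here with the ten-step budget mentioned in the abstract.
import torch

@torch.no_grad()
def euler_sample(v_net, shape, cond, steps=10):
    x = torch.randn(shape)               # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1), i * dt)
        x = x + dt * v_net(x, t, cond)   # follow the velocity field
    return x                             # approximate data sample at t = 1
```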
Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G M Snoek, Victor Rühle: Unlocking Spatial Comprehension in Text-to-Image Diffusion Models. arXiv:2311.17937, 2023.
@unpublished{DerakhshaniArxive2023b,
title = {Unlocking Spatial Comprehension in Text-to-Image Diffusion Models},
author = {Mohammad Mahdi Derakhshani and Menglin Xia and Harkirat Behl and Cees G M Snoek and Victor Rühle},
url = {https://arxiv.org/abs/2311.17937},
year = {2023},
date = {2023-11-28},
urldate = {2023-11-28},
abstract = {We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as `An image of a gray cat on the left of an orange dog', and generates corresponding images. This is especially important in order to provide more control to the user. CompFuser overcomes the limitation of existing text-to-image diffusion models by decoding the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters.},
howpublished = {arXiv:2311.17937},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G M Snoek, Marcel Worring, Yuki M Asano: Small Visual Language Models can also be Open-Ended Few-Shot Learners. arXiv:2310.00500, 2023.
@unpublished{DerakhshaniArxive2023,
title = {Small Visual Language Models can also be Open-Ended Few-Shot Learners},
author = {Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Cees G M Snoek and Marcel Worring and Yuki M Asano},
url = {https://arxiv.org/abs/2310.00500},
year = {2023},
date = {2023-09-30},
urldate = {2023-09-30},
abstract = {We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks. Specifically, our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct the `self-context', a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image for which the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research in open-ended few-shot learning that otherwise requires access to large or proprietary models.},
howpublished = {arXiv:2310.00500},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
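A hypothetical sketch of the 'self-context' construction described above: clusters (for example from k-means over image embeddings) receive semantically unrelated pseudo-names, and each episode interleaves support pairs with a held-out query. All names and helpers here are illustrative, not the paper's code.

```python
# Sketch: build one self-supervised few-shot episode from image clusters
# whose pseudo-names carry no real semantics.
import random

def build_self_context(cluster_members, pseudo_names, k_way=2, shots=2):
    """cluster_members: dict cluster_id -> list of image ids."""
    chosen = random.sample(sorted(cluster_members), k_way)
    pairs = [(img, pseudo_names[c]) for c in chosen
             for img in random.sample(cluster_members[c], shots)]
    random.shuffle(pairs)
    *support, (query_img, target) = pairs
    return support, query_img, target  # model must caption query as `target`

clusters = {0: ["img1", "img2", "img3"], 1: ["img4", "img5", "img6"]}
names = {0: "blicket", 1: "dax"}
support, query, answer = build_self_context(clusters, names)
```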
Yingjun Du, Zehao Xiao, Shengcai Liao, Cees G M Snoek: ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion. In: NeurIPS, 2023.
@inproceedings{DuNeurips2023,
title = {ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion},
author = {Yingjun Du and Zehao Xiao and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.14770
https://github.com/YDU-uva/ProtoDiff},
year = {2023},
date = {2023-09-23},
urldate = {2023-09-23},
booktitle = {NeurIPS},
abstract = {Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, making it possible to accurately obtain the overfitted prototypes for individual tasks. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Moreover, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.},
howpublished = {arXiv:2306.14770},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
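For reference, the deterministic average-prototype baseline that ProtoDiff improves on is only a few lines; ProtoDiff instead generates a task-specific (residual) prototype with a task-guided diffusion model, which is not shown here.

```python
# Prototypical-network baseline: a class prototype is the mean of its
# support embeddings; queries are classified by nearest prototype.
import torch

def mean_prototypes(support_emb, support_labels, n_classes):
    """support_emb: [n, d]; support_labels: [n] -> [n_classes, d] means."""
    protos = torch.zeros(n_classes, support_emb.shape[1])
    for c in range(n_classes):
        protos[c] = support_emb[support_labels == c].mean(dim=0)
    return protos

def classify(query_emb, protos):
    # nearest prototype by (negative) Euclidean distance
    return (-torch.cdist(query_emb, protos)).argmax(dim=1)
```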
Sarah Rastegar, Hazel Doughty, Cees G M Snoek: Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery. In: NeurIPS, 2023.
@inproceedings{RastegarNeurips2023,
title = {Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery},
author = {Sarah Rastegar and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2310.19776
https://github.com/SarahRastegar/InfoSieve},
year = {2023},
date = {2023-09-23},
urldate = {2023-09-23},
booktitle = {NeurIPS},
abstract = {In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yunhua Zhang, Hazel Doughty, Cees G M Snoek: Learning Unseen Modality Interaction. In: NeurIPS, 2023.
@inproceedings{ZhangNeurips2023,
title = {Learning Unseen Modality Interaction},
author = {Yunhua Zhang and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.12795
https://xiaobai1217.github.io/Unseen-Modality-Interaction/},
year = {2023},
date = {2023-09-22},
urldate = {2023-09-22},
booktitle = {NeurIPS},
abstract = {Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.},
howpublished = {arXiv:2306.12795},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
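A minimal sketch of the projection-and-sum module described above, assuming per-modality linear projections into a common space; the dimensions are illustrative and the pseudo-supervision component is omitted.

```python
# Sketch: project each modality into one shared space, then accumulate
# whichever modalities happen to be available at inference with a plain sum.
import torch
import torch.nn as nn

class SharedSpaceFusion(nn.Module):
    def __init__(self, dims, d_common=256):
        super().__init__()
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, d_common) for m, d in dims.items()})

    def forward(self, feats):
        """feats: dict modality -> [batch, d_m] tensor (any subset)."""
        return sum(self.proj[m](x) for m, x in feats.items())

fusion = SharedSpaceFusion({"video": 1024, "audio": 512, "depth": 128})
# an unseen combination at inference still works:
out = fusion({"audio": torch.randn(2, 512), "depth": torch.randn(2, 128)})
```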
Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G M Snoek, Georgios Tzimiropoulos, Brais Martinez: Bayesian Prompt Learning for Image-Language Model Generalization. In: ICCV, 2023.
@inproceedings{DerakhshaniICCV2023,
title = {Bayesian Prompt Learning for Image-Language Model Generalization},
author = {Mohammad Mahdi Derakhshani and Enrique Sanchez and Adrian Bulat and Victor Guilherme Turrisi da Costa and Cees G M Snoek and Georgios Tzimiropoulos and Brais Martinez},
url = {https://arxiv.org/abs/2210.02390},
year = {2023},
date = {2023-07-14},
urldate = {2023-03-14},
booktitle = {ICCV},
abstract = {Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains.},
howpublished = {arXiv:2210.02390},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
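The variational formulation can be sketched with a Gaussian posterior over the prompt context, trained via the reparameterization trick with a KL penalty toward a standard-normal prior; this is a simplified, unconditional variant for illustration, not the paper's full model.

```python
# Sketch: a prompt context treated as a Gaussian latent variable; sampling
# is reparameterized so gradients flow to the variational parameters.
import torch
import torch.nn as nn

class BayesianPrompt(nn.Module):
    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_ctx, dim))
        self.log_var = nn.Parameter(torch.zeros(n_ctx, dim))

    def forward(self):
        std = torch.exp(0.5 * self.log_var)
        prompt = self.mu + std * torch.randn_like(std)  # reparameterize
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over context tokens
        kl = 0.5 * (self.mu ** 2 + std ** 2 - self.log_var - 1).sum()
        return prompt, kl  # add beta * kl to the task loss
```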
Aritra Bhowmik, Martin R Oswald, Yu Wang, Nora Baka, Cees G M Snoek: Detecting Objects with Graph Priors and Graph Refinement. In: ICCV, 2023.
@inproceedings{BhowmikICCV2023,
title = {Detecting Objects with Graph Priors and Graph Refinement},
author = {Aritra Bhowmik and Martin R Oswald and Yu Wang and Nora Baka and Cees G M Snoek},
url = {https://arxiv.org/abs/2212.12395},
year = {2023},
date = {2023-07-14},
urldate = {2022-12-23},
booktitle = {ICCV},
abstract = {The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy-based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships.},
howpublished = {arXiv:2212.12395},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
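The co-occurrence prior can be illustrated as a row-normalized co-occurrence matrix over class labels; this sketch shows one plausible estimator, not the paper's exact construction.

```python
# Sketch: estimate a co-occurrence graph prior from label statistics, i.e.
# the normalized frequency with which classes appear together in an image.
import numpy as np

def cooccurrence_prior(label_sets, n_classes):
    """label_sets: iterable of per-image sets of class ids -> [K, K] prior."""
    C = np.zeros((n_classes, n_classes))
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    C[i, j] += 1
    row = C.sum(axis=1, keepdims=True)
    return np.divide(C, row, out=np.zeros_like(C), where=row > 0)
```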
Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek: Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization. In: ICCV, 2023.
@inproceedings{ThokerICCV2023,
title = {Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization},
author = {Fida Mohammad Thoker and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2303.11003},
year = {2023},
date = {2023-07-14},
urldate = {2023-03-20},
booktitle = {ICCV},
abstract = {We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos which we refer to as tubelets. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data-efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions.},
howpublished = {arXiv:2303.11003},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Pengwan Yang, Cees G M Snoek, Yuki M Asano: Self-Ordering Point Clouds. In: ICCV, 2023, (Oral presentation).
@inproceedings{YangICCV2023,
title = {Self-Ordering Point Clouds},
author = {Pengwan Yang and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2304.00961},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories.},
howpublished = {arXiv:2304.00961},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
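Differentiable scoring-sorting schemes typically relax ranks so that gradients reach the scorer; the sketch below uses a standard pairwise-sigmoid relaxation as one such instance (the paper's strategy differs in detail).

```python
# Sketch: soft ranks from pairwise score comparisons. As tau -> 0 the soft
# ranks approach the hard ranks, while staying differentiable in the scores.
import torch

def soft_ranks(scores, tau=0.1):
    """scores: [n] point scores -> differentiable ranks in [0, n-1]."""
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)    # diff[a, b] = s_b - s_a
    return torch.sigmoid(diff / tau).sum(dim=0) - 0.5   # subtract self term
```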
Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees G M Snoek: Order-preserving Consistency Regularization for Domain Adaptation and Generalization. In: ICCV, 2023.
@inproceedings{JingICCV2023,
title = {Order-preserving Consistency Regularization for Domain Adaptation and Generalization},
author = {Mengmeng Jing and Xiantong Zhen and Jingjing Li and Cees G M Snoek},
url = {https://arxiv.org/abs/2309.13258},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lighting, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization are commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization enforces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
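One way to make the order-preserving idea concrete is to penalize class pairs whose probability ordering flips between the two views, as in this simplified sketch (the paper's regularizer is more refined).

```python
# Sketch: penalty that is nonzero only for class pairs whose probability
# ordering disagrees between two augmented views of the same image.
import torch

def order_preserving_loss(p1, p2):
    """p1, p2: [batch, K] class probabilities for two views of each image."""
    d1 = p1.unsqueeze(2) - p1.unsqueeze(1)   # [B, K, K] pairwise gaps, view 1
    d2 = p2.unsqueeze(2) - p2.unsqueeze(1)   # same pairs for view 2
    return torch.relu(-d1 * d2).mean()       # positive only when order flips
```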
Mohammadreza Salehi, Efstratios Gavves, Cees G M Snoek, Yuki M Asano: Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations. In: ICCV, 2023.
@inproceedings{SalehiICCV2023,
title = {Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations},
author = {Mohammadreza Salehi and Efstratios Gavves and Cees G M Snoek and Yuki M Asano},
url = {https://arxiv.org/abs/2308.11796
https://github.com/SMSD75/Timetuning},
year = {2023},
date = {2023-07-14},
urldate = {2023-07-14},
booktitle = {ICCV},
abstract = {Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves the representation quality not only for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G M Snoek, Marcel Worring: Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models. In: MICCAI, 2023, (Oral presentation).
@inproceedings{SonsbeekMICCAI2023,
title = {Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models},
author = {Tom van Sonsbeek and Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Cees G M Snoek and Marcel Worring},
url = {https://arxiv.org/abs/2303.05977},
year = {2023},
date = {2023-06-24},
urldate = {2023-06-24},
booktitle = {MICCAI},
abstract = {Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed-set of curated answers. We focus on open-ended VQA and motivated by the recent advances in language models consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely, Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.},
note = {Oral presentation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
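The mapping network described above can be sketched as a small module that turns image features into a visual prefix for a frozen language model; names and sizes here are illustrative assumptions.

```python
# Sketch: map extracted visual features to a few learnable token embeddings
# and prepend them to the question embeddings before a frozen language model.
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    def __init__(self, d_visual=768, d_lm=768, n_tokens=8):
        super().__init__()
        self.n_tokens, self.d_lm = n_tokens, d_lm
        self.map = nn.Sequential(
            nn.Linear(d_visual, n_tokens * d_lm), nn.Tanh())

    def forward(self, img_feat, question_emb):
        """img_feat: [B, d_visual]; question_emb: [B, T, d_lm]."""
        prefix = self.map(img_feat).view(-1, self.n_tokens, self.d_lm)
        return torch.cat([prefix, question_emb], dim=1)  # feed to frozen LM
```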
Tao Hu, William Thong, Pascal Mettes, Cees G M Snoek: Query by Activity Video in the Wild. In: ICIP, 2023.
@inproceedings{HuICIP2023,
title = {Query by Activity Video in the Wild},
author = {Tao Hu and William Thong and Pascal Mettes and Cees G M Snoek},
year = {2023},
date = {2023-06-21},
urldate = {2023-06-21},
booktitle = {ICIP},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Haoliang Sun, Xiankai Lu, Haochen Wang, Yilong Yin, Xiantong Zhen, Cees G M Snoek, Ling Shao: Attentional Prototype Inference for Few-Shot Segmentation. In: Pattern Recognition, vol. 142, 2023.
@article{SunPR23,
title = {Attentional Prototype Inference for Few-Shot Segmentation},
author = {Haoliang Sun and Xiankai Lu and Haochen Wang and Yilong Yin and Xiantong Zhen and Cees G M Snoek and Ling Shao},
url = {https://arxiv.org/abs/2105.06668},
year = {2023},
date = {2023-05-29},
urldate = {2021-04-30},
journal = {Pattern Recognition},
volume = {142},
abstract = {This paper aims to address few-shot segmentation. While existing prototype-based methods have achieved considerable success, they suffer from uncertainty and ambiguity caused by limited labeled examples. In this work, we propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation. We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution. The probabilistic modeling of the prototype enhances the model’s generalization ability by handling the inherent uncertainty caused by limited data and intra-class variations of objects. To further enhance the model, we introduce a local latent variable to represent the attention map of each query image, which enables the model to attend to foreground objects while suppressing the background. The optimization of the proposed model is formulated as a variational Bayesian inference problem, which is established by amortized inference networks. We conduct extensive experiments on four benchmarks, where our proposal obtains performance that is at least competitive with, and often better than, state-of-the-art prototype-based methods. We also provide comprehensive analyses and ablation studies to gain insight into the effectiveness of our method for few-shot segmentation.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
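A minimal sketch of the probabilistic prototype idea, assuming amortized Gaussian inference over a pooled support representation; the layer names and the cosine-similarity readout are illustrative choices, not the published API model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticPrototype(nn.Module):
    # Amortized Gaussian inference of a global prototype from pooled
    # (foreground) support features; a schematic reading of API.
    def __init__(self, dim=256):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, support_feats):        # (N, dim)
        pooled = support_feats.mean(dim=0)
        mu, logvar = self.mu(pooled), self.logvar(pooled)
        eps = torch.randn_like(mu)           # reparameterization trick
        prototype = mu + torch.exp(0.5 * logvar) * eps
        return prototype, mu, logvar         # mu/logvar feed a KL term

support = torch.randn(100, 256)              # masked support features
prototype, mu, logvar = ProbabilisticPrototype()(support)
query = torch.randn(1, 256, 32, 32)          # query feature map
logits = F.cosine_similarity(query, prototype.view(1, 256, 1, 1), dim=1)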
|
 | Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G M Snoek: EMO: Episodic Memory Optimization for Few-Shot Meta-Learning. In: CoLLAs, 2023, (Oral presentation, top 12 papers.). @inproceedings{DuColla2023,
title = {EMO: Episodic Memory Optimization for Few-Shot Meta-Learning},
author = {Yingjun Du and Jiayi Shen and Xiantong Zhen and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.05189
https://github.com/YDU-uva/EMO},
year = {2023},
date = {2023-05-16},
urldate = {2023-05-16},
booktitle = {CoLLAs},
abstract = {For few-shot meta-learning, gradient descent optimization is challenging due to the limited number of training samples per task. Inspired by the human ability to recall past learning experiences from the brain's memory, we propose an episodic memory optimization for meta-learning, called EMO, which retains the gradient history of previously experienced tasks in external memory. It enables few-shot learning in a memory-augmented way by leveraging the meta-learning setting and learns to retain and recall the learning process of past training tasks for gradient descent optimization. By doing so, EMO nudges the parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative. Additionally, we prove theoretically that our algorithm converges for smooth, strongly convex objectives. EMO is generic, flexible, and model-agnostic, making it a simple plug-and-play optimizer that can be seamlessly embedded into existing optimization-based meta-learning approaches. Empirically, EMO scales well on most few-shot classification benchmarks, and our experiments show that optimization-based meta-learning methods enjoy accelerated convergence and improved performance with EMO.},
note = {Oral presentation, top 12 papers.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
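EMO's recall of past gradients can be caricatured in a few lines: the toy optimizer below stores the gradients of recent tasks and averages them into each update. This is only a schematic of the memory mechanism; the published optimizer's retention and recall are more elaborate.

import torch

class EMOSketch:
    # Toy episodic-memory optimizer: average the current gradient with
    # the stored gradients of recent tasks before each parameter update.
    def __init__(self, params, lr=0.01, memory_size=10):
        self.params = list(params)
        self.lr = lr
        self.memory = []
        self.memory_size = memory_size

    def step(self):
        grads = [p.grad.detach().clone() for p in self.params]
        self.memory = (self.memory + [grads])[-self.memory_size:]
        for i, p in enumerate(self.params):
            recalled = torch.stack([g[i] for g in self.memory]).mean(dim=0)
            p.data -= self.lr * recalled   # update nudged by remembered gradients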
|
 | Wenfang Sun, Yingjun Du, Xiantong Zhen, Fan Wang, Ling Wang, Cees G M Snoek: MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks. In: ICML, 2023. @inproceedings{SunICML2023,
title = {MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks},
author = {Wenfang Sun and Yingjun Du and Xiantong Zhen and Fan Wang and Ling Wang and Cees G M Snoek},
url = {https://arxiv.org/abs/2305.10309
https://github.com/lmsdss/MetaModulation},
year = {2023},
date = {2023-04-25},
urldate = {2023-04-25},
booktitle = {ICML},
abstract = {Meta-learning algorithms are able to learn a new task using previously learned knowledge, but they often require a large number of meta-training tasks, which may not be readily available. To address this issue, we propose a method for few-shot learning with fewer tasks, which we call MetaModulation. The key idea is to use a neural network to increase the density of the meta-training tasks by modulating batch normalization parameters during meta-training. Additionally, we modify parameters at various network levels, rather than just a single layer, to increase task diversity. To account for the uncertainty caused by the limited number of training tasks, we propose a variational MetaModulation in which the modulation parameters are treated as latent variables. We also introduce variational feature hierarchies learned by the variational MetaModulation, which modulate features at all layers, account for task uncertainty, and generate more diverse tasks. The ablation studies illustrate the advantages of utilizing learnable task modulation at different levels and demonstrate the benefit of incorporating probabilistic variants in few-task meta-learning. Our MetaModulation and its variational variants consistently outperform state-of-the-art alternatives on four few-task meta-learning benchmarks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
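The task-densification idea can be sketched as a learned mixing of two sampled tasks' features; the gating network below is a hypothetical stand-in for the paper's batch-normalization modulation.

import torch
import torch.nn as nn

class TaskModulator(nn.Module):
    # Synthesize a new "task" by gating between the features of two
    # sampled tasks at one network level, densifying the task space.
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, feat_a, feat_b):       # features from two tasks
        lam = torch.sigmoid(self.gate(torch.cat([feat_a, feat_b], dim=-1)))
        return lam * feat_a + (1 - lam) * feat_b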
|
 | Yan Zhang, David W Zhang, Simon Lacoste-Julien, Gertjan J Burghouts, Cees G M Snoek: Unlocking Slot Attention by Changing Optimal Transport Costs. In: ICML, 2023. @inproceedings{ZhangICML2023,
title = {Unlocking Slot Attention by Changing Optimal Transport Costs},
author = {Yan Zhang and David W Zhang and Simon Lacoste-Julien and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2301.13197
https://github.com/davzha/MESH},
year = {2023},
date = {2023-04-24},
urldate = {2023-04-24},
booktitle = {ICML},
abstract = {Slot attention is a powerful method for object-centric modeling in images and videos. However, its set-equivariance limits its ability to handle videos with a dynamic number of objects because it cannot break ties. To overcome this limitation, we first establish a connection between slot attention and optimal transport. Based on this new perspective we propose MESH (Minimize Entropy of Sinkhorn): a cross-attention module that combines the tiebreaking properties of unregularized optimal transport with the speed of regularized optimal transport. We evaluate slot attention using MESH on multiple object-centric learning benchmarks and find significant improvements over slot attention in every setting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
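The connection to optimal transport suggests replacing slot attention's softmax normalization with a transport plan. Below is a minimal log-domain Sinkhorn sketch with uniform marginals; MESH's entropy-minimization step is omitted, so this shows only the regularized-transport half of the idea.

import math
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    # Log-domain Sinkhorn with uniform marginals: returns an
    # entropy-regularized transport plan for the given cost matrix.
    n_r, n_c = cost.shape
    log_a = torch.full((n_r,), -math.log(n_r))   # uniform row marginal
    log_b = torch.full((n_c,), -math.log(n_c))   # uniform column marginal
    log_k = -cost / eps
    u = torch.zeros(n_r)
    v = torch.zeros(n_c)
    for _ in range(n_iters):
        u = log_a - torch.logsumexp(log_k + v[None, :], dim=1)
        v = log_b - torch.logsumexp(log_k + u[:, None], dim=0)
    return torch.exp(log_k + u[:, None] + v[None, :])

slots, inputs = torch.randn(4, 32), torch.randn(10, 32)
plan = sinkhorn(torch.cdist(slots, inputs))   # (4, 10) attention-like plan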
|
 | Shuo Chen, Yingjun Du, Pascal Mettes, Cees G M Snoek: Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation. In: ICMR, 2023, (Oral presentation.). @inproceedings{ChenICMR2023,
title = {Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation},
author = {Shuo Chen and Yingjun Du and Pascal Mettes and Cees G M Snoek},
url = {https://arxiv.org/abs/2306.10122},
year = {2023},
date = {2023-04-03},
urldate = {2023-04-03},
booktitle = {ICMR},
abstract = {This paper investigates the problem of scene graph generation in videos, where the goal is to capture semantic relations between subjects and objects in the form of ⟨subject, predicate, object⟩ triplets. Recognizing the predicate between subject and object pairs is imbalanced and multi-label in nature, ranging from ubiquitous interactions, such as the spatial relationship in_front_of, to rare interactions, such as twisting. In popular benchmarks such as Action Genome and VidOR, the imbalance ratio between the most and least frequent predicates is 3218 and 3408, respectively, far higher even than benchmarks specifically designed to address long-tailed recognition. Due to these long-tailed distributions and label co-occurrences, recent state-of-the-art methods rely heavily on the most frequently occurring predicate classes, ignoring predicate classes in the long tail. In this paper, we analyze the limitations of current approaches for scene graph generation in videos and find a one-to-one correspondence between predicate frequency and recall performance. To take a step towards unbiased scene graph generation in videos, we introduce a multi-label meta-learning framework to deal with the biased predicate distribution. Our meta-learning framework learns a meta-weight network for each training sample over all possible label losses. We evaluate our approach on the Action Genome and VidOR benchmarks by building on two current state-of-the-art methods for each benchmark. The experiments confirm that our multi-label meta-weight network improves the performance for predicates in the long tail without hampering performance for head classes, resulting in better overall performance and favorable generalizability.},
note = {Oral presentation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
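A compact sketch of a meta-weight network: it maps each per-label loss to a weight in [0, 1] and reweights the multi-label objective. The architecture and input featurization are assumptions for illustration, not the paper's exact network.

import torch
import torch.nn as nn

class MetaWeightNet(nn.Module):
    # Map each per-label loss value to a weight in [0, 1] and return the
    # reweighted multi-label objective.
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, per_label_losses):      # (B, num_predicates)
        w = self.net(per_label_losses.unsqueeze(-1)).squeeze(-1)
        return (w * per_label_losses).mean()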
|
 | Vincent Tao Hu, David W Zhang, Yuki M Asano, Gertjan J Burghouts, Cees G M Snoek: Self-Guided Diffusion Models. In: CVPR, 2023. @inproceedings{HuCVPR2023,
title = {Self-Guided Diffusion Models},
author = {Vincent Tao Hu and David W Zhang and Yuki M Asano and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2210.06462
http://taohu.me/sgdm/},
year = {2023},
date = {2023-02-28},
urldate = {2023-02-28},
booktitle = {CVPR},
abstract = {Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large number of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale.},
howpublished = {arXiv:2210.06462},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
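Self-labeled guidance can be sketched in the classifier-free style, with the condition supplied by a self-annotation function (for example a cluster assignment) rather than a human label; the model signature below is hypothetical.

def self_guided_eps(model, x_t, t, self_label, w=2.0):
    # Guided noise prediction: `model` is assumed to accept an optional
    # condition (here a self-supervised label) and return an epsilon
    # estimate; `w` is the guidance scale.
    eps_cond = model(x_t, t, cond=self_label)
    eps_uncond = model(x_t, t, cond=None)
    return eps_uncond + w * (eps_cond - eps_uncond)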
|
 | Piyush Bagad, Makarand Tapaswi, Cees G M Snoek: Test of Time: Instilling Video-Language Models with a Sense of Time. In: CVPR, 2023. @inproceedings{BagadCVPR2023,
title = {Test of Time: Instilling Video-Language Models with a Sense of Time},
author = {Piyush Bagad and Makarand Tapaswi and Cees G M Snoek},
url = {https://arxiv.org/abs/2301.02074
https://bpiyush.github.io/testoftime-website/
https://github.com/bpiyush/TestOfTime},
year = {2023},
date = {2023-02-28},
urldate = {2023-02-28},
booktitle = {CVPR},
abstract = {Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.},
howpublished = {arXiv:2301.02074},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
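One way to picture the temporal adaptation recipe is a contrastive loss that prefers a caption with the correct before/after order over the same caption with the events flipped; this sketch is a schematic reading, not the paper's exact objective.

import torch
import torch.nn.functional as F

def time_order_loss(video_emb, text_correct, text_flipped, tau=0.07):
    # Prefer the caption whose before/after order matches the video over
    # the same caption with the two events flipped.
    pos = F.cosine_similarity(video_emb, text_correct) / tau
    neg = F.cosine_similarity(video_emb, text_flipped) / tau
    logits = torch.stack([pos, neg], dim=1)                 # (B, 2)
    target = torch.zeros(len(video_emb), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, target)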
|
 | Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G M Snoek: SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail. In: CVPR, 2023. @inproceedings{DuCVPR2023,
title = {SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail},
author = {Yingjun Du and Jiayi Shen and Xiantong Zhen and Cees G M Snoek},
url = {https://arxiv.org/abs/2304.00101
https://github.com/YDU-uva/SuperDisco},
year = {2023},
date = {2023-02-28},
urldate = {2023-02-28},
booktitle = {CVPR},
abstract = {Modern image classifiers perform well on populated classes, while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. Consistent state-of-the-art results on the long-tailed CIFAR-100, ImageNet, Places and iNaturalist benchmarks demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
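A toy version of the super-class rectification: image features attend to super-class prototypes that are themselves refined by message passing over the discovered graph. Shapes and the residual update rule are illustrative assumptions.

import torch

def refine_with_superclasses(x, super_protos, adj, steps=2):
    # x: (B, D) image features; super_protos: (S, D) super-class
    # prototypes; adj: (S, S) adjacency of the super-class graph.
    for _ in range(steps):
        super_protos = adj @ super_protos                  # graph message passing
        attn = torch.softmax(x @ super_protos.T, dim=-1)   # super-class relevance
        x = x + attn @ super_protos                        # residual rectification
    return x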
|
 | Hossein Mirzaei, Mohammadreza Salehi, Sajjad Shahabi, Efstratios Gavves, Cees G M Snoek, Mohammad Sabokrou, Mohammad Hossein Rohban: Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection. In: ICLR, 2023. @inproceedings{MirzaeiICLR2023,
title = {Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection},
author = {Hossein Mirzaei and Mohammadreza Salehi and Sajjad Shahabi and Efstratios Gavves and Cees G M Snoek and Mohammad Sabokrou and Mohammad Hossein Rohban},
url = {https://arxiv.org/abs/2205.14297},
year = {2023},
date = {2023-01-21},
urldate = {2023-01-21},
booktitle = {ICLR},
abstract = {We aim for image-based novelty detection. Despite considerable progress, existing models either fail or face a dramatic drop under the so-called "near-distribution" setting, where the differences between normal and anomalous samples are subtle. We first demonstrate that existing methods experience up to a 20% decrease in performance in the near-distribution setting. Next, we propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data. Our model is then fine-tuned to distinguish such data from the normal samples. We provide a quantitative as well as qualitative evaluation of this strategy, and compare the results with a variety of GAN-based models. The effectiveness of our method for both near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control. This reveals that our method considerably improves over existing models, and consistently decreases the gap between the near-distribution and standard novelty detection performance. The code repository is available at https://github.com/rohban-lab/FITYMI.},
howpublished = {arXiv:2205.14297},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
|
 | Zehao Xiao, Xiantong Zhen, Shengcai Liao, Cees G M Snoek: Energy-Based Test Sample Adaptation for Domain Generalization. In: ICLR, 2023. @inproceedings{XiaoICLR2023,
title = {Energy-Based Test Sample Adaptation for Domain Generalization},
author = {Zehao Xiao and Xiantong Zhen and Shengcai Liao and Cees G M Snoek},
url = {https://arxiv.org/abs/2302.11215
https://github.com/zzzx1224/EBTSA-ICLR2023},
year = {2023},
date = {2023-01-21},
urldate = {2023-01-21},
booktitle = {ICLR},
abstract = {In this paper, we propose energy-based sample adaptation at test time for domain generalization. Where previous works adapt their models to target domains, we adapt the unseen target samples to source-trained models. To this end, we design a discriminative energy-based model, which is trained on source domains to jointly model the conditional distribution for classification and data distribution for sample adaptation. The model is optimized to simultaneously learn a classifier and an energy function. To adapt target samples to source distributions, we iteratively update the samples by energy minimization with stochastic gradient Langevin dynamics. Moreover, to preserve the categorical information in the sample during adaptation, we introduce a categorical latent variable into the energy-based model. The latent variable is learned from the original sample before adaptation by variational inference and fixed as a condition to guide the sample update. Experiments on six benchmarks for classification of images and microblog threads demonstrate the effectiveness of our proposal. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
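The sample-adaptation step can be sketched as stochastic gradient Langevin dynamics on the input, given any energy function trained on the source domains; the step sizes and noise scale below are arbitrary illustrative values.

import torch

def adapt_sample(x, energy_fn, steps=20, step_size=0.01, noise_scale=0.005):
    # Move a test sample toward the source distribution by minimizing its
    # energy with stochastic gradient Langevin dynamics.
    x = x.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
        x = x - step_size * grad + noise_scale * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()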
|
 | Wim Bernasco, Evelien Hoeben, Dennis Koelma, Lasse Suonperä Liebst, Josephine Thomas, Joska Appelman, Cees Snoek, Marie Rosenkrantz Lindegaard: Promise Into Practice: Application of Computer Vision in Empirical Research on Social Distancing. In: Sociological Methods and Research, vol. 52, iss. 3, pp. 1239–1287, 2023. @article{BernascoSMR2023,
title = {Promise Into Practice: Application of Computer Vision in Empirical Research on Social Distancing},
author = {Wim Bernasco and Evelien Hoeben and Dennis Koelma and Lasse Suonperä Liebst and Josephine Thomas and Joska Appelman and Cees Snoek and Marie Rosenkrantz Lindegaard},
url = {https://osf.io/ex9fy/},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {Sociological Methods and Research},
volume = {52},
issue = {3},
pages = {1239–1287},
abstract = {Social scientists increasingly use video data, but large-scale analysis of its content is often constrained by scarce manual coding resources. Upscaling may be possible with the application of automated coding procedures, which are being developed in the field of computer vision. Here, we introduce computer vision to social scientists, review the state-of-the-art in relevant subfields, and provide a working example of how computer vision can be applied in empirical sociological work. Our application involves defining a ground truth by human coders, developing an algorithm for automated coding, testing the performance of the algorithm against the ground truth, and running the algorithm on a large-scale dataset of CCTV images. The working example concerns monitoring social distancing behavior in public space over more than a year of the COVID-19 pandemic. Finally, we discuss prospects for the use of computer vision in empirical social science research and address technical and ethical limitations.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
|
2022
|
 | David W Zhang, Gertjan J Burghouts, Cees G M Snoek: Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets. In: LoG, 2022. @inproceedings{ZhangLOG2022,
title = {Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets},
author = {David W Zhang and Gertjan J Burghouts and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/zhang-hypergraphs-log2022.pdf
https://github.com/davzha/recurrently_predicting_hypergraphs},
year = {2022},
date = {2022-12-09},
urldate = {2022-12-09},
booktitle = {LoG},
abstract = {This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the positive edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
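The first contribution, supervising positive hyperedges only, can be caricatured by scoring candidate edges as set aggregates of their member entities, so memory tracks the number of candidates actually scored rather than all possible hyperedges; the scorer below is a hypothetical stand-in.

import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    # Score a candidate hyperedge as a set aggregate of its member
    # entities; only the scored candidates occupy memory.
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, entity_feats, edge_members):
        # edge_members: list of LongTensors, one per candidate hyperedge
        agg = torch.stack([entity_feats[idx].mean(dim=0) for idx in edge_members])
        return self.score(agg).squeeze(-1)    # one logit per candidate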
|
 | Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees G. M. Snoek: Variational Model Perturbation for Source-Free Domain Adaptation. In: NeurIPS, 2022. @inproceedings{JingNeurIPS2022,
title = {Variational Model Perturbation for Source-Free Domain Adaptation},
author = {Mengmeng Jing and Xiantong Zhen and Jingjing Li and Cees G. M. Snoek},
url = {https://github.com/mmjing/Variational_Model_Perturbation
https://arxiv.org/abs/2210.10378},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS},
abstract = {We aim for source-free domain adaptation, where the task is to deploy a model pre-trained on source domains to target domains. The challenges stem from the distribution shift from the source to the target domain, coupled with the unavailability of any source data and labeled target data for optimization. Rather than fine-tuning the model by updating the parameters, we propose to perturb the source model to achieve adaptation to target domains. We introduce perturbations into the model parameters by variational Bayesian inference in a probabilistic framework. By doing so, we can effectively adapt the model to the target domain while largely preserving the discriminative ability. Importantly, we demonstrate the theoretical connection to learning Bayesian neural networks, which proves the generalizability of the perturbed model to target domains. To enable more efficient optimization, we further employ a parameter sharing strategy, which substantially reduces the learnable parameters compared to a fully Bayesian neural network. Our model perturbation provides a new probabilistic way for domain adaptation which enables efficient adaptation to target domains while maximally preserving knowledge in source models. Experiments on several source-free benchmarks under three different evaluation settings verify the effectiveness of the proposed variational model perturbation for source-free domain adaptation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
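Model perturbation admits a very small sketch: draw one Monte Carlo perturbation of every parameter of the source model. In the paper the perturbation scales are learned variationally and shared across parameter groups; here a single fixed scale is used for illustration.

import math
import torch

def perturb_parameters(model, log_sigma=-4.0):
    # One Monte Carlo draw theta' = theta + sigma * eps for every
    # parameter of the source model; sigma is fixed here, whereas the
    # paper learns it by variational inference.
    sigma = math.exp(log_sigma)
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))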
|
 | Jiayi Shen, Zehao Xiao, Xiantong Zhen, Cees G. M. Snoek, Marcel Worring: Association Graph Learning for Multi-Task Classification with Category Shifts. In: NeurIPS, 2022. @inproceedings{ShenNeurIPS2022,
title = {Association Graph Learning for Multi-Task Classification with Category Shifts},
author = {Jiayi Shen and Zehao Xiao and Xiantong Zhen and Cees G. M. Snoek and Marcel Worring},
url = {https://arxiv.org/abs/2210.04637
https://github.com/autumn9999/MTC-with-Category-Shifts.git},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS},
abstract = {In this paper, we focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously. In particular, we tackle a new setting, which is more realistic than currently addressed in the literature, where categories shift from training to test data. Hence, individual tasks do not contain complete training data for the categories in the test set. To generalize to such test data, it is crucial for individual tasks to leverage knowledge from related tasks. To this end, we propose learning an association graph to transfer knowledge among tasks for missing classes. We construct the association graph with nodes representing tasks, classes and instances, and encode the relationships among the nodes in the edges to guide their mutual knowledge transfer. By message passing on the association graph, our model enhances the categorical information of each instance, making it more discriminative. To avoid spurious correlations between task and class nodes in the graph, we introduce an assignment entropy maximization that encourages each class node to balance its edge weights. This enables all tasks to fully utilize the categorical information from related tasks. An extensive evaluation on three general benchmarks and a medical dataset for skin lesion classification reveals that our method consistently performs better than representative baselines.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
|
 | Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, Cees G M Snoek: How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?. In: ECCV, 2022. @inproceedings{ThokerECCV2022,
title = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
author = {Fida Mohammad Thoker and Hazel Doughty and Piyush Bagad and Cees G M Snoek},
url = {https://arxiv.org/abs/2203.14221
https://bpiyush.github.io/SEVERE-website/
https://github.com/fmthoker/SEVERE-BENCHMARK},
year = {2022},
date = {2022-10-24},
urldate = {2022-10-24},
booktitle = {ECCV},
abstract = {Despite the recent success of video self-supervised learning, there is still much to be understood about its generalization capability. In this paper, we investigate how sensitive video self-supervised learning is to the currently used benchmark convention and whether methods generalize beyond the canonical evaluation setting. We do this across four different factors of sensitivity: domain, samples, actions and task. Our comprehensive set of over 500 experiments, which encompasses 7 video datasets, 9 self-supervised methods and 6 video understanding tasks, reveals that current benchmarks in video self-supervised learning are not a good indicator of generalization along these sensitivity factors. Further, we find that self-supervised methods considerably lag behind vanilla supervised pre-training, especially when domain shift is large and the amount of available downstream samples is low. From our analysis we distill the SEVERE-benchmark, a subset of our experiments, and discuss its implications for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
|
 | Pengwan Yang, Yuki M Asano, Pascal Mettes, Cees G M Snoek: Less than Few: Self-Shot Video Instance Segmentation. In: ECCV, 2022. @inproceedings{YangECCV22,
title = {Less than Few: Self-Shot Video Instance Segmentation},
author = {Pengwan Yang and Yuki M Asano and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/yang-selfshot-eccv2022.pdf
https://github.com/PengWan-Yang/self-shot},
year = {2022},
date = {2022-10-24},
urldate = {2022-10-24},
booktitle = {ECCV},
abstract = {The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle to provide a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel-level across the spatial and temporal domains. We provide strong baseline performances that utilize a novel transformer-based model and show that self-shot learning can even surpass few-shot and can be positively combined for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive to oracle support in some settings, scales to large unlabelled video collections, and can be combined in a semi-supervised setting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
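Self-shot support mining reduces, in caricature, to nearest-neighbour retrieval in a self-supervised embedding space; the function below is an illustrative sketch, not the paper's transformer-based model.

import torch
import torch.nn.functional as F

def retrieve_supports(query_emb, pool_embs, k=5):
    # Pick the k nearest unlabelled videos to the query in a
    # self-supervised embedding space (cosine similarity).
    q = F.normalize(query_emb, dim=-1)        # (D,)
    pool = F.normalize(pool_embs, dim=-1)     # (N, D)
    sims = pool @ q                           # (N,)
    return sims.topk(k).indices               # indices of mined supports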
|
 | Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Tom van Sonsbeek, Xiantong Zhen, Dwarikanath Mahapatra, Marcel Worring, Cees G M Snoek: LifeLonger: A Benchmark for Continual Disease Classification. In: MICCAI, Singapore, 2022. @inproceedings{DerakhshaniMICCAI2022,
title = {LifeLonger: A Benchmark for Continual Disease Classification},
author = {Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Tom van Sonsbeek and Xiantong Zhen and Dwarikanath Mahapatra and Marcel Worring and Cees G M Snoek},
url = {https://arxiv.org/abs/2204.05737
https://github.com/mmderakhshani/LifeLonger},
year = {2022},
date = {2022-09-18},
urldate = {2022-09-18},
booktitle = {MICCAI},
address = {Singapore},
abstract = {Deep learning models have shown great effectiveness in recognizing findings in medical images. However, they cannot handle the ever-changing clinical environment, which brings newly annotated medical data from different sources. To exploit the incoming streams of data, these models would benefit largely from sequentially learning from new samples, without forgetting the previously obtained knowledge.
In this paper we introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection, by applying existing state-of-the-art continual learning methods. In particular, we consider three continual learning scenarios, namely, task and class incremental learning and the newly defined cross-domain incremental learning. Task and class incremental learning of diseases address the issue of classifying new samples without re-training the models from scratch, while cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge. We perform a thorough analysis of the performance and examine how the well-known challenges of continual learning, such as catastrophic forgetting, exhibit themselves in this setting. The encouraging results demonstrate that continual learning has a major potential to advance disease classification and to produce a more robust and efficient learning framework for clinical settings. The code repository, data partitions and baseline results for the complete benchmark are publicly available.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
|
 | Hazel Doughty, Cees G M Snoek: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs. In: CVPR, 2022. @inproceedings{DoughtyCVPR2022,
title = {How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs},
author = {Hazel Doughty and Cees G M Snoek},
url = {https://hazeldoughty.github.io/Papers/PseudoAdverbs/PseudoAdverbs.pdf
https://hazeldoughty.github.io/Papers/PseudoAdverbs/
https://github.com/hazeld/pseudoadverbs},
year = {2022},
date = {2022-06-03},
urldate = {2022-06-03},
booktitle = {CVPR},
abstract = {We aim to understand how actions are performed and identify subtle differences, such as `fold firmly' vs. `fold gently'. To this end, we propose a method which recognizes adverbs across different actions. However, such fine-grained annotations are difficult to obtain and their long-tailed nature makes it challenging to recognize adverbs in rare action-adverb compositions. Our approach therefore uses semi-supervised learning with multiple adverb pseudo-labels to leverage videos with only action labels. Combined with adaptive thresholding of these pseudo-adverbs we are able to make efficient use of the available data while tackling the long-tailed distribution. Additionally, we gather adverb annotations for three existing video retrieval datasets, which allows us to introduce the new tasks of recognizing adverbs in unseen action-adverb compositions and unseen domains. Experiments demonstrate the effectiveness of our method, which outperforms prior work in recognizing adverbs and semi-supervised works adapted for adverb recognition. We also show how adverbs can relate fine-grained actions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
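The adaptive thresholding of pseudo-adverbs can be sketched as per-class thresholds tracking a running mean of the predicted confidences; the momentum update below is an assumed simplification of the paper's scheme.

import torch

def pseudo_adverb_labels(probs, thresholds, momentum=0.9):
    # probs: (B, num_adverbs) predicted confidences on unlabelled clips;
    # thresholds: (num_adverbs,) per-class adaptive thresholds.
    thresholds = momentum * thresholds + (1 - momentum) * probs.mean(dim=0)
    labels = probs >= thresholds              # multi-label pseudo-annotations
    return labels, thresholds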
|
 | Duy-Kien Nguyen, Jihong Ju, Olaf Booij, Martin R Oswald, Cees G M Snoek: BoxeR: Box-Attention for 2D and 3D Transformers. In: CVPR, 2022. @inproceedings{NguyenCVPR2022,
title = {BoxeR: Box-Attention for 2D and 3D Transformers},
author = {Duy-Kien Nguyen and Jihong Ju and Olaf Booij and Martin R Oswald and Cees G M Snoek},
url = {https://arxiv.org/abs/2111.13087
https://github.com/kienduynguyen/BoxeR},
year = {2022},
date = {2022-06-02},
urldate = {2022-06-02},
booktitle = {CVPR},
abstract = {In this paper, we propose a simple attention mechanism, which we call Box-Attention. It enables spatial interaction between grid features, as sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by considering their grid structure. Notably, BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for 3D end-to-end object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves better results on COCO detection, and reaches comparable performance with the well-established and highly-optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains a compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we propose a simple attention mechanism that we call Box-Attention. It enables spatial interaction between grid features, sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by considering their grid structure. Notably, BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for 3D end-to-end object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves better results on COCO detection and reaches comparable performance with the well-established and highly-optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains a compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released. |
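As a rough illustration of box-attention, the sketch below predicts a box as a transformation of a reference window, bilinearly samples a grid of features inside it, and attends over that grid. It is a simplification for exposition; box_head and the single-head attention are stand-ins, not the released BoxeR code.

import torch
import torch.nn.functional as F

def box_attention(query, feature_map, ref_box, box_head, m=7):
    # query: (B, C); feature_map: (B, C, H, W);
    # ref_box: (B, 4) reference window as (cx, cy, w, h) in [-1, 1] coordinates.
    B, C, H, W = feature_map.shape
    box = ref_box + box_head(query)  # predicted transformation of the reference window
    cx, cy, w, h = box.unbind(-1)
    # An m x m sampling grid spanning the predicted box.
    ys = torch.linspace(-0.5, 0.5, m, device=query.device)
    gy, gx = torch.meshgrid(ys, ys, indexing='ij')
    grid = torch.stack((cx[:, None, None] + gx * w[:, None, None],
                        cy[:, None, None] + gy * h[:, None, None]), dim=-1)
    values = F.grid_sample(feature_map, grid, align_corners=False)  # (B, C, m, m)
    values = values.flatten(2).transpose(1, 2)                      # (B, m*m, C)
    # Attention weights over the grid of sampled box features.
    attn = torch.softmax(values @ query.unsqueeze(-1) / C ** 0.5, dim=1)
    return (attn * values).sum(dim=1)  # (B, C) attended feature per query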
 | Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G M Snoek: Audio-Adaptive Activity Recognition Across Video Domains. In: CVPR, 2022. @inproceedings{ZhangCVPR2022,
title = {Audio-Adaptive Activity Recognition Across Video Domains},
author = {Yunhua Zhang and Hazel Doughty and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2203.14240
https://xiaobai1217.github.io/DomainAdaptation/
https://github.com/xiaobai1217/DomainAdaptation},
year = {2022},
date = {2022-06-02},
urldate = {2022-06-02},
booktitle = {CVPR},
abstract = {This paper strives for activity recognition under domain shift, for example caused by a change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works, we leverage activity sounds for domain adaptation, as they have less variance across domains and can reliably indicate which activities are not happening. We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation as well as address shifts in the semantic distribution. To further eliminate domain-specific features and include domain-invariant activity sounds for recognition, an audio-infused recognizer is proposed, which effectively models the cross-modal interaction across domains. We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically. Experiments on this dataset, EPIC-Kitchens and CharadesEgo show the effectiveness of our approach.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for activity recognition under domain shift, for example caused by a change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works, we leverage activity sounds for domain adaptation, as they have less variance across domains and can reliably indicate which activities are not happening. We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation as well as address shifts in the semantic distribution. To further eliminate domain-specific features and include domain-invariant activity sounds for recognition, an audio-infused recognizer is proposed, which effectively models the cross-modal interaction across domains. We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically. Experiments on this dataset, EPIC-Kitchens and CharadesEgo show the effectiveness of our approach. |
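A hedged sketch of the central idea: let the audio stream, which varies less across domains, modulate the visual features. The FiLM-style block below is a deliberate simplification; the paper's audio-adaptive encoder and audio-infused recognizer are more involved.

import torch.nn as nn

class AudioAdaptiveBlock(nn.Module):
    def __init__(self, dim_v, dim_a):
        super().__init__()
        self.gamma = nn.Linear(dim_a, dim_v)  # audio-conditioned scale
        self.beta = nn.Linear(dim_a, dim_v)   # audio-conditioned shift

    def forward(self, visual_feat, audio_feat):
        # visual_feat: (B, dim_v); audio_feat: (B, dim_a).
        # The audio signal steers the visual representation.
        return self.gamma(audio_feat) * visual_feat + self.beta(audio_feat)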
 | Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G M Snoek, Joseph Tighe: TubeR: Tubelet Transformer for Video Action Detection. In: CVPR, 2022, (Oral presentation, top 4.2%). @inproceedings{ZhaoCVPR2022,
title = {TubeR: Tubelet Transformer for Video Action Detection},
author = {Jiaojiao Zhao and Yanyi Zhang and Xinyu Li and Hao Chen and Shuai Bing and Mingze Xu and Chunhui Liu and Kaustav Kundu and Yuanjun Xiong and Davide Modolo and Ivan Marsic and Cees G M Snoek and Joseph Tighe},
url = {https://arxiv.org/abs/2104.00969
https://github.com/amazon-research/tubelet-transformer},
year = {2022},
date = {2022-06-01},
urldate = {2022-06-01},
booktitle = {CVPR},
abstract = {We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21.},
note = {Oral presentation, top 4.2%},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21. |
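The tubelet-query mechanism can be pictured as a DETR-style decoder over spatio-temporal tokens, as in the sketch below. Module names and sizes are illustrative assumptions; the context-aware classification head and action switch head are omitted.

import torch
import torch.nn as nn

class TubeletDecoder(nn.Module):
    def __init__(self, num_queries=16, dim=256, num_classes=80, num_frames=8):
        super().__init__()
        # A fixed set of learned tubelet queries, decoded against the video.
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.box_head = nn.Linear(dim, num_frames * 4)   # one box per frame per tubelet
        self.cls_head = nn.Linear(dim, num_classes + 1)  # action classes plus "no action"

    def forward(self, video_tokens):
        # video_tokens: (B, T*H*W, dim) flattened spatio-temporal features.
        q = self.queries.weight.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        h = self.decoder(q, video_tokens)                # (B, num_queries, dim)
        return self.box_head(h).sigmoid(), self.cls_head(h)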
 | Yingjun Du, Xiantong Zhen, Ling Shao, Cees G M Snoek: Hierarchical Variational Memory for Few-shot Learning Across Domains. In: ICLR, Virtual, 2022. @inproceedings{DuICLR2022,
title = {Hierarchical Variational Memory for Few-shot Learning Across Domains},
author = {Yingjun Du and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2112.08181
https://github.com/YDU-uva/HVM},
year = {2022},
date = {2022-04-25},
urldate = {2022-04-25},
booktitle = {ICLR},
address = {Virtual},
abstract = {Neural memory enables fast adaptation to new tasks with just a few training samples. Existing memory models store features only from the single last layer, which does not generalize well in the presence of a domain shift between training and test distributions. Rather than relying on a flat memory, we propose a hierarchical alternative that stores features at different semantic levels. We introduce a hierarchical prototype model, where each level of the prototype fetches corresponding information from the hierarchical memory. The model is endowed with the ability to flexibly rely on features at different semantic levels if the domain shift so demands. We meta-learn the model by a newly derived hierarchical variational inference framework, where hierarchical memory and prototypes are jointly optimized. To explore and exploit the importance of different semantic levels, we further propose to learn the weights associated with the prototype at each level in a data-driven way, which enables the model to adaptively choose the most generalizable features. We conduct thorough ablation studies to demonstrate the effectiveness of each component in our model. The new state-of-the-art performance on cross-domain few-shot classification and competitive performance on traditional few-shot classification further substantiate the benefit of hierarchical variational memory.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Neural memory enables fast adaptation to new tasks with just a few training samples. Existing memory models store features only from the single last layer, which does not generalize well in the presence of a domain shift between training and test distributions. Rather than relying on a flat memory, we propose a hierarchical alternative that stores features at different semantic levels. We introduce a hierarchical prototype model, where each level of the prototype fetches corresponding information from the hierarchical memory. The model is endowed with the ability to flexibly rely on features at different semantic levels if the domain shift so demands. We meta-learn the model by a newly derived hierarchical variational inference framework, where hierarchical memory and prototypes are jointly optimized. To explore and exploit the importance of different semantic levels, we further propose to learn the weights associated with the prototype at each level in a data-driven way, which enables the model to adaptively choose the most generalizable features. We conduct thorough ablation studies to demonstrate the effectiveness of each component in our model. The new state-of-the-art performance on cross-domain few-shot classification and competitive performance on traditional few-shot classification further substantiate the benefit of hierarchical variational memory. |
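A minimal sketch of hierarchical prototypes with learned per-level weights, which is the deterministic skeleton of the idea; the paper's full model derives memory, prototypes and weights within a hierarchical variational inference framework, so every name below is illustrative.

import torch

def hierarchical_prototype_logits(support_feats, support_labels, query_feats, level_weights):
    # support_feats / query_feats: one (N, D) / (Q, D) tensor per semantic level.
    # level_weights: (L,) learnable scores, softmax-normalized per-level importance.
    w = torch.softmax(level_weights, dim=0)
    classes = support_labels.unique()
    logits = 0.0
    for l, (s, q) in enumerate(zip(support_feats, query_feats)):
        # Class prototypes at this level, fetched from this level's features.
        protos = torch.stack([s[support_labels == c].mean(0) for c in classes])
        logits = logits + w[l] * (-torch.cdist(q, protos))  # nearer prototype = higher logit
    return logits  # (Q, num_classes), combined across semantic levels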
 | Zehao Xiao, Xiantong Zhen, Ling Shao, Cees G M Snoek: Learning to Generalize across Domains on Single Test Samples. In: ICLR, Virtual, 2022. @inproceedings{XiaoICLR2022,
title = {Learning to Generalize across Domains on Single Test Samples},
author = {Zehao Xiao and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2202.08045
https://github.com/zzzx1224/SingleSampleGeneralization-ICLR2022},
year = {2022},
date = {2022-04-25},
urldate = {2022-04-25},
booktitle = {ICLR},
address = {Virtual},
abstract = {We strive to learn a model from a set of source domains that generalizes well to unseen target domains. The main challenge in such a domain generalization scenario is the unavailability of any target domain data during training, so the learned model is not explicitly adapted to the unseen target domains. We propose learning to generalize across domains on single test samples. We leverage a meta-learning paradigm to teach the model, at training time, to adapt from single samples, so that at test time it can adapt itself to each single test sample. We formulate the adaptation to the single test sample as a variational Bayesian inference problem, which incorporates the test sample as a conditional into the generation of model parameters. The adaptation to each test sample requires only one feed-forward computation at test time, without any fine-tuning or self-supervised training on additional data from the unseen domains. Extensive ablation studies demonstrate that our model learns the ability to adapt models by mimicking domain shift during training. Further, our model achieves performance that is at least comparable to -- and often better than -- state-of-the-art methods on multiple benchmarks for domain generalization.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We strive to learn a model from a set of source domains that generalizes well to unseen target domains. The main challenge in such a domain generalization scenario is the unavailability of any target domain data during training, so the learned model is not explicitly adapted to the unseen target domains. We propose learning to generalize across domains on single test samples. We leverage a meta-learning paradigm to teach the model, at training time, to adapt from single samples, so that at test time it can adapt itself to each single test sample. We formulate the adaptation to the single test sample as a variational Bayesian inference problem, which incorporates the test sample as a conditional into the generation of model parameters. The adaptation to each test sample requires only one feed-forward computation at test time, without any fine-tuning or self-supervised training on additional data from the unseen domains. Extensive ablation studies demonstrate that our model learns the ability to adapt models by mimicking domain shift during training. Further, our model achieves performance that is at least comparable to -- and often better than -- state-of-the-art methods on multiple benchmarks for domain generalization. |
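As a schematic of single-sample adaptation, the sketch below conditions the classifier on each test sample in one feed-forward pass. It is a deterministic simplification of the paper's variational Bayesian formulation, with illustrative module names.

import torch
import torch.nn as nn

class SampleConditionedClassifier(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.base = nn.Linear(dim, num_classes)
        # Conditioning network: maps a single (test) sample to a feature shift.
        self.cond = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # x: (B, dim). Each sample modulates its own prediction in one
        # feed-forward pass; no fine-tuning or extra target data is needed.
        return self.base(x + self.cond(x))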
 | Yan Zhang, David W Zhang, Simon Lacoste-Julien, Gertjan J Burghouts, Cees G M Snoek: Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation. In: ICLR, Virtual, 2022. @inproceedings{YanZhangICLR2022,
title = {Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation},
author = {Yan Zhang and David W Zhang and Simon Lacoste-Julien and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2111.12193
https://www.youtube.com/watch?v=xfVBZprO7g8
https://github.com/davzha/multiset-equivariance},
year = {2022},
date = {2022-04-24},
urldate = {2022-04-01},
booktitle = {ICLR},
address = {Virtual},
abstract = {Most set prediction models in deep learning use set-equivariant operations, but they actually operate on multisets. We show that set-equivariant functions cannot represent certain functions on multisets, so we introduce the more appropriate notion of multiset-equivariance. We identify that the existing Deep Set Prediction Network (DSPN) can be multiset-equivariant without being hindered by set-equivariance and improve it with approximate implicit differentiation, allowing for better optimization while being faster and saving memory. In a range of toy experiments, we show that the perspective of multiset-equivariance is beneficial and that our changes to DSPN achieve better results in most cases. On CLEVR object property prediction, we substantially improve over the state-of-the-art Slot Attention from 8% to 77% in one of the strictest evaluation metrics because of the benefits made possible by implicit differentiation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Most set prediction models in deep learning use set-equivariant operations, but they actually operate on multisets. We show that set-equivariant functions cannot represent certain functions on multisets, so we introduce the more appropriate notion of multiset-equivariance. We identify that the existing Deep Set Prediction Network (DSPN) can be multiset-equivariant without being hindered by set-equivariance and improve it with approximate implicit differentiation, allowing for better optimization while being faster and saving memory. In a range of toy experiments, we show that the perspective of multiset-equivariance is beneficial and that our changes to DSPN achieve better results in most cases. On CLEVR object property prediction, we substantially improve over the state-of-the-art Slot Attention from 8% to 77% in one of the strictest evaluation metrics because of the benefits made possible by implicit differentiation. |
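The flavour of the improved decoder can be sketched as inner gradient descent on a set, where only the final inner step stays on the outer autograd graph, a cheap stand-in for the paper's approximate implicit differentiation; encoder and the hyperparameters are placeholders.

import torch

def dspn_decode(target_emb, encoder, init_set, inner_steps=10, lr=0.1):
    # Decode a set by minimizing the mismatch between its encoding and the
    # target embedding; only the last step is kept on the outer graph.
    s = init_set
    for i in range(inner_steps):
        last = i == inner_steps - 1
        s_ = s.detach().requires_grad_(True)
        inner_loss = (encoder(s_) - target_emb).pow(2).sum()
        grad, = torch.autograd.grad(inner_loss, s_, create_graph=last)
        s = s_ - lr * grad  # differentiable through encoder only when last is True
    return s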
 | Zenglin Shi, Pascal Mettes, Subhransu Maji, Cees G M Snoek: On Measuring and Controlling the Spectral Bias of the Deep Image Prior. In: International Journal of Computer Vision, vol. 130, pp. 885–908, 2022. @article{ShiIJCV22,
title = {On Measuring and Controlling the Spectral Bias of the Deep Image Prior},
author = {Zenglin Shi and Pascal Mettes and Subhransu Maji and Cees G M Snoek},
url = {https://arxiv.org/abs/2107.01125
https://link.springer.com/article/10.1007/s11263-021-01572-7
https://github.com/shizenglin/Measure-and-Control-Spectral-Bias},
year = {2022},
date = {2022-04-01},
urldate = {2022-02-11},
journal = {International Journal of Computer Vision},
volume = {130},
pages = {885–908},
abstract = {The deep image prior showed that a randomly initialized network with a suitable architecture can be trained to solve inverse imaging problems by simply optimizing its parameters to reconstruct a single degraded image. However, it suffers from two practical limitations. First, it remains unclear how to control the prior beyond the choice of the network architecture. Second, training requires an oracle stopping criterion, as during the optimization the performance degrades after reaching an optimum value. To address these challenges we introduce a frequency-band correspondence measure to characterize the spectral bias of the deep image prior, where low-frequency image signals are learned faster and better than high-frequency counterparts. Based on our observations, we propose techniques to prevent the eventual performance degradation and accelerate convergence. We introduce a Lipschitz-controlled convolution layer and a Gaussian-controlled upsampling layer as plug-in replacements for layers used in the deep architectures. The experiments show that with these changes the performance does not degrade during optimization, relieving us from the need for an oracle stopping criterion. We further outline a stopping criterion to avoid superfluous computation. Finally, we show that our approach obtains favorable results compared to current approaches across various denoising, deblocking, inpainting, super-resolution and detail enhancement tasks. Code is available at https://github.com/shizenglin/Measure-and-Control-Spectral-Bias.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
The deep image prior showed that a randomly initialized network with a suitable architecture can be trained to solve inverse imaging problems by simply optimizing its parameters to reconstruct a single degraded image. However, it suffers from two practical limitations. First, it remains unclear how to control the prior beyond the choice of the network architecture. Second, training requires an oracle stopping criterion, as during the optimization the performance degrades after reaching an optimum value. To address these challenges we introduce a frequency-band correspondence measure to characterize the spectral bias of the deep image prior, where low-frequency image signals are learned faster and better than high-frequency counterparts. Based on our observations, we propose techniques to prevent the eventual performance degradation and accelerate convergence. We introduce a Lipschitz-controlled convolution layer and a Gaussian-controlled upsampling layer as plug-in replacements for layers used in the deep architectures. The experiments show that with these changes the performance does not degrade during optimization, relieving us from the need for an oracle stopping criterion. We further outline a stopping criterion to avoid superfluous computation. Finally, we show that our approach obtains favorable results compared to current approaches across various denoising, deblocking, inpainting, super-resolution and detail enhancement tasks. Code is available at https://github.com/shizenglin/Measure-and-Control-Spectral-Bias. |
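A sketch of a Lipschitz-controlled convolution in the spirit described above: the kernel is rescaled whenever a spectral-norm estimate exceeds a target constant, damping the layer's high-frequency response. The flattened-kernel norm is a common proxy for the convolution's operator norm; the class is illustrative, not the released layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LipschitzConv2d(nn.Conv2d):
    def __init__(self, *args, lipschitz=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.lipschitz = lipschitz  # target Lipschitz constant for the layer

    def forward(self, x):
        # Largest singular value of the flattened kernel, a standard proxy
        # for the convolution's operator norm.
        sigma = torch.linalg.matrix_norm(self.weight.flatten(1), 2)
        # Rescale only when the bound is exceeded, leaving small kernels intact.
        scale = torch.clamp(self.lipschitz / sigma, max=1.0)
        return F.conv2d(x, self.weight * scale, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)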
 | William Thong, Cees G M Snoek: Diversely-Supervised Visual Product Search. In: ACM Transactions on Multimedia Computing, Communications and Applications, vol. 18, no. 1, pp. 1-22, 2022. @article{ThongTOMM22,
title = {Diversely-Supervised Visual Product Search},
author = {William Thong and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/thong-diversely-tomm.pdf
https://doi.org/10.1145/3461646
https://github.com/twuilliam/diverse-search},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
journal = {ACM Transactions on Multimedia Computing, Communications and Applications},
volume = {18},
number = {1},
pages = {1-22},
abstract = {This paper strives for a diversely-supervised visual product search, where queries specify a diverse set of labels to search for. Where previous works have focused on representing attribute, instance or category labels individually, we consider them together to create a diverse set of labels for visually describing products. We learn an embedding from the supervisory signal provided by every label to encode their interrelationships. Once trained, every label has a corresponding visual representation in the embedding space, which is an aggregation of selected items from the training set. At search time, composite query representations retrieve images that match a specific set of diverse labels. We form composite query representations by averaging over the aggregated representations of each diverse label in the specific set. For evaluation, we extend existing product datasets of cars and clothes with a diverse set of labels. Experiments show the benefits of our embedding for diversely-supervised visual product search in seen and unseen product combinations, and for discovering product design styles.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This paper strives for a diversely-supervised visual product search, where queries specify a diverse set of labels to search for. Where previous works have focused on representing attribute, instance or category labels individually, we consider them together to create a diverse set of labels for visually describing products. We learn an embedding from the supervisory signal provided by every label to encode their interrelationships. Once trained, every label has a corresponding visual representation in the embedding space, which is an aggregation of selected items from the training set. At search time, composite query representations retrieve images that match a specific set of diverse labels. We form composite query representations by averaging over the aggregated representations of each diverse label in the specific set. For evaluation, we extend existing product datasets of cars and clothes with a diverse set of labels. Experiments show the benefits of our embedding for diversely-supervised visual product search in seen and unseen product combinations, and for discovering product design styles. |
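The composite querying step reduces to averaging the aggregated representation of each label in the query set and ranking by similarity, as the sketch below shows; label_embs and the cosine ranking are assumptions for illustration.

import torch
import torch.nn.functional as F

def composite_search(label_embs, query_labels, image_embs, topk=5):
    # label_embs: dict mapping a label to its (D,) aggregated embedding,
    # built from selected training items; image_embs: (N, D) catalogue.
    q = torch.stack([label_embs[l] for l in query_labels]).mean(0)
    sims = F.cosine_similarity(image_embs, q.unsqueeze(0), dim=1)
    return sims.topk(topk).indices  # indices of the best-matching products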
2021
|
 | Sander R Klomp, Matthew van Rijn, Rob G J Wijnhoven, Cees G M Snoek, Peter H N de With: Safe Fakes: Evaluating Face Anonymizers for Face Detectors. In: IEEE International Conference on Automatic Face and Gesture Recognition, Jodhpur, India, 2021. @inproceedings{klomp2021safe,
title = {Safe Fakes: Evaluating Face Anonymizers for Face Detectors},
author = {Sander R Klomp and Matthew van Rijn and Rob G J Wijnhoven and Cees G M Snoek and Peter H N de With},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/klomp-safe-fakes-fg2021.pdf},
year = {2021},
date = {2021-12-15},
urldate = {2021-04-23},
booktitle = {IEEE International Conference on Automatic Face and Gesture Recognition},
address = {Jodhpur, India},
abstract = {Since the introduction of the GDPR and CCPA privacy legislation, both public and private facial image datasets are increasingly scrutinized. Several datasets have been taken offline completely and some have been anonymized. However, it is unclear how anonymization impacts face detection performance. To our knowledge, this paper presents the first empirical study on the effect of image anonymization on supervised training of face detectors. We compare conventional face anonymizers with three state-of-the-art Generative Adversarial Network-based (GAN) methods, by training an off-the-shelf face detector on anonymized data. Our experiments investigate the suitability of anonymization methods for maintaining face detector performance, the effect of detectors overtraining on anonymization artefacts, dataset size for training an anonymizer, and the effect of training time of anonymization GANs. A final experiment investigates the correlation between common GAN evaluation metrics and the performance of a trained face detector. Although all tested anonymization methods lower the performance of trained face detectors, faces anonymized using GANs cause far smaller performance degradation than conventional methods. As the most important finding, the best-performing GAN, DeepPrivacy, removes identifiable faces for a face detector trained on anonymized data, resulting in a modest decrease from 91.0 to 88.3 mAP. In the last few years, there have been rapid improvements in realism of GAN-generated faces. We expect that further progression in GAN research will allow the use of Deep Fake technology for privacy-preserving Safe Fakes, without any performance degradation for training face detectors.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Since the introduction of the GDPR and CCPA privacy legislation, both public and private facial image datasets are increasingly scrutinized. Several datasets have been taken offline completely and some have been anonymized. However, it is unclear how anonymization impacts face detection performance. To our knowledge, this paper presents the first empirical study on the effect of image anonymization on supervised training of face detectors. We compare conventional face anonymizers with three state-of-the-art Generative Adversarial Network-based (GAN) methods, by training an off-the-shelf face detector on anonymized data. Our experiments investigate the suitability of anonymization methods for maintaining face detector performance, the effect of detectors overtraining on anonymization artefacts, dataset size for training an anonymizer, and the effect of training time of anonymization GANs. A final experiment investigates the correlation between common GAN evaluation metrics and the performance of a trained face detector. Although all tested anonymization methods lower the performance of trained face detectors, faces anonymized using GANs cause far smaller performance degradation than conventional methods. As the most important finding, the best-performing GAN, DeepPrivacy, removes identifiable faces for a face detector trained on anonymized data, resulting in a modest decrease from 91.0 to 88.3 mAP. In the last few years, there have been rapid improvements in realism of GAN-generated faces. We expect that further progression in GAN research will allow the use of Deep Fake technology for privacy-preserving Safe Fakes, without any performance degradation for training face detectors. |
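Schematically, the evaluation protocol is: anonymize the training faces, train an off-the-shelf detector on the anonymized set, and score on untouched test data. The sketch below uses placeholder objects; anonymizer, detector and their methods are hypothetical, not the paper's tooling.

def evaluate_anonymizer(anonymizer, detector, train_set, test_set):
    # Train on anonymized images with the original box annotations, then
    # measure detection quality (mAP) on untouched test data and compare
    # against a detector trained on the clean images.
    anonymized = [(anonymizer(img), boxes) for img, boxes in train_set]
    detector.fit(anonymized)
    return detector.mean_average_precision(test_set)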
 | Shuo Chen, Pascal Mettes, Cees G M Snoek: Diagnosing Errors in Video Relation Detectors. In: BMVC, Virtual, 2021. @inproceedings{ChenBMVC2021,
title = {Diagnosing Errors in Video Relation Detectors},
author = {Shuo Chen and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/chen-diagnosing-bmvc2021.pdf
https://github.com/shanshuo/DiagnoseVRD},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
booktitle = {BMVC},
address = {Virtual},
abstract = {Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. Despite recent progress in video relation detection, overall performance is still marginal and it remains unclear what the key factors are towards solving the problem. Following examples set in the object detection and action localization literature, we perform a deep dive into the error diagnosis of current video relation detection approaches. We introduce a diagnostic tool for analyzing the sources of detection errors. Our tool evaluates and compares current approaches beyond the single scalar metric of mean Average Precision by defining different error types specific to video relation detection, used for false positive analyses. Moreover, we examine different factors of influence on the performance in a false negative analysis, including relation length, number of subject/object/predicate instances, and subject/object size. Finally, we present the effect on video relation performance when considering an oracle fix for each error type. On two video relation benchmarks, we show where current approaches excel and fall short, allowing us to pinpoint the most important future directions in the field. The tool is available at https://github.com/shanshuo/DiagnoseVRD.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. Despite recent progress in video relation detection, overall performance is still marginal and it remains unclear what the key factors are towards solving the problem. Following examples set in the object detection and action localization literature, we perform a deep dive into the error diagnosis of current video relation detection approaches. We introduce a diagnostic tool for analyzing the sources of detection errors. Our tool evaluates and compares current approaches beyond the single scalar metric of mean Average Precision by defining different error types specific to video relation detection, used for false positive analyses. Moreover, we examine different factors of influence on the performance in a false negative analysis, including relation length, number of subject/object/predicate instances, and subject/object size. Finally, we present the effect on video relation performance when considering an oracle fix for each error type. On two video relation benchmarks, we show where current approaches excel and fall short, allowing us to pinpoint the most important future directions in the field. The tool is available at https://github.com/shanshuo/DiagnoseVRD. |
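In the spirit of the diagnostic tool, each false positive can be assigned to one error source, as in the simplified sketch below; the released tool defines more fine-grained error types specific to video relation detection, so the categories and field names here are illustrative.

def classify_error(pred, gts, iou_fn, iou_thr=0.5):
    # pred / gts: dicts with 'subject' and 'object' trajectories and a
    # 'predicate' label; iou_fn scores spatio-temporal trajectory overlap.
    best = max(gts, key=lambda gt: min(iou_fn(pred['subject'], gt['subject']),
                                       iou_fn(pred['object'], gt['object'])))
    overlap = min(iou_fn(pred['subject'], best['subject']),
                  iou_fn(pred['object'], best['object']))
    if overlap < iou_thr:
        return 'localization error'        # subject or object poorly localized
    if pred['predicate'] != best['predicate']:
        return 'predicate classification error'
    return 'duplicate detection'           # correct, but the match was already claimed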
 | William Thong, Cees G M Snoek: Feature and Label Embedding Spaces Matter in Addressing Image Classifier Bias. In: BMVC, Virtual, 2021. @inproceedings{ThongBMVC2021,
title = {Feature and Label Embedding Spaces Matter in Addressing Image Classifier Bias},
author = {William Thong and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/thong-image-classifier-bias-bmvc2021.pdf
https://github.com/twuilliam/bias-classifiers},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
booktitle = {BMVC},
address = {Virtual},
abstract = {This paper strives to address image classifier bias, with a focus on both feature and label embedding spaces. Previous works have shown that spurious correlations from protected attributes, such as age, gender, or skin tone, can cause adverse decisions. To balance potential harms, there is a growing need to identify and mitigate image classifier bias. First, we identify in the feature space a bias direction. We compute class prototypes of each protected attribute value for every class, and reveal an existing subspace that captures the maximum variance of the bias. Second, we mitigate biases by mapping image inputs to label embedding spaces. Each value of the protected attribute has its own projection head, where classes are embedded through a latent vector representation rather than a common one-hot encoding. Once trained, we further reduce in the feature space the bias effect by removing its direction. Evaluation on biased image datasets, for multi-class, multi-label and binary classifications, shows the effectiveness of tackling both feature and label embedding spaces in improving the fairness of the classifier predictions, while preserving classification performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives to address image classifier bias, with a focus on both feature and label embedding spaces. Previous works have shown that spurious correlations from protected attributes, such as age, gender, or skin tone, can cause adverse decisions. To balance potential harms, there is a growing need to identify and mitigate image classifier bias. First, we identify in the feature space a bias direction. We compute class prototypes of each protected attribute value for every class, and reveal an existing subspace that captures the maximum variance of the bias. Second, we mitigate biases by mapping image inputs to label embedding spaces. Each value of the protected attribute has its own projection head, where classes are embedded through a latent vector representation rather than a common one-hot encoding. Once trained, we further reduce in the feature space the bias effect by removing its direction. Evaluation on biased image datasets, for multi-class, multi-label and binary classifications, shows the effectiveness of tackling both feature and label embedding spaces in improving the fairness of the classifier predictions, while preserving classification performance. |
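A condensed sketch of the two steps for a binary protected attribute: estimate a bias direction from per-attribute class prototypes, then project it out of the features. The paper identifies a full subspace of maximum bias variance; the single direction here is a simplification.

import torch

def bias_direction(feats, labels, attrs):
    # Per-class prototypes for each of the two protected-attribute values.
    protos = [torch.stack([feats[(labels == c) & (attrs == a)].mean(0)
                           for c in labels.unique()])
              for a in attrs.unique()]
    d = (protos[0] - protos[1]).mean(0)  # average prototype gap across classes
    return d / d.norm()

def debias(feats, d):
    # Remove the component of every feature along the bias direction.
    return feats - (feats @ d)[:, None] * d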
 | Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek: Skeleton-Contrastive 3D Action Representation Learning. In: MM, Chengdu, China, 2021. @inproceedings{ThokerMM21,
title = {Skeleton-Contrastive 3D Action Representation Learning},
author = {Fida Mohammad Thoker and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2108.03656
https://github.com/fmthoker/skeleton-contrast},
year = {2021},
date = {2021-10-20},
booktitle = {MM},
address = {Chengdu, China},
abstract = {This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. |
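The inter-skeleton contrastive objective can be sketched as a symmetric InfoNCE loss between two encodings of the same sequences under different skeleton representations; the encoder outputs and the temperature below are illustrative, not the repo's full pipeline.

import torch
import torch.nn.functional as F

def inter_skeleton_nce(z_a, z_b, temperature=0.07):
    # z_a, z_b: (B, D) encodings of the same sequences under two different
    # skeleton representations (e.g. graph-based and image-like).
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature  # (B, B) cross-representation similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-contrastive objective: the i-th pair is the positive.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))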