2022
David W Zhang, Gertjan J Burghouts, Cees G M Snoek: Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets. In: LoG, 2022.
@inproceedings{ZhangLOG2022,
title = {Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets},
author = {David W Zhang and Gertjan J Burghouts and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/zhang-hypergraphs-log2022.pdf
https://github.com/davzha/recurrently_predicting_hypergraphs},
year = {2022},
date = {2022-12-09},
urldate = {2022-12-09},
booktitle = {LoG},
abstract = {This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the positive edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
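To make the memory argument in the abstract above concrete, here is a minimal back-of-the-envelope illustration, not the authors' model: supervising every candidate hyperedge of n entities requires 2^n - 1 targets, whereas supervising only the positive hyperedges, stored as index sets or incidence rows, grows linearly with n for a fixed number of positives. The numbers and hyperedges below are hypothetical.

```python
# Hypothetical numbers, for illustration only.
n = 20                                          # entities in the input set
num_candidates = 2 ** n - 1                     # every non-empty subset is a candidate hyperedge
positives = [(0, 3, 7), (2, 5), (1, 4, 8, 9)]   # made-up ground-truth hyperedges

dense_targets = num_candidates                  # one label per candidate edge: 1,048,575
sparse_targets = len(positives) * n             # one n-dim incidence row per positive edge: 60
print(dense_targets, sparse_targets)
```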
Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees G. M. Snoek: Variational Model Perturbation for Source-Free Domain Adaptation. In: NeurIPS, 2022.
@inproceedings{JingNeurIPS2022,
title = {Variational Model Perturbation for Source-Free Domain Adaptation},
author = {Mengmeng Jing and Xiantong Zhen and Jingjing Li and Cees G. M. Snoek},
url = {https://github.com/mmjing/Variational_Model_Perturbation
https://arxiv.org/abs/2210.10378},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS},
abstract = {We aim for source-free domain adaptation, where the task is to deploy a model pre-trained on source domains to target domains. The challenges stem from the distribution shift from the source to the target domain, coupled with the unavailability of any source data and labeled target data for optimization. Rather than fine-tuning the model by updating the parameters, we propose to perturb the source model to achieve adaptation to target domains. We introduce perturbations into the model parameters by variational Bayesian inference in a probabilistic framework. By doing so, we can effectively adapt the model to the target domain while largely preserving the discriminative ability. Importantly, we demonstrate the theoretical connection to learning Bayesian neural networks, which proves the generalizability of the perturbed model to target domains. To enable more efficient optimization, we further employ a parameter sharing strategy, which substantially reduces the learnable parameters compared to a fully Bayesian neural network. Our model perturbation provides a new probabilistic way for domain adaptation which enables efficient adaptation to target domains while maximally preserving knowledge in source models. Experiments on several source-free benchmarks under three different evaluation settings verify the effectiveness of the proposed variational model perturbation for source-free domain adaptation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
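As a rough sketch of the perturbation idea, the layer below keeps the source weights frozen and learns a single shared perturbation scale per layer, sampled with the reparameterization trick. This is a simplification under assumptions of the editor (a source layer with a bias, one scalar scale per layer, no KL term or target-side objective shown); it is not the authors' implementation, which is linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbedLinear(nn.Module):
    """Sketch: keep the source weights frozen and use w = w_src + sigma * eps,
    with one shared log-scale per layer to keep the learnable parameters few."""

    def __init__(self, src_linear: nn.Linear):
        super().__init__()
        # frozen source parameters (assumes the source layer has a bias)
        self.register_buffer("w_src", src_linear.weight.detach().clone())
        self.register_buffer("b_src", src_linear.bias.detach().clone())
        # single shared perturbation scale for the whole layer
        self.log_sigma = nn.Parameter(torch.tensor(-4.0))

    def forward(self, x):
        eps = torch.randn_like(self.w_src)            # reparameterization trick
        w = self.w_src + self.log_sigma.exp() * eps   # perturbed, not fine-tuned
        return F.linear(x, w, self.b_src)
```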
Jiayi Shen, Zehao Xiao, Xiantong Zhen, Cees G. M. Snoek, Marcel Worring: Association Graph Learning for Multi-Task Classification with Category Shifts. In: NeurIPS, 2022.
@inproceedings{ShenNeurIPS2022,
title = {Association Graph Learning for Multi-Task Classification with Category Shifts},
author = {Jiayi Shen and Zehao Xiao and Xiantong Zhen and Cees G. M. Snoek and Marcel Worring},
url = {https://arxiv.org/abs/2210.04637
https://github.com/autumn9999/MTC-with-Category-Shifts.git},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS},
abstract = {In this paper, we focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously. In particular, we tackle a new setting, which is more realistic than currently addressed in the literature, where categories shift from training to test data. Hence, individual tasks do not contain complete training data for the categories in the test set. To generalize to such test data, it is crucial for individual tasks to leverage knowledge from related tasks. To this end, we propose learning an association graph to transfer knowledge among tasks for missing classes. We construct the association graph with nodes representing tasks, classes and instances, and encode the relationships among the nodes in the edges to guide their mutual knowledge transfer. By message passing on the association graph, our model enhances the categorical information of each instance, making it more discriminative. To avoid spurious correlations between task and class nodes in the graph, we introduce an assignment entropy maximization that encourages each class node to balance its edge weights. This enables all tasks to fully utilize the categorical information from related tasks. An extensive evaluation on three general benchmarks and a medical dataset for skin lesion classification reveals that our method consistently performs better than representative baselines.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
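The assignment entropy maximization mentioned in the abstract can be written compactly. The sketch below is a generic version under assumed conventions (class-to-task edge weights held in a logits matrix, entropy averaged over class nodes) and is not the paper's exact formulation.

```python
import torch

def assignment_entropy_regularizer(edge_logits):
    """Encourage each class node to balance its edge weights over task nodes.

    edge_logits: (num_classes, num_tasks) unnormalized class-to-task edge scores.
    Returns a scalar to add to the training loss; minimizing it maximizes entropy.
    """
    p = edge_logits.softmax(dim=-1)                        # normalized edge weights
    entropy = -(p * (p + 1e-8).log()).sum(dim=-1).mean()   # mean entropy per class node
    return -entropy
```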
Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, Cees G M Snoek: How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? In: ECCV, 2022.
@inproceedings{ThokerECCV2022,
title = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
author = {Fida Mohammad Thoker and Hazel Doughty and Piyush Bagad and Cees G M Snoek},
url = {https://arxiv.org/abs/2203.14221
https://bpiyush.github.io/SEVERE-website/
https://github.com/fmthoker/SEVERE-BENCHMARK},
year = {2022},
date = {2022-10-24},
urldate = {2022-10-24},
booktitle = {ECCV},
abstract = {Despite the recent success of video self-supervised learning, there is much still to be understood about its generalization capability. In this paper, we investigate how sensitive video self-supervised learning is to the currently used benchmark convention and whether methods generalize beyond the canonical evaluation setting. We do this across four different factors of sensitivity: domain, samples, actions and task. Our comprehensive set of over 500 experiments, which encompasses 7 video datasets, 9 self-supervised methods and 6 video understanding tasks, reveals that current benchmarks in video self-supervised learning are not a good indicator of generalization along these sensitivity factors. Further, we find that self-supervised methods considerably lag behind vanilla supervised pre-training, especially when domain shift is large and the amount of available downstream samples is low. From our analysis we distill the SEVERE-benchmark, a subset of our experiments, and discuss its implications for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Pengwan Yang, Yuki M Asano, Pascal Mettes, Cees G M Snoek: Less than Few: Self-Shot Video Instance Segmentation. In: ECCV, 2022.
@inproceedings{YangECCV22,
title = {Less than Few: Self-Shot Video Instance Segmentation},
author = {Pengwan Yang and Yuki M Asano and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/yang-selfshot-eccv2022.pdf
https://github.com/PengWan-Yang/self-shot},
year = {2022},
date = {2022-10-24},
urldate = {2022-10-24},
booktitle = {ECCV},
abstract = {The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle to provide a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel-level across the spatial and temporal domains. We provide strong baseline performances that utilize a novel transformer-based model and show that self-shot learning can even surpass few-shot and can be positively combined for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive with oracle support in some settings, scales to large unlabelled video collections, and can be combined in a semi-supervised setting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
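The retrieval step at the heart of self-shot learning can be approximated with a few lines of nearest-neighbour search in a learned embedding space. The sketch below assumes precomputed video embeddings and cosine similarity; it only illustrates finding support videos without a human oracle, not the paper's full pipeline.

```python
import torch
import torch.nn.functional as F

def retrieve_support_videos(query_emb, pool_embs, k=5):
    """Pick the k most similar unlabelled videos as 'support' for a query.

    query_emb: (D,) embedding of the query video.
    pool_embs: (N, D) embeddings of the unlabelled video collection.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pool_embs, dim=-1)
    return (p @ q).topk(k).indices        # indices of the selected support videos
```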
Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Tom van Sonsbeek, Xiantong Zhen, Dwarikanath Mahapatra, Marcel Worring, Cees G M Snoek: LifeLonger: A Benchmark for Continual Disease Classification. In: MICCAI, Singapore, 2022.
@inproceedings{DerakhshaniMICCAI2022,
title = {LifeLonger: A Benchmark for Continual Disease Classification},
author = {Mohammad Mahdi Derakhshani and Ivona Najdenkoska and Tom van Sonsbeek and Xiantong Zhen and Dwarikanath Mahapatra and Marcel Worring and Cees G M Snoek},
url = {https://arxiv.org/abs/2204.05737
https://github.com/mmderakhshani/LifeLonger},
year = {2022},
date = {2022-09-18},
urldate = {2022-09-18},
booktitle = {MICCAI},
address = {Singapore},
abstract = {Deep learning models have shown great effectiveness in recognizing findings in medical images. However, they cannot handle the ever-changing clinical environment, which brings newly annotated medical data from different sources. To exploit the incoming streams of data, these models would benefit largely from sequentially learning from new samples, without forgetting the previously obtained knowledge.
In this paper we introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection, by applying existing state-of-the-art continual learning methods. In particular, we consider three continual learning scenarios, namely, task and class incremental learning and the newly defined cross-domain incremental learning. Task and class incremental learning of diseases address the issue of classifying new samples without re-training the models from scratch, while cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge. We perform a thorough analysis of the performance and examine how the well-known challenges of continual learning, such as catastrophic forgetting, exhibit themselves in this setting. The encouraging results demonstrate that continual learning has major potential to advance disease classification and to produce a more robust and efficient learning framework for clinical settings. The code repository, data partitions and baseline results for the complete benchmark are publicly available.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Hazel Doughty, Cees G M Snoek: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs. In: CVPR, 2022.
@inproceedings{DoughtyCVPR2022,
title = {How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs},
author = {Hazel Doughty and Cees G M Snoek},
url = {https://hazeldoughty.github.io/Papers/PseudoAdverbs/PseudoAdverbs.pdf
https://hazeldoughty.github.io/Papers/PseudoAdverbs/
https://github.com/hazeld/pseudoadverbs},
year = {2022},
date = {2022-06-03},
urldate = {2022-06-03},
booktitle = {CVPR},
abstract = {We aim to understand how actions are performed and identify subtle differences, such as `fold firmly' vs. `fold gently'. To this end, we propose a method which recognizes adverbs across different actions. However, such fine-grained annotations are difficult to obtain and their long-tailed nature makes it challenging to recognize adverbs in rare action-adverb compositions. Our approach therefore uses semi-supervised learning with multiple adverb pseudo-labels to leverage videos with only action labels. Combined with adaptive thresholding of these pseudo-adverbs we are able to make efficient use of the available data while tackling the long-tailed distribution. Additionally, we gather adverb annotations for three existing video retrieval datasets, which allows us to introduce the new tasks of recognizing adverbs in unseen action-adverb compositions and unseen domains. Experiments demonstrate the effectiveness of our method, which outperforms prior work in recognizing adverbs and semi-supervised works adapted for adverb recognition. We also show how adverbs can relate fine-grained actions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
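To illustrate the adaptive-thresholding idea for long-tailed pseudo-labels, here is a generic per-class threshold scheme in the spirit of curriculum-style thresholds used in semi-supervised learning. The scaling rule and names are assumptions for the sketch, not the paper's exact mechanism.

```python
import torch

def select_pseudo_adverbs(probs, base_tau=0.9):
    """Keep pseudo-labels whose confidence passes a per-class threshold that is
    scaled by the model's current confidence on that class, so rare adverbs in
    the long tail are not all filtered out by a single global cutoff.

    probs: (N, A) predicted adverb probabilities for N unlabelled clips.
    Returns indices of the retained clips and their pseudo-labels.
    """
    conf, labels = probs.max(dim=1)                  # most likely adverb per clip
    num_adverbs = probs.size(1)
    class_conf = torch.zeros(num_adverbs)
    for a in range(num_adverbs):
        mask = labels == a
        if mask.any():
            class_conf[a] = conf[mask].mean()
    tau = base_tau * class_conf / class_conf.max().clamp(min=1e-8)
    keep = conf > tau[labels]
    return keep.nonzero(as_tuple=True)[0], labels[keep]
```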
Duy-Kien Nguyen, Jihong Ju, Olaf Booij, Martin R Oswald, Cees G M Snoek: BoxeR: Box-Attention for 2D and 3D Transformers. In: CVPR, 2022.
@inproceedings{NguyenCVPR2022,
title = {BoxeR: Box-Attention for 2D and 3D Transformers},
author = {Duy-Kien Nguyen and Jihong Ju and Olaf Booij and Martin R Oswald and Cees G M Snoek},
url = {https://arxiv.org/abs/2111.13087
https://github.com/kienduynguyen/BoxeR},
year = {2022},
date = {2022-06-02},
urldate = {2022-06-02},
booktitle = {CVPR},
abstract = {In this paper, we propose a simple attention mechanism, which we call Box-Attention. It enables spatial interaction between grid features, as sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by considering their grid structure. Notably, BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for 3D end-to-end object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves better results on COCO detection, and reaches comparable performance with well-established and highly-optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
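A core ingredient of box-attention is sampling a small grid of features inside each predicted box. The helper below is a hedged sketch of that sampling step using bilinear interpolation; the box parametrization, grid size and tensor shapes are editorial assumptions that only loosely follow the paper, whose code is linked above.

```python
import torch
import torch.nn.functional as F

def sample_box_grid(features, boxes, grid_size=7):
    """Bilinearly sample a grid_size x grid_size grid of features inside each box.

    features: (B, C, H, W) feature map.
    boxes:    (B, Q, 4) boxes as (cx, cy, w, h), normalized to [0, 1].
    Returns:  (B, Q, C, grid_size, grid_size) grid features per box.
    """
    B, C, _, _ = features.shape
    _, Q, _ = boxes.shape
    cx, cy, w, h = boxes.unbind(-1)                                    # each (B, Q)
    steps = torch.linspace(-0.5, 0.5, grid_size, device=features.device)
    gy = cy[..., None, None] + h[..., None, None] * steps.view(1, 1, grid_size, 1)
    gx = cx[..., None, None] + w[..., None, None] * steps.view(1, 1, 1, grid_size)
    gy = gy.expand(B, Q, grid_size, grid_size)
    gx = gx.expand(B, Q, grid_size, grid_size)
    grid = torch.stack((gx * 2 - 1, gy * 2 - 1), dim=-1)               # (x, y) in [-1, 1]
    grid = grid.flatten(1, 2)                                          # (B, Q*g, g, 2)
    sampled = F.grid_sample(features, grid, align_corners=False)       # (B, C, Q*g, g)
    return sampled.view(B, C, Q, grid_size, grid_size).permute(0, 2, 1, 3, 4)
```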
Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G M Snoek: Audio-Adaptive Activity Recognition Across Video Domains. In: CVPR, 2022.
@inproceedings{ZhangCVPR2022,
title = {Audio-Adaptive Activity Recognition Across Video Domains},
author = {Yunhua Zhang and Hazel Doughty and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2203.14240
https://xiaobai1217.github.io/DomainAdaptation/
https://github.com/xiaobai1217/DomainAdaptation},
year = {2022},
date = {2022-06-02},
urldate = {2022-06-02},
booktitle = {CVPR},
abstract = {This paper strives for activity recognition under domain shift, for example caused by change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works we leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening. We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation as well as addressing shifts in the semantic distribution. To further eliminate domain-specific features and include domain-invariant activity sounds for recognition, an audio-infused recognizer is proposed, which effectively models the cross-modal interaction across domains. We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically. Experiments on this dataset, EPIC-Kitchens and CharadesEgo show the effectiveness of our approach.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G M Snoek, Joseph Tighe: TubeR: Tubelet Transformer for Video Action Detection. In: CVPR, 2022 (Oral presentation, top 4.2%).
@inproceedings{ZhaoCVPR2022,
title = {TubeR: Tubelet Transformer for Video Action Detection},
author = {Jiaojiao Zhao and Yanyi Zhang and Xinyu Li and Hao Chen and Shuai Bing and Mingze Xu and Chunhui Liu and Kaustav Kundu and Yuanjun Xiong and Davide Modolo and Ivan Marsic and Cees G M Snoek and Joseph Tighe},
url = {https://arxiv.org/abs/2104.00969
https://github.com/amazon-research/tubelet-transformer},
year = {2022},
date = {2022-06-01},
urldate = {2022-06-01},
booktitle = {CVPR},
abstract = {We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21.},
note = {Oral presentation, top 4.2%},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yingjun Du, Xiantong Zhen, Ling Shao, Cees G M Snoek: Hierarchical Variational Memory for Few-shot Learning Across Domains. In: ICLR, Virtual, 2022.
@inproceedings{DuICLR2022,
title = {Hierarchical Variational Memory for Few-shot Learning Across Domains},
author = {Yingjun Du and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2112.08181},
year = {2022},
date = {2022-04-25},
urldate = {2022-04-25},
booktitle = {ICLR},
address = {Virtual},
abstract = {Neural memory enables fast adaptation to new tasks with just a few training samples. Existing memory models store features only from the single last layer, which does not generalize well in presence of a domain shift between training and test distributions. Rather than relying on a flat memory, we propose a hierarchical alternative that stores features at different semantic levels. We introduce a hierarchical prototype model, where each level of the prototype fetches corresponding information from the hierarchical memory. The model is endowed with the ability to flexibly rely on features at different semantic levels if the domain shift circumstances so demand. We meta-learn the model by a newly derived hierarchical variational inference framework, where hierarchical memory and prototypes are jointly optimized. To explore and exploit the importance of different semantic levels, we further propose to learn the weights associated with the prototype at each level in a data-driven way, which enables the model to adaptively choose the most generalizable features. We conduct thorough ablation studies to demonstrate the effectiveness of each component in our model. The new state-of-the-art performance on cross-domain and competitive performance on traditional few-shot classification further substantiates the benefit of hierarchical variational memory.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
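The idea of blending prototypes from different semantic levels with learned weights can be sketched as follows. This is a deterministic simplification under assumed shapes (nearest-prototype scores per level, softmaxed level weights); the paper formulates it with variational inference over a hierarchical memory.

```python
import torch

def hierarchical_prototype_logits(level_feats, level_protos, level_weights):
    """Score classes by (negative) distance to per-level prototypes and blend
    the levels with learned, softmax-normalized weights.

    level_feats:   list of L tensors (B, D_l), query features per level.
    level_protos:  list of L tensors (C, D_l), class prototypes per level.
    level_weights: (L,) learnable scalars, one per semantic level.
    """
    w = level_weights.softmax(dim=0)
    logits = 0.0
    for wl, f, p in zip(w, level_feats, level_protos):
        logits = logits + wl * (-torch.cdist(f, p))   # closer prototype -> higher score
    return logits                                      # (B, C)
```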
Zehao Xiao, Xiantong Zhen, Ling Shao, Cees G M Snoek: Learning to Generalize across Domains on Single Test Samples. In: ICLR, Virtual, 2022.
@inproceedings{XiaoICLR2022,
title = {Learning to Generalize across Domains on Single Test Samples},
author = {Zehao Xiao and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2202.08045
https://github.com/zzzx1224/SingleSampleGeneralization-ICLR2022},
year = {2022},
date = {2022-04-25},
urldate = {2022-04-25},
booktitle = {ICLR},
address = {Virtual},
abstract = {We strive to learn a model from a set of source domains that generalizes well to unseen target domains. The main challenge in such a domain generalization scenario is the unavailability of any target domain data during training, resulting in the learned model not being explicitly adapted to the unseen target domains. We propose learning to generalize across domains on single test samples. We leverage a meta-learning paradigm to learn our model to acquire the ability of adaptation with single samples at training time so as to further adapt itself to each single test sample at test time. We formulate the adaptation to the single test sample as a variational Bayesian inference problem, which incorporates the test sample as a conditional into the generation of model parameters. The adaptation to each test sample requires only one feed-forward computation at test time without any fine-tuning or self-supervised training on additional data from the unseen domains. Extensive ablation studies demonstrate that our model learns the ability to adapt models by mimicking domain shift during training. Further, our model achieves at least comparable -- and often better -- performance than state-of-the-art methods on multiple benchmarks for domain generalization},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
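To make the single-sample, feed-forward adaptation concrete, the sketch below conditions the classifier weights on a single (test) sample through a small hypernetwork. It is a deterministic toy version of the idea; the paper instead treats the adaptation as variational Bayesian inference, and all module names here are hypothetical.

```python
import torch
import torch.nn as nn

class SampleConditionedClassifier(nn.Module):
    """A small hypernetwork maps a single (test) sample's features to a residual
    on the classifier weights, so adaptation is one feed-forward pass."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.base = nn.Linear(feat_dim, num_classes)
        self.hyper = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes * feat_dim),
        )

    def forward(self, feats):                  # feats: (B, feat_dim); B may be 1
        ctx = feats.mean(dim=0)                # the single-sample "context"
        delta = self.hyper(ctx).view(self.base.out_features, -1)
        w = self.base.weight + delta           # sample-conditioned classifier
        return feats @ w.t() + self.base.bias
```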
Yan Zhang, David W Zhang, Simon Lacoste-Julien, Gertjan J Burghouts, Cees G M Snoek: Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation. In: ICLR, Virtual, 2022.
@inproceedings{YanZhangICLR2022,
title = {Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation},
author = {Yan Zhang and David W Zhang and Simon Lacoste-Julien and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2111.12193
https://www.youtube.com/watch?v=xfVBZprO7g8
https://github.com/davzha/multiset-equivariance},
year = {2022},
date = {2022-04-24},
urldate = {2022-04-01},
booktitle = {ICLR},
address = {Virtual},
abstract = {Most set prediction models in deep learning use set-equivariant operations, but they actually operate on multisets. We show that set-equivariant functions cannot represent certain functions on multisets, so we introduce the more appropriate notion of multiset-equivariance. We identify that the existing Deep Set Prediction Network (DSPN) can be multiset-equivariant without being hindered by set-equivariance and improve it with approximate implicit differentiation, allowing for better optimization while being faster and saving memory. In a range of toy experiments, we show that the perspective of multiset-equivariance is beneficial and that our changes to DSPN achieve better results in most cases. On CLEVR object property prediction, we substantially improve over the state-of-the-art Slot Attention from 8% to 77% in one of the strictest evaluation metrics because of the benefits made possible by implicit differentiation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
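A tiny experiment makes the limitation of set-equivariant operations tangible: a permutation-equivariant layer such as self-attention maps two identical input elements to identical outputs, so it can never pull duplicates apart. The check below is illustrative only and unrelated to the paper's code.

```python
import torch

# Two of the three input elements are identical.
x = torch.tensor([[1.0, 2.0], [1.0, 2.0], [3.0, 0.0]])
attn = torch.nn.MultiheadAttention(embed_dim=2, num_heads=1, batch_first=True)
with torch.no_grad():
    y, _ = attn(x[None], x[None], x[None])   # self-attention over the set

# Identical inputs receive identical outputs, so the duplicates cannot be
# separated -- which is exactly what set prediction sometimes needs to do.
print(torch.allclose(y[0, 0], y[0, 1]))      # True
```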
Zenglin Shi, Pascal Mettes, Subhransu Maji, Cees G M Snoek: On Measuring and Controlling the Spectral Bias of the Deep Image Prior. In: International Journal of Computer Vision, vol. 130, pp. 885–908, 2022.
@article{ShiIJCV22,
title = {On Measuring and Controlling the Spectral Bias of the Deep Image Prior},
author = {Zenglin Shi and Pascal Mettes and Subhransu Maji and Cees G M Snoek},
url = {https://arxiv.org/abs/2107.01125
https://link.springer.com/article/10.1007/s11263-021-01572-7
https://github.com/shizenglin/Measure-and-Control-Spectral-Bias},
year = {2022},
date = {2022-04-01},
urldate = {2022-02-11},
journal = {International Journal of Computer Vision},
volume = {130},
pages = {885–908},
abstract = {The deep image prior showed that a randomly initialized network with a suitable architecture can be trained to solve inverse imaging problems by simply optimizing its parameters to reconstruct a single degraded image. However, it suffers from two practical limitations. First, it remains unclear how to control the prior beyond the choice of the network architecture. Second, training requires an oracle stopping criterion as during the optimization the performance degrades after reaching an optimum value. To address these challenges we introduce a frequency-band correspondence measure to characterize the spectral bias of the deep image prior, where low-frequency image signals are learned faster and better than high-frequency counterparts. Based on our observations, we propose techniques to prevent the eventual performance degradation and accelerate convergence. We introduce a Lipschitz-controlled convolution layer and a Gaussian-controlled upsampling layer as plug-in replacements for layers used in the deep architectures. The experiments show that with these changes the performance does not degrade during optimization, relieving us from the need for an oracle stopping criterion. We further outline a stopping criterion to avoid superfluous computation. Finally, we show that our approach obtains favorable results compared to current approaches across various denoising, deblocking, inpainting, super-resolution and detail enhancement tasks. Code is available at https://github.com/shizenglin/Measure-and-Control-Spectral-Bias.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
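As a rough sketch of a Lipschitz-controlled convolution, the layer below rescales its kernel whenever the spectral norm of the flattened weight exceeds a chosen constant. Note that this flattened-weight norm is only a proxy for the true operator norm of the convolution; the class is an assumption-laden illustration rather than the paper's layer, whose code is linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipschitzConv2d(nn.Conv2d):
    """Rescale the kernel whenever the spectral norm of the flattened weight
    exceeds a target constant, limiting how quickly high frequencies are fit."""

    def __init__(self, *args, lipschitz=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.lipschitz = lipschitz

    def forward(self, x):
        w = self.weight.flatten(1)                        # (out_ch, in_ch * k * k)
        sigma = torch.linalg.matrix_norm(w, ord=2)        # largest singular value
        scale = torch.clamp(self.lipschitz / (sigma + 1e-12), max=1.0)
        return F.conv2d(x, self.weight * scale, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```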
William Thong, Cees G M Snoek: Diversely-Supervised Visual Product Search. In: ACM Transactions on Multimedia Computing, Communications and Applications, vol. 18, no. 1, pp. 1-22, 2022.
@article{ThongTOMM22,
title = {Diversely-Supervised Visual Product Search},
author = {William Thong and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/thong-diversely-tomm.pdf
https://doi.org/10.1145/3461646
https://github.com/twuilliam/diverse-search},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
journal = {ACM Transactions on Multimedia Computing, Communications and Applications},
volume = {18},
number = {1},
pages = {1-22},
abstract = {This paper strives for a diversely-supervised visual product search, where queries specify a diverse set of labels to search for. Where previous works have focused on representing attribute, instance or category labels individually, we consider them together to create a diverse set of labels for visually describing products. We learn an embedding from the supervisory signal provided by every label to encode their interrelationships. Once trained, every label has a corresponding visual representation in the embedding space, which is an aggregation of selected items from the training set. At search time, composite query representations retrieve images that match a specific set of diverse labels. We form composite query representations by averaging over the aggregated representations of each diverse label in the specific set. For evaluation, we extend existing product datasets of cars and clothes with a diverse set of labels. Experiments show the benefits of our embedding for diversely-supervised visual product search in seen and unseen product combinations, and for discovering product design styles.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
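The composite-query mechanism described in the abstract amounts to averaging the visual representations of the requested labels and ranking items by similarity. The sketch below assumes precomputed label prototypes and item embeddings; the function names are hypothetical and the snippet is not the authors' code, which is linked above.

```python
import torch
import torch.nn.functional as F

def composite_query(label_prototypes, label_ids):
    """Average the embedded prototypes of the requested labels, e.g. an
    attribute, an instance and a category, into a single query vector."""
    q = label_prototypes[label_ids].mean(dim=0)
    return F.normalize(q, dim=-1)

def search(query, item_embs, k=10):
    """Rank product images by cosine similarity to the composite query."""
    sims = F.normalize(item_embs, dim=-1) @ query
    return sims.topk(k).indices
```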
2021
Sander R Klomp, Matthew van Rijn, Rob G J Wijnhoven, Cees G M Snoek, Peter H N de With: Safe Fakes: Evaluating Face Anonymizers for Face Detectors. In: IEEE International Conference on Automatic Face and Gesture Recognition, Jodhpur, India, 2021.
@inproceedings{klomp2021safe,
title = {Safe Fakes: Evaluating Face Anonymizers for Face Detectors},
author = {Sander R Klomp and Matthew van Rijn and Rob G J Wijnhoven and Cees G M Snoek and Peter H N de With},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/klomp-safe-fakes-fg2021.pdf},
year = {2021},
date = {2021-12-15},
urldate = {2021-04-23},
booktitle = {IEEE International Conference on Automatic Face and Gesture Recognition},
address = {Jodhpur, India},
abstract = {Since the introduction of the GDPR and CCPA privacy legislation, both public and private facial image datasets are increasingly scrutinized. Several datasets have been taken offline completely and some have been anonymized. However, it is unclear how anonymization impacts face detection performance. To our knowledge, this paper presents the first empirical study on the effect of image anonymization on supervised training of face detectors. We compare conventional face anonymizers with three state-of-the-art Generative Adversarial Network-based (GAN) methods, by training an off-the-shelf face detector on anonymized data. Our experiments investigate the suitability of anonymization methods for maintaining face detector performance, the effect of detectors overtraining on anonymization artefacts, dataset size for training an anonymizer, and the effect of training time of anonymization GANs. A final experiment investigates the correlation between common GAN evaluation metrics and the performance of a trained face detector. Although all tested anonymization methods lower the performance of trained face detectors, faces anonymized using GANs cause far smaller performance degradation than conventional methods. As the most important finding, the best-performing GAN, DeepPrivacy, removes identifiable faces for a face detector trained on anonymized data, resulting in a modest decrease from 91.0 to 88.3 mAP. In the last few years, there have been rapid improvements in realism of GAN-generated faces. We expect that further progression in GAN research will allow the use of Deep Fake technology for privacy-preserving Safe Fakes, without any performance degradation for training face detectors.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Shuo Chen, Pascal Mettes, Cees G M Snoek: Diagnosing Errors in Video Relation Detectors. In: BMVC, Virtual, 2021.
@inproceedings{ChenBMVC2021,
title = {Diagnosing Errors in Video Relation Detectors},
author = {Shuo Chen and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/chen-diagnosing-bmvc2021.pdf
https://github.com/shanshuo/DiagnoseVRD},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
booktitle = {BMVC},
address = {Virtual},
abstract = {Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. Despite recent progress in video relation detection, overall performance is still marginal and it remains unclear what the key factors are towards solving the problem. Following examples set in the object detection and action localization literature, we perform a deep dive into the error diagnosis of current video relation detection approaches. We introduce a diagnostic tool for analyzing the sources of detection errors. Our tool evaluates and compares current approaches beyond the single scalar metric of mean Average Precision by defining different error types specific to video relation detection, used for false positive analyses. Moreover, we examine different factors of influence on the performance in a false negative analysis, including relation length, number of subject/object/predicate instances, and subject/object size. Finally, we present the effect on video relation performance when considering an oracle fix for each error type. On two video relation benchmarks, we show where current approaches excel and fall short, allowing us to pinpoint the most important future directions in the field. The tool is available at https://github.com/shanshuo/DiagnoseVRD.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
William Thong, Cees G M Snoek: Feature and Label Embedding Spaces Matter in Addressing Image Classifier Bias. In: BMVC, Virtual, 2021.
@inproceedings{ThongBMVC2021,
title = {Feature and Label Embedding Spaces Matter in Addressing Image Classifier Bias},
author = {William Thong and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/thong-image-classifier-bias-bmvc2021.pdf
https://github.com/twuilliam/bias-classifiers},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
booktitle = {BMVC},
address = {Virtual},
abstract = {This paper strives to address image classifier bias, with a focus on both feature and label embedding spaces. Previous works have shown that spurious correlations from protected attributes, such as age, gender, or skin tone, can cause adverse decisions. To balance potential harms, there is a growing need to identify and mitigate image classifier bias. First, we identify in the feature space a bias direction. We compute class prototypes of each protected attribute value for every class, and reveal an existing subspace that captures the maximum variance of the bias. Second, we mitigate biases by mapping image inputs to label embedding spaces. Each value of the protected attribute has its projection head where classes are embedded through a latent vector representation rather than a common one-hot encoding. Once trained, we further reduce in the feature space the bias effect by removing its direction. Evaluation on biased image datasets, for multi-class, multi-label and binary classifications, shows the effectiveness of tackling both feature and label embedding spaces in improving the fairness of the classifier predictions, while preserving classification performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives to address image classifier bias, with a focus on both feature and label embedding spaces. Previous works have shown that spurious correlations from protected attributes, such as age, gender, or skin tone, can cause adverse decisions. To balance potential harms, there is a growing need to identify and mitigate image classifier bias. First, we identify in the feature space a bias direction. We compute class prototypes of each protected attribute value for every class, and reveal an existing subspace that captures the maximum variance of the bias. Second, we mitigate biases by mapping image inputs to label embedding spaces. Each value of the protected attribute has its projection head where classes are embedded through a latent vector representation rather than a common one-hot encoding. Once trained, we further reduce in the feature space the bias effect by removing its direction. Evaluation on biased image datasets, for multi-class, multi-label and binary classifications, shows the effectiveness of tackling both feature and label embedding spaces in improving the fairness of the classifier predictions, while preserving classification performance. |
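The prototype-based bias direction in this paper lends itself to a compact illustration. The sketch below, assuming a simple SVD estimator and NumPy arrays for features, labels and protected attributes, computes per-class, per-attribute prototypes, takes the leading direction of their variance as the bias direction, and projects it out of the features; it is an illustrative reading of the first step, not the authors' released code.

```python
# Hypothetical sketch of the bias-direction step described above: per-class,
# per-attribute prototypes, an SVD over them to expose the bias subspace, and
# removal of the leading bias direction from the features. Variable names and
# the exact estimator are assumptions, not the authors' code.
import numpy as np

def bias_direction(features, labels, attrs):
    """features: (N, D); labels, attrs: (N,) integer arrays."""
    protos = []
    for c in np.unique(labels):
        for a in np.unique(attrs):
            mask = (labels == c) & (attrs == a)
            if mask.any():
                # prototype of class c under attribute value a
                protos.append(features[mask].mean(axis=0))
    P = np.stack(protos)
    P = P - P.mean(axis=0, keepdims=True)
    # leading right-singular vector spans the direction of maximum prototype variance
    _, _, vt = np.linalg.svd(P, full_matrices=False)
    return vt[0]

def remove_direction(features, d):
    d = d / np.linalg.norm(d)
    return features - np.outer(features @ d, d)

# usage on random stand-in data
X = np.random.randn(1000, 128)
y = np.random.randint(0, 10, 1000)
a = np.random.randint(0, 2, 1000)
X_debiased = remove_direction(X, bias_direction(X, y, a))
```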
| Fida Mohammad Thoker, Hazel Doughty, Cees G M Snoek: Skeleton-Contrastive 3D Action Representation Learning. In: MM, Chengdu, China, 2021. @inproceedings{ThokerMM21,
title = {Skeleton-Contrastive 3D Action Representation Learning},
author = {Fida Mohammad Thoker and Hazel Doughty and Cees G M Snoek},
url = {https://arxiv.org/abs/2108.03656
https://github.com/fmthoker/skeleton-contrast},
year = {2021},
date = {2021-10-20},
booktitle = {MM},
address = {Chengdu, China},
abstract = {This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. |
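The cross-contrastive objective between skeleton representations can be illustrated with a standard InfoNCE loss. The sketch below assumes two stand-in encoders producing embeddings of the same sequences under two skeleton representations and a placeholder temperature; it shows the noise contrastive estimation idea, not the paper's full augmentation pipeline.

```python
# Minimal sketch of a cross-contrastive (InfoNCE) objective between two skeleton
# representations of the same sequences, in the spirit of the inter-skeleton
# contrastive learning described above. Encoders and temperature are placeholders.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings of the same B sequences under two representations."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))           # positives are on the diagonal
    # cross-contrastive: each view retrieves its counterpart in the other view
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# usage with stand-in embeddings from two different skeleton encoders
B, D = 32, 128
z_seq = torch.randn(B, D, requires_grad=True)     # e.g. sequence-based encoder output
z_img = torch.randn(B, D, requires_grad=True)     # e.g. image-like skeleton encoding
loss = info_nce(z_seq, z_img)
loss.backward()
```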
| Kirill Gavrilyuk, Mihir Jain, Ilia Karmanov, Cees G M Snoek: Motion-Augmented Self-Training for Video Recognition at Smaller Scale. In: ICCV, Montreal, Canada, 2021. @inproceedings{gavrilyuk2021motionaugmented,
title = {Motion-Augmented Self-Training for Video Recognition at Smaller Scale},
author = {Kirill Gavrilyuk and Mihir Jain and Ilia Karmanov and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/gavrilyuk-motionfit-iccv2021.pdf},
year = {2021},
date = {2021-10-11},
booktitle = {ICCV},
address = {Montreal, Canada},
abstract = {The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our network using optical flow, but avoid its computation during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal granularity of videos during self-training of the appearance model, which was missed in previous works. As a result, we obtain a strong motion-augmented representation model suited for video downstream tasks like action recognition and clip retrieval. On small-scale video datasets, MotionFit outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7% and semi-supervised learning by 9%-18% using the same amount of class labels.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our network using optical flow, but avoid its computation during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal granularity of videos during self-training of the appearance model, which was missed in previous works. As a result, we obtain a strong motion-augmented representation model suited for video downstream tasks like action recognition and clip retrieval. On small-scale video datasets, MotionFit outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7% and semi-supervised learning by 9%-18% using the same amount of class labels. |
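The pseudo-labeling step at the heart of MotionFit can be summarized in a few lines. The sketch below, with stand-in motion and appearance models and a plain cross-entropy objective, shows a flow-trained teacher labeling unlabeled clips and an RGB model fitting those pseudo-labels; the paper's multi-clip loss and temporal-granularity handling are not reproduced here.

```python
# Hedged sketch of the pseudo-labeling step: a flow-trained motion model labels an
# unlabeled collection and an appearance (RGB) model is trained to predict those
# pseudo-labels. Models and optimizer are stand-ins supplied by the caller.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(motion_model, flow_clips):
    """flow_clips: (B, C, T, H, W) optical-flow clips -> hard pseudo-labels (B,)."""
    motion_model.eval()
    return motion_model(flow_clips).argmax(dim=1)

def self_train_step(appearance_model, optimizer, rgb_clips, flow_clips, motion_model):
    """rgb_clips and flow_clips come from the same unlabeled videos."""
    targets = pseudo_label(motion_model, flow_clips)
    logits = appearance_model(rgb_clips)
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```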
| Shuo Chen, Zenglin Shi, Pascal Mettes, Cees G M Snoek: Social Fabric: Tubelet Compositions for Video Relation Detection. In: ICCV, Montreal, Canada, 2021. @inproceedings{ChenRelationICCV21,
title = {Social Fabric: Tubelet Compositions for Video Relation Detection},
author = {Shuo Chen and Zenglin Shi and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/chen-social-fabric-iccv2021.pdf
https://github.com/shanshuo/Social-Fabric},
year = {2021},
date = {2021-10-11},
booktitle = {ICCV},
address = {Montreal, Canada},
abstract = {This paper strives to classify and detect the relationship between object tubelets appearing within a video as a ⟨subject-predicate-object⟩ triplet. Where existing works treat object proposals or tubelets as single entities and model their relations a posteriori, we propose to classify and detect predicates for pairs of object tubelets a priori. We also propose Social Fabric: an encoding that represents a pair of object tubelets as a composition of interaction primitives. These primitives are learned over all relations, resulting in a compact representation able to localize and classify relations from the pool of co-occurring object tubelets across all timespans in a video. The encoding enables our two-stage network. In the first stage, we train Social Fabric to suggest proposals that are likely interacting. We use the Social Fabric in the second stage to simultaneously fine-tune and predict predicate labels for the tubelets. Experiments demonstrate the benefit of early video relation modeling, our encoding and the two-stage architecture, leading to a new state-of-the-art on two benchmarks. We also show how the encoding enables query-by-primitive-example to search for spatio-temporal video relations. Code: https://github.com/shanshuo/Social-Fabric.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives to classify and detect the relationship between object tubelets appearing within a video as a ⟨subject-predicate-object⟩ triplet. Where existing works treat object proposals or tubelets as single entities and model their relations a posteriori, we propose to classify and detect predicates for pairs of object tubelets a priori. We also propose Social Fabric: an encoding that represents a pair of object tubelets as a composition of interaction primitives. These primitives are learned over all relations, resulting in a compact representation able to localize and classify relations from the pool of co-occurring object tubelets across all timespans in a video. The encoding enables our two-stage network. In the first stage, we train Social Fabric to suggest proposals that are likely interacting. We use the Social Fabric in the second stage to simultaneously fine-tune and predict predicate labels for the tubelets. Experiments demonstrate the benefit of early video relation modeling, our encoding and the two-stage architecture, leading to a new state-of-the-art on two benchmarks. We also show how the encoding enables query-by-primitive-example to search for spatio-temporal video relations. Code: https://github.com/shanshuo/Social-Fabric. |
| Joska Appelman, Kiki Bijleveld, Peter Ejbye-Ernst, Evelien Hoeben, Lasse Liebst, Cees Snoek, Dennis Koelma, Marie Rosenkrantz Lindegaard: Naleving van gedragsmaatregelen tijdens de COVID-19-pandemie. In: Justitiele Verkenningen, vol. 47, no. 3, pp. 54–71, 2021. @article{AppelmanJV2021,
title = {Naleving van gedragsmaatregelen tijdens de COVID-19-pandemie},
author = {Joska Appelman and Kiki Bijleveld and Peter Ejbye-Ernst and Evelien Hoeben and Lasse Liebst and Cees Snoek and Dennis Koelma and Marie Rosenkrantz Lindegaard},
doi = {10.5553/JV/016758502021047003004},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
journal = {Justitiele Verkenningen},
volume = {47},
number = {3},
pages = {54--71},
abstract = {To mitigate the spread of the COVID-19 virus, the Dutch government has implemented several rules and regulations during the pandemic. Compliance with these rules and regulations is crucial for their effectiveness. In the current article, the authors give an overview of research findings from three different studies looking at compliance with the COVID-19 mitigating measures in the Netherlands. In these studies, both manual and computer-based video analysis is used to give insight into the behavior of people on the streets of Amsterdam. Study 1 monitors compliance with the social distancing directive and stay-at-home advice, showing that people keep less distance when it is crowded on the street. Study 2 focuses on compliance with mandatory mask-wearing and shows that mask-wearing increases with the implementation of the mandatory mask areas, but crowding does not decrease. Finally, Study 3 looks at compliance of citizens during the curfew and shows that streets became far less crowded after 9 p.m. during curfew nights.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
To mitigate the spread of the COVID-19 virus, the Dutch government has implemented several rules and regulations during the pandemic. Compliance with these rules and regulations is crucial for their effectiveness. In the current article, the authors give an overview of research findings from three different studies looking at compliance with the COVID-19 mitigating measures in the Netherlands. In these studies, both manual and computer-based video analysis is used to give insight into the behavior of people on the streets of Amsterdam. Study 1 monitors compliance with the social distancing directive and stay-at-home advice, showing that people keep less distance when it is crowded on the street. Study 2 focuses on compliance with mandatory mask-wearing and shows that mask-wearing increases with the implementation of the mandatory mask areas, but crowding does not decrease. Finally, Study 3 looks at compliance of citizens during the curfew and shows that streets became far less crowded after 9 p.m. during curfew nights. |
| Zenglin Shi, Yunlu Chen, Efstratios Gavves, Pascal Mettes, Cees G M Snoek: Unsharp Mask Guided Filtering. In: IEEE Transactions on Image Processing, vol. 30, pp. 7472-7485, 2021. @article{ShiTIP21,
title = {Unsharp Mask Guided Filtering},
author = {Zenglin Shi and Yunlu Chen and Efstratios Gavves and Pascal Mettes and Cees G M Snoek},
url = {https://arxiv.org/abs/2106.01428
https://github.com/shizenglin/Unsharp-Mask-Guided-Filtering},
doi = {10.1109/TIP.2021.3106812},
year = {2021},
date = {2021-09-01},
journal = {IEEE Transactions on Image Processing},
volume = {30},
pages = {7472-7485},
abstract = {The goal of this paper is guided image filtering, which emphasizes the importance of structure transfer during filtering by means of an additional guidance image. Where classical guided filters transfer structures using hand-designed functions, recent guided filters have been considerably advanced through parametric learning of deep networks. The state-of-the-art leverages deep networks to estimate the two core coefficients of the guided filter. In this work, we posit that simultaneously estimating both coefficients is suboptimal, resulting in halo artifacts and structure inconsistencies. Inspired by unsharp masking, a classical technique for edge enhancement that requires only a single coefficient, we propose a new and simplified formulation of the guided filter. Our formulation enjoys a filtering prior from a low-pass filter and enables explicit structure transfer by estimating a single coefficient. Based on our proposed formulation, we introduce a successive guided filtering network, which provides multiple filtering results from a single network, allowing for a trade-off between accuracy and efficiency. Extensive ablations, comparisons and analysis show the effectiveness and efficiency of our formulation and network, resulting in state-of-the-art results across filtering tasks like upsampling, denoising, and cross-modality filtering. },
keywords = {},
pubstate = {published},
tppubtype = {article}
}
The goal of this paper is guided image filtering, which emphasizes the importance of structure transfer during filtering by means of an additional guidance image. Where classical guided filters transfer structures using hand-designed functions, recent guided filters have been considerably advanced through parametric learning of deep networks. The state-of-the-art leverages deep networks to estimate the two core coefficients of the guided filter. In this work, we posit that simultaneously estimating both coefficients is suboptimal, resulting in halo artifacts and structure inconsistencies. Inspired by unsharp masking, a classical technique for edge enhancement that requires only a single coefficient, we propose a new and simplified formulation of the guided filter. Our formulation enjoys a filtering prior from a low-pass filter and enables explicit structure transfer by estimating a single coefficient. Based on our proposed formulation, we introduce a successive guided filtering network, which provides multiple filtering results from a single network, allowing for a trade-off between accuracy and efficiency. Extensive ablations, comparisons and analysis show the effectiveness and efficiency of our formulation and network, resulting in state-of-the-art results across filtering tasks like upsampling, denoising, and cross-modality filtering. |
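Classical unsharp masking, which motivates the single-coefficient formulation above, is easy to state in code. The sketch below implements a box-blur low-pass, the classical sharpening step, and a guided variant that transfers detail from a guidance image through one coefficient; the blur, the scalar coefficient and the guided variant are illustrative assumptions, since the paper estimates its coefficient with a network.

```python
# A small sketch of classical unsharp masking and a guided variant with a single
# transfer coefficient, i.e. the classical operation that motivates the paper's
# formulation; it is not the paper's successive guided filtering network.
import torch
import torch.nn.functional as F

def box_blur(x, k=5):
    """x: (B, C, H, W); simple low-pass filter via an averaging kernel."""
    pad = k // 2
    weight = torch.ones(x.size(1), 1, k, k, device=x.device) / (k * k)
    return F.conv2d(F.pad(x, (pad,) * 4, mode='reflect'), weight, groups=x.size(1))

def unsharp_mask(image, lam=1.0, k=5):
    low = box_blur(image, k)
    return low + lam * (image - low)              # re-add amplified high-frequency detail

def guided_unsharp(image, guidance, lam=1.0, k=5):
    # transfer structure from the guidance image through a single coefficient
    return box_blur(image, k) + lam * (guidance - box_blur(guidance, k))

out = guided_unsharp(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```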
| Yingjun Du, Nithin Holla, Xiantong Zhen, Cees G M Snoek, Ekaterina Shutova: Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation. In: ACL-IJCNLP, Bangkok, Thailand, 2021. @inproceedings{DuACL2021,
title = {Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation},
author = {Yingjun Du and Nithin Holla and Xiantong Zhen and Cees G M Snoek and Ekaterina Shutova},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/du-word-sense-memory-acl2021.pdf
https://github.com/YDU-uva/VSM_WSD},
year = {2021},
date = {2021-08-01},
booktitle = {ACL-IJCNLP},
address = {Bangkok, Thailand},
abstract = {A critical challenge faced by supervised word sense disambiguation (WSD) is the lack of large annotated datasets with sufficient coverage of words in their diversity of senses. This inspired recent research on few-shot WSD using meta-learning. While such work has successfully applied meta-learning to learn new word senses from very few examples, its performance still lags behind its fully-supervised counterpart. Aiming to further close this gap, we propose a model of semantic memory for WSD in a meta-learning setting. Semantic memory encapsulates prior experiences seen throughout the lifetime of the model, which aids better generalization in limited data settings. Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork. We show our model advances the state of the art in few-shot WSD, supports effective learning in extremely data scarce (e.g. one-shot) scenarios and produces meaning prototypes that capture similar senses of distinct words.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
A critical challenge faced by supervised word sense disambiguation (WSD) is the lack of large annotated datasets with sufficient coverage of words in their diversity of senses. This inspired recent research on few-shot WSD using meta-learning. While such work has successfully applied meta-learning to learn new word senses from very few examples, its performance still lags behind its fully-supervised counterpart. Aiming to further close this gap, we propose a model of semantic memory for WSD in a meta-learning setting. Semantic memory encapsulates prior experiences seen throughout the lifetime of the model, which aids better generalization in limited data settings. Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork. We show our model advances the state of the art in few-shot WSD, supports effective learning in extremely data scarce (e.g. one-shot) scenarios and produces meaning prototypes that capture similar senses of distinct words. |
| Mohammad Mahdi Derakhshani, Xiantong Zhen, Ling Shao, Cees G M Snoek: Kernel Continual Learning. In: ICML, Vienna, Austria, 2021. @inproceedings{DerakhshaniICML21,
title = {Kernel Continual Learning},
author = {Mohammad Mahdi Derakhshani and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/derakhshani-kernel-continual-icml2021.pdf
https://github.com/mmderakhshani/KCL},
year = {2021},
date = {2021-07-01},
booktitle = {ICML},
address = {Vienna, Austria},
abstract = {This paper introduces kernel continual learning, a simple but effective variant of continual learning that leverages the non-parametric nature of kernel methods to tackle catastrophic forgetting. We deploy an episodic memory unit that stores a subset of samples for each task to learn task-specific classifiers based on kernel ridge regression. This does not require memory replay and systematically avoids task interference in the classifiers. We further introduce variational random features to learn a data-driven kernel for each task. To do so, we formulate kernel continual learning as a variational inference problem, where a random Fourier basis is incorporated as the latent variable. The variational posterior distribution over the random Fourier basis is inferred from the coreset of each task. In this way, we are able to generate more informative kernels specific to each task, and, more importantly, the coreset size can be reduced to achieve more compact memory, resulting in more efficient continual learning based on episodic memory. Extensive evaluation on four benchmarks demonstrates the effectiveness and promise of kernels for continual learning. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper introduces kernel continual learning, a simple but effective variant of continual learning that leverages the non-parametric nature of kernel methods to tackle catastrophic forgetting. We deploy an episodic memory unit that stores a subset of samples for each task to learn task-specific classifiers based on kernel ridge regression. This does not require memory replay and systematically avoids task interference in the classifiers. We further introduce variational random features to learn a data-driven kernel for each task. To do so, we formulate kernel continual learning as a variational inference problem, where a random Fourier basis is incorporated as the latent variable. The variational posterior distribution over the random Fourier basis is inferred from the coreset of each task. In this way, we are able to generate more informative kernels specific to each task, and, more importantly, the coreset size can be reduced to achieve more compact memory, resulting in more efficient continual learning based on episodic memory. Extensive evaluation on four benchmarks demonstrates the effectiveness and promise of kernels for continual learning. |
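The per-task classifier can be illustrated with plain kernel ridge regression on a coreset. The sketch below uses an RBF kernel instead of the paper's variational random features; the kernel choice, regularization strength and coreset are assumptions.

```python
# Minimal sketch of the per-task classifier described above: kernel ridge
# regression fitted only on a small per-task coreset, so no replay is needed.
import torch

def rbf_kernel(a, b, gamma=0.05):
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def fit_krr(coreset_x, coreset_y, num_classes, lam=0.1):
    """coreset_x: (M, D); coreset_y: (M,) ints. Returns dual coefficients (M, K)."""
    K = rbf_kernel(coreset_x, coreset_x)
    Y = torch.nn.functional.one_hot(coreset_y, num_classes).float()
    alpha = torch.linalg.solve(K + lam * torch.eye(K.size(0)), Y)
    return alpha

def predict_krr(x, coreset_x, alpha):
    return rbf_kernel(x, coreset_x) @ alpha       # (N, K) class scores

# usage: one classifier per task, fitted from that task's coreset alone
xc, yc = torch.randn(40, 64), torch.randint(0, 5, (40,))
alpha = fit_krr(xc, yc, num_classes=5)
scores = predict_krr(torch.randn(8, 64), xc, alpha)
```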
| Zehao Xiao, Jiayi Shen, Xiantong Zhen, Ling Shao, Cees G M Snoek: A Bit More Bayesian: Domain-Invariant Learning with Uncertainty. In: ICML, Vienna, Austria, 2021. @inproceedings{XiaoICML21,
title = {A Bit More Bayesian: Domain-Invariant Learning with Uncertainty},
author = {Zehao Xiao and Jiayi Shen and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2105.04030
https://github.com/zzzx1224/A-Bit-More-Bayesian},
year = {2021},
date = {2021-07-01},
booktitle = {ICML},
address = {Vienna, Austria},
abstract = {Domain generalization is challenging due to the domain shift and the uncertainty caused by the inaccessibility of target domain data. In this paper, we address both challenges with a probabilistic framework based on variational Bayesian inference, by incorporating uncertainty into neural network weights. We couple domain invariance in a probabilistic formula with the variational Bayesian inference. This enables us to explore domain-invariant learning in a principled way. Specifically, we derive domain-invariant representations and classifiers, which are jointly established in a two-layer Bayesian neural network. We empirically demonstrate the effectiveness of our proposal on four widely used cross-domain visual recognition benchmarks. Ablation studies validate the synergistic benefits of our Bayesian treatment when jointly learning domain-invariant representations and classifiers for domain generalization. Further, our method consistently delivers state-of-the-art mean accuracy on all benchmarks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Domain generalization is challenging due to the domain shift and the uncertainty caused by the inaccessibility of target domain data. In this paper, we address both challenges with a probabilistic framework based on variational Bayesian inference, by incorporating uncertainty into neural network weights. We couple domain invariance in a probabilistic formula with the variational Bayesian inference. This enables us to explore domain-invariant learning in a principled way. Specifically, we derive domain-invariant representations and classifiers, which are jointly established in a two-layer Bayesian neural network. We empirically demonstrate the effectiveness of our proposal on four widely used cross-domain visual recognition benchmarks. Ablation studies validate the synergistic benefits of our Bayesian treatment when jointly learning domain-invariant representations and classifiers for domain generalization. Further, our method consistently delivers state-of-the-art mean accuracy on all benchmarks. |
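Placing uncertainty on network weights, as in the Bayesian layers described above, is commonly done with a reparameterized Gaussian posterior. The sketch below is a generic variational linear layer with a standard-normal prior and a single-sample estimate; it illustrates the ingredient, not the paper's exact two-layer domain-invariant construction.

```python
# Generic variational Bayesian linear layer: a factorized Gaussian posterior over
# the weights, sampled with the reparameterization trick, plus its KL to N(0, I).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_dim, in_dim))
        self.log_sigma = nn.Parameter(torch.full((out_dim, in_dim), -3.0))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        sigma = self.log_sigma.exp()
        weight = self.mu + sigma * torch.randn_like(sigma)   # reparameterized sample
        return F.linear(x, weight, self.bias)

    def kl(self):
        # KL(q(w) || N(0, I)) for a factorized Gaussian posterior
        sigma2 = (2 * self.log_sigma).exp()
        return 0.5 * (sigma2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

layer = BayesLinear(128, 10)
logits = layer(torch.randn(4, 128))
loss = F.cross_entropy(logits, torch.randint(0, 10, (4,))) + 1e-3 * layer.kl()
```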
| Pengwan Yang, Pascal Mettes, Cees G M Snoek: Few-Shot Transformation of Common Actions into Time and Space. In: CVPR, Nashville, USA, 2021. @inproceedings{YangCVPR21,
title = {Few-Shot Transformation of Common Actions into Time and Space},
author = {Pengwan Yang and Pascal Mettes and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/yang-common-time-space-cvpr2021.pdf
https://github.com/PengWan-Yang/few-shot-transformer},
year = {2021},
date = {2021-06-01},
booktitle = {CVPR},
address = {Nashville, USA},
abstract = {This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, it also compares favorably against the few-shot and one-shot state-of-the-art in this setting. Lastly, we demonstrate that the few-shot transformer is easily extended to common action localization per pixel.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, it also compares favorably against the few-shot and one-shot state-of-the-art in this setting. Lastly, we demonstrate that the few-shot transformer is easily extended to common action localization per pixel. |
| Yunhua Zhang, Ling Shao, Cees G M Snoek: Repetitive Activity Counting by Sight and Sound. In: CVPR, Nashville, USA, 2021. @inproceedings{ZhangCVPR21,
title = {Repetitive Activity Counting by Sight and Sound},
author = {Yunhua Zhang and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2103.13096
https://github.com/xiaobai1217/RepetitionCounting},
year = {2021},
date = {2021-06-01},
booktitle = {CVPR},
address = {Nashville, USA},
abstract = {This paper strives for repetitive activity counting in videos. Different from existing works, which all analyze the visual video content only, we incorporate for the first time the corresponding sound into the repetition counting process. This benefits accuracy in challenging vision conditions such as occlusion, dramatic camera view changes, low resolution, etc. We propose a model that starts with analyzing the sight and sound streams separately. Then an audiovisual temporal stride decision module and a reliability estimation module are introduced to exploit cross-modal temporal interaction. For learning and evaluation, an existing dataset is repurposed and reorganized to allow for repetition counting with sight and sound. We also introduce a variant of this dataset for repetition counting under challenging vision conditions. Experiments demonstrate the benefit of sound, as well as the other introduced modules, for repetition counting. Our sight-only model already outperforms the state-of-the-art by itself; when we add sound, results improve notably, especially under harsh vision conditions. The code and datasets are available at https://github.com/xiaobai1217/RepetitionCounting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for repetitive activity counting in videos. Different from existing works, which all analyze the visual video content only, we incorporate for the first time the corresponding sound into the repetition counting process. This benefits accuracy in challenging vision conditions such as occlusion, dramatic camera view changes, low resolution, etc. We propose a model that starts with analyzing the sight and sound streams separately. Then an audiovisual temporal stride decision module and a reliability estimation module are introduced to exploit cross-modal temporal interaction. For learning and evaluation, an existing dataset is repurposed and reorganized to allow for repetition counting with sight and sound. We also introduce a variant of this dataset for repetition counting under challenging vision conditions. Experiments demonstrate the benefit of sound, as well as the other introduced modules, for repetition counting. Our sight-only model already outperforms the state-of-the-art by itself; when we add sound, results improve notably, especially under harsh vision conditions. The code and datasets are available at https://github.com/xiaobai1217/RepetitionCounting. |
| Yingjun Du, Xiantong Zhen, Ling Shao, Cees G M Snoek: MetaNorm: Learning to Normalize Few-Shot Batches Across Domains. In: ICLR, Vienna, Austria, 2021. @inproceedings{DuICLR21,
title = {MetaNorm: Learning to Normalize Few-Shot Batches Across Domains},
author = {Yingjun Du and Xiantong Zhen and Ling Shao and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/du-metanorm-iclr2021.pdf
https://github.com/YDU-AI/MetaNorm},
year = {2021},
date = {2021-05-01},
booktitle = {ICLR},
address = {Vienna, Austria},
abstract = {Batch normalization plays a crucial role when training deep neural networks. However, batch statistics become unstable with small batch sizes and are unreliable in the presence of distribution shifts. We propose MetaNorm, a simple yet effective meta-learning normalization. It tackles the aforementioned issues in a unified way by leveraging the meta-learning setting and learns to infer adaptive statistics for batch normalization. MetaNorm is generic, flexible and model-agnostic, making it a simple plug-and-play module that is seamlessly embedded into existing meta-learning approaches. It can be efficiently implemented by lightweight hypernetworks with low computational cost. We verify its effectiveness by extensive evaluation on representative tasks suffering from the small batch and domain shift problems: few-shot learning and domain generalization. We further introduce an even more challenging setting: few-shot domain generalization. Results demonstrate that MetaNorm consistently achieves better, or at least competitive, accuracy compared to existing batch normalization methods. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Batch normalization plays a crucial role when training deep neural networks. However, batch statistics become unstable with small batch sizes and are unreliable in the presence of distribution shifts. We propose MetaNorm, a simple yet effective meta-learning normalization. It tackles the aforementioned issues in a unified way by leveraging the meta-learning setting and learns to infer adaptive statistics for batch normalization. MetaNorm is generic, flexible and model-agnostic, making it a simple plug-and-play module that is seamlessly embedded into existing meta-learning approaches. It can be efficiently implemented by lightweight hypernetworks with low computational cost. We verify its effectiveness by extensive evaluation on representative tasks suffering from the small batch and domain shift problems: few-shot learning and domain generalization. We further introduce an even more challenging setting: few-shot domain generalization. Results demonstrate that MetaNorm consistently achieves better, or at least competitive, accuracy compared to existing batch normalization methods. |
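The core mechanism, inferring normalization statistics from a few examples instead of relying on unreliable small-batch statistics, can be sketched with a tiny hypernetwork. Everything below (the MLPs, the mean-pooled support context, the softplus on the scale) is an assumption for illustration, not the paper's exact architecture.

```python
# Hedged sketch of meta-learned normalization: a lightweight hypernetwork infers
# mean and scale from a few support examples, and those statistics normalize the
# query batch in place of batch statistics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaNormSketch(nn.Module):
    def __init__(self, num_features, hidden=32):
        super().__init__()
        self.mu_net = nn.Sequential(nn.Linear(num_features, hidden), nn.ReLU(), nn.Linear(hidden, num_features))
        self.sigma_net = nn.Sequential(nn.Linear(num_features, hidden), nn.ReLU(), nn.Linear(hidden, num_features))
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, query, support):
        """query: (B, C) features; support: (S, C) few-shot support features."""
        ctx = support.mean(dim=0)                          # pooled support context
        mu = self.mu_net(ctx)
        sigma = F.softplus(self.sigma_net(ctx)) + 1e-5     # inferred, strictly positive scale
        return self.gamma * (query - mu) / sigma + self.beta

norm = MetaNormSketch(64)
out = norm(torch.randn(16, 64), torch.randn(5, 64))
```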
| David W Zhang, Gertjan J Burghouts, Cees G M Snoek: Set Prediction without Imposing Structure as Conditional Density Estimation. In: ICLR, Vienna, Austria, 2021. @inproceedings{ZhangICLR21,
title = {Set Prediction without Imposing Structure as Conditional Density Estimation},
author = {David W Zhang and Gertjan J Burghouts and Cees G M Snoek},
url = {https://arxiv.org/abs/2010.04109
https://github.com/davzha/DESP},
year = {2021},
date = {2021-05-01},
booktitle = {ICLR},
address = {Vienna, Austria},
abstract = {Set prediction is about learning to predict a collection of unordered variables with unknown interrelations. Training such models with set losses imposes the structure of a metric space over sets. We focus on stochastic and underdefined cases, where an incorrectly chosen loss function leads to implausible predictions. Example tasks include conditional point-cloud reconstruction and predicting future states of molecules. In this paper, we propose an alternative to training via set losses by viewing learning as conditional density estimation. Our learning framework fits deep energy-based models and approximates the intractable likelihood with gradient-guided sampling. Furthermore, we propose a stochastically augmented prediction algorithm that enables multiple predictions, reflecting the possible variations in the target set. We empirically demonstrate on a variety of datasets the capability to learn multi-modal densities and produce different plausible predictions. Our approach is competitive with previous set prediction models on standard benchmarks. More importantly, it extends the family of addressable tasks beyond those that have unambiguous predictions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Set prediction is about learning to predict a collection of unordered variables with unknown interrelations. Training such models with set losses imposes the structure of a metric space over sets. We focus on stochastic and underdefined cases, where an incorrectly chosen loss function leads to implausible predictions. Example tasks include conditional point-cloud reconstruction and predicting future states of molecules. In this paper, we propose an alternative to training via set losses by viewing learning as conditional density estimation. Our learning framework fits deep energy-based models and approximates the intractable likelihood with gradient-guided sampling. Furthermore, we propose a stochastically augmented prediction algorithm that enables multiple predictions, reflecting the possible variations in the target set. We empirically demonstrate on a variety of datasets the capability to learn multi-modal densities and produce different plausible predictions. Our approach is competitive with previous set prediction models on standard benchmarks. More importantly, it extends the family of addressable tasks beyond those that have unambiguous predictions. |
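Gradient-guided sampling from a conditional energy-based model is essentially Langevin dynamics on the predicted set. The sketch below assumes a generic energy function and fixed step sizes; different random initializations yield different plausible sets, which is the multi-modality the approach exploits.

```python
# Sketch of gradient-guided prediction from a conditional energy-based model:
# start from noise and descend the energy with added Gaussian noise (Langevin-style).
# The energy network, step size and number of steps are stand-ins.
import torch

def langevin_predict(energy_fn, x, set_shape, steps=50, step_size=0.1, noise=0.01):
    """energy_fn(x, y) -> (B,) energies; y has shape (B, *set_shape)."""
    y = torch.randn(x.size(0), *set_shape, requires_grad=True)
    for _ in range(steps):
        e = energy_fn(x, y).sum()
        grad, = torch.autograd.grad(e, y)
        with torch.no_grad():
            y = y - step_size * grad + noise * torch.randn_like(y)
        y.requires_grad_(True)
    return y.detach()

# usage with a toy energy: every set element should move towards the conditioning vector
toy_energy = lambda x, y: ((y - x.unsqueeze(1)) ** 2).flatten(1).sum(1)
pred = langevin_predict(toy_energy, torch.randn(4, 2), set_shape=(5, 2))
```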
| Jiaojiao Zhao, Cees G M Snoek: LiftPool: Bidirectional ConvNet Pooling. In: ICLR, Vienna, Austria, 2021. @inproceedings{ZhaoICLR21,
title = {LiftPool: Bidirectional ConvNet Pooling},
author = {Jiaojiao Zhao and Cees G M Snoek},
url = {https://isis-data.science.uva.nl/cgmsnoek/pub/zhao-liftpool-iclr2021.pdf
https://github.com/jiaozizhao/LiftPool/},
year = {2021},
date = {2021-05-01},
booktitle = {ICLR},
address = {Vienna, Austria},
abstract = {Pooling is a critical operation in convolutional neural networks for increasing receptive fields and improving robustness to input variations. Most existing pooling operations downsample the feature maps, which is a lossy process. Moreover, they are not invertible: upsampling a downscaled feature map can not recover the lost information in the downsampling. By adopting the philosophy of the classical Lifting Scheme from signal processing, we propose LiftPool for bidirectional pooling layers, including LiftDownPool and LiftUpPool. LiftDownPool decomposes a feature map into various downsized sub-bands, each of which contains information with different frequencies. As the pooling function in LiftDownPool is perfectly invertible, by performing LiftDownPool backwards, a corresponding up-pooling layer LiftUpPool is able to generate a refined upsampled feature map using the detail sub-bands, which is useful for image-to-image translation challenges. Experiments show the proposed methods achieve better results on image classification and semantic segmentation, using various backbones. Moreover, LiftDownPool offers better robustness to input corruptions and perturbations.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Pooling is a critical operation in convolutional neural networks for increasing receptive fields and improving robustness to input variations. Most existing pooling operations downsample the feature maps, which is a lossy process. Moreover, they are not invertible: upsampling a downscaled feature map can not recover the lost information in the downsampling. By adopting the philosophy of the classical Lifting Scheme from signal processing, we propose LiftPool for bidirectional pooling layers, including LiftDownPool and LiftUpPool. LiftDownPool decomposes a feature map into various downsized sub-bands, each of which contains information with different frequencies. As the pooling function in LiftDownPool is perfectly invertible, by performing LiftDownPool backwards, a corresponding up-pooling layer LiftUpPool is able to generate a refined upsampled feature map using the detail sub-bands, which is useful for image-to-image translation challenges. Experiments show the proposed methods achieve better results on image classification and semantic segmentation, using various backbones. Moreover, LiftDownPool offers better robustness to input corruptions and perturbations. |
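The lifting scheme that LiftPool adopts is easiest to see in one dimension. The toy below uses fixed Haar-style predict and update operators on a 1D signal and verifies exact invertibility; the paper learns these operators and applies them to 2D feature maps, which this sketch does not attempt.

```python
# One-dimensional lifting-scheme toy (split, predict, update) illustrating the
# invertible decomposition behind LiftDownPool/LiftUpPool.
import torch

def lift_down(x):
    """x: (B, L) with even L -> (approximation, detail), each (B, L/2)."""
    even, odd = x[:, ::2], x[:, 1::2]
    detail = odd - even                  # predict: even samples predict their odd neighbors
    approx = even + 0.5 * detail         # update: keep the running average in the approximation
    return approx, detail

def lift_up(approx, detail):
    """Exact inverse of lift_down, recovering the original signal."""
    even = approx - 0.5 * detail
    odd = detail + even
    x = torch.empty(approx.size(0), approx.size(1) * 2)
    x[:, ::2], x[:, 1::2] = even, odd
    return x

x = torch.randn(2, 8)
a, d = lift_down(x)
assert torch.allclose(lift_up(a, d), x)   # the decomposition is lossless
```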
| Pascal Mettes, William Thong, Cees G M Snoek: Object Priors for Classifying and Localizing Unseen Actions. In: International Journal of Computer Vision, vol. 129, no. 6, pp. 1954–1971, 2021. @article{MettesIJCV21,
title = {Object Priors for Classifying and Localizing Unseen Actions},
author = {Pascal Mettes and William Thong and Cees G M Snoek},
url = {https://doi.org/10.1007/s11263-021-01454-y
https://github.com/psmmettes/object-priors-unseen-actions},
year = {2021},
date = {2021-04-19},
urldate = {2021-04-19},
journal = {International Journal of Computer Vision},
volume = {129},
number = {6},
pages = {1954–1971},
abstract = {This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen action videos, we seek to classify and spatio-temporally localize unseen actions in videos from image-based object information only. We propose three spatial object priors, which encode local person and object detectors along with their spatial relations. On top we introduce three semantic object priors, which extend semantic matching through word embeddings with three simple functions that tackle semantic ambiguity, object discrimination, and object naming. A video embedding combines the spatial and semantic object priors. It enables us to introduce a new video retrieval task that retrieves action tubes in video collections based on user-specified objects, spatial relations, and object size. Experimental evaluation on five action datasets shows the importance of spatial and semantic object priors for unseen actions. We find that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen action videos, we seek to classify and spatio-temporally localize unseen actions in videos from image-based object information only. We propose three spatial object priors, which encode local person and object detectors along with their spatial relations. On top we introduce three semantic object priors, which extend semantic matching through word embeddings with three simple functions that tackle semantic ambiguity, object discrimination, and object naming. A video embedding combines the spatial and semantic object priors. It enables us to introduce a new video retrieval task that retrieves action tubes in video collections based on user-specified objects, spatial relations, and object size. Experimental evaluation on five action datasets shows the importance of spatial and semantic object priors for unseen actions. We find that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization. |
| Shuai Liao, Efstratios Gavves, ChangYong Oh, Cees G M Snoek: Quasibinary Classifier for Images with Zero and Multiple Labels. In: ICPR, Milan, Italy, 2021. @inproceedings{LiaoICPR21,
title = {Quasibinary Classifier for Images with Zero and Multiple Labels},
author = {Shuai Liao and Efstratios Gavves and ChangYong Oh and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/liao-quasibinary-icpr2020.pdf},
year = {2021},
date = {2021-01-01},
booktitle = {ICPR},
address = {Milan, Italy},
abstract = {The softmax and binary classifier are commonly preferred for image classification applications. However, as softmax is specifically designed for categorical classification, it assumes each image has just one class label. This limits its applicability for problems where the number of labels does not equal one, most notably zero- and multi-label problems. In these challenging settings, binary classifiers are, in theory, better suited. However, as they ignore the correlation between classes, they are not as accurate and scalable in practice. In this paper, we start from the observation that the only difference between binary and softmax classifiers is their normalization function. Specifically, while the binary classifier self-normalizes its score, the softmax classifier combines the scores from all classes before normalisation. On the basis of this observation we introduce a normalization function that is learnable, constant, and shared between classes and data points. By doing so, we arrive at a new type of binary classifier that we coin quasibinary classifier. We show in a variety of image classification settings, and on several datasets, that quasibinary classifiers are considerably better in classification settings where regular binary and softmax classifiers suffer, including zero-label and multi-label classification. What is more, we show that quasibinary classifiers yield well-calibrated probabilities allowing for direct and reliable comparisons, not only between classes but also between data points.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
The softmax and binary classifier are commonly preferred for image classification applications. However, as softmax is specifically designed for categorical classification, it assumes each image has just one class label. This limits its applicability for problems where the number of labels does not equal one, most notably zero- and multi-label problems. In these challenging settings, binary classifiers are, in theory, better suited. However, as they ignore the correlation between classes, they are not as accurate and scalable in practice. In this paper, we start from the observation that the only difference between binary and softmax classifiers is their normalization function. Specifically, while the binary classifier self-normalizes its score, the softmax classifier combines the scores from all classes before normalisation. On the basis of this observation we introduce a normalization function that is learnable, constant, and shared between classes and data points. By doing so, we arrive at a new type of binary classifier that we coin quasibinary classifier. We show in a variety of image classification settings, and on several datasets, that quasibinary classifiers are considerably better in classification settings where regular binary and softmax classifiers suffer, including zero-label and multi-label classification. What is more, we show that quasibinary classifiers yield well-calibrated probabilities allowing for direct and reliable comparisons, not only between classes but also between data points. |
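One simple way to realize a normalizer that is learnable, constant and shared across classes and data points is a sigmoid against a single learned threshold. The sketch below is an assumed illustrative form, not the paper's exact quasibinary classifier; it only shows how such a shared normalizer lets zero, one or several labels score highly.

```python
# Illustrative sketch only: binary-style class scores normalized by one learnable
# constant shared across classes and data points. This is an assumed form for
# illustration, not the paper's quasibinary classifier.
import torch
import torch.nn as nn

class SharedNormClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.scores = nn.Linear(feat_dim, num_classes)
        self.log_z = nn.Parameter(torch.zeros(1))   # learnable normalizer, shared by all classes

    def forward(self, x):
        s = self.scores(x)
        # each class probability is normalized by the same learned constant,
        # so zero, one or several labels can receive high probability
        return torch.sigmoid(s - self.log_z)

clf = SharedNormClassifier(128, 20)
probs = clf(torch.randn(4, 128))                    # (4, 20) values in [0, 1]
```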
| Fida Mohammad Thoker, Cees G M Snoek: Feature-Supervised Action Modality Transfer. In: ICPR, Milan, Italy, 2021. @inproceedings{ThokerICPR21,
title = {Feature-Supervised Action Modality Transfer},
author = {Fida Mohammad Thoker and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/thoker-feature-supervised-icpr2020.pdf},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {ICPR},
address = {Milan, Italy},
abstract = {This paper strives for action recognition and detection in video modalities like RGB, depth maps or 3D-skeleton sequences when only limited modality-specific labeled examples are available. For the RGB, and derived optical-flow, modality many large-scale labeled datasets have been made available. They have become the de facto pre-training choice when recognizing or detecting new actions from RGB datasets that have limited amounts of labeled examples available. Unfortunately, large-scale labeled action datasets for other modalities are unavailable for pre-training. In this paper, our goal is to recognize actions from limited examples in non-RGB video modalities, by learning from large-scale labeled RGB data. To this end, we propose a two-step training process: (i) we extract action representation knowledge from an RGB-trained teacher network and adapt it to a non-RGB student network. (ii) we then fine-tune the transfer model with available labeled examples of the target modality. For the knowledge transfer we introduce feature-supervision strategies, which rely on unlabeled pairs of two modalities (the RGB and the target modality) to transfer feature level representations from the teacher to the student network. Ablations and generalizations with two RGB source datasets and two non-RGB target datasets demonstrate that an optical-flow teacher provides better action transfer features than RGB for both depth maps and 3D-skeletons, even when evaluated on a different target domain, or for a different task. Compared to alternative cross-modal action transfer methods we show a good improvement in performance, especially when labeled non-RGB examples to learn from are scarce.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives for action recognition and detection in video modalities like RGB, depth maps or 3D-skeleton sequences when only limited modality-specific labeled examples are available. For the RGB, and derived optical-flow, modality many large-scale labeled datasets have been made available. They have become the de facto pre-training choice when recognizing or detecting new actions from RGB datasets that have limited amounts of labeled examples available. Unfortunately, large-scale labeled action datasets for other modalities are unavailable for pre-training. In this paper, our goal is to recognize actions from limited examples in non-RGB video modalities, by learning from large-scale labeled RGB data. To this end, we propose a two-step training process: (i) we extract action representation knowledge from an RGB-trained teacher network and adapt it to a non-RGB student network. (ii) we then fine-tune the transfer model with available labeled examples of the target modality. For the knowledge transfer we introduce feature-supervision strategies, which rely on unlabeled pairs of two modalities (the RGB and the target modality) to transfer feature level representations from the teacher to the student network. Ablations and generalizations with two RGB source datasets and two non-RGB target datasets demonstrate that an optical-flow teacher provides better action transfer features than RGB for both depth maps and 3D-skeletons, even when evaluated on a different target domain, or for a different task. Compared to alternative cross-modal action transfer methods we show a good improvement in performance, especially when labeled non-RGB examples to learn from are scarce. |
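The two-step recipe above reduces to a feature-regression stage on unlabeled modality pairs followed by ordinary fine-tuning. The sketch below assumes a frozen teacher, an MSE feature-supervision loss and a single projection layer; these are stand-ins for the paper's feature-supervision strategies.

```python
# Hedged sketch of feature supervision across modalities: the student regresses the
# teacher's features on unlabeled paired clips, then is fine-tuned on the few
# labeled target-modality examples. Models, projector and optimizer are supplied
# by the caller and are illustrative stand-ins.
import torch
import torch.nn.functional as F

def feature_supervision_step(teacher, student, projector, optimizer, rgb_batch, target_batch):
    """rgb_batch and target_batch are unlabeled clips of the same videos in two modalities."""
    with torch.no_grad():
        t_feat = teacher(rgb_batch)               # frozen RGB or optical-flow teacher
    s_feat = projector(student(target_batch))     # e.g. depth or skeleton student
    loss = F.mse_loss(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(student, classifier, optimizer, clips, labels):
    """Standard supervised fine-tuning on the scarce labeled target-modality data."""
    loss = F.cross_entropy(classifier(student(clips)), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```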
| Haochen Wang, Yandan Yang, Xianbin Cao, Xiantong Zhen, Cees G M Snoek, Ling Shao: Variational Prototype Inference for Few-Shot Semantic Segmentation. In: WACV, Waikoloa, Hawaii, USA, 2021. @inproceedings{WangWACV21,
title = {Variational Prototype Inference for Few-Shot Semantic Segmentation},
author = {Haochen Wang and Yandan Yang and Xianbin Cao and Xiantong Zhen and Cees G M Snoek and Ling Shao},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/wang-proto-inference-wacv2021.pdf},
year = {2021},
date = {2021-01-01},
booktitle = {WACV},
address = {Waikoloa, Hawaii, USA},
abstract = {In this paper, we propose variational prototype inference to address few-shot semantic segmentation in a probabilistic framework. A probabilistic latent variable model infers the distribution of the prototype that is treated as the latent variable. We formulate the optimization as a variational inference problem, which is established with an amortized inference network based on an auto-encoder architecture. The probabilistic modeling of the prototype enhances its generalization ability to handle the inherent uncertainty caused by limited data and the huge intra-class variations of objects. Moreover, it offers a principled way to incorporate the prototype extracted from support images into the prediction of the segmentation maps for query images. We conduct extensive experimental evaluations on three benchmark datasets. Ablation studies show the effectiveness of variational prototype inference for few-shot semantic segmentation by probabilistic modeling. On all three benchmarks, our proposal achieves high segmentation accuracy and surpasses previous methods by considerable margins.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we propose variational prototype inference to address few-shot semantic segmentation in a probabilistic framework. A probabilistic latent variable model infers the distribution of the prototype that is treated as the latent variable. We formulate the optimization as a variational inference problem, which is established with an amortized inference network based on an auto-encoder architecture. The probabilistic modeling of the prototype enhances its generalization ability to handle the inherent uncertainty caused by limited data and the huge intra-class variations of objects. Moreover, it offers a principled way to incorporate the prototype extracted from support images into the prediction of the segmentation maps for query images. We conduct extensive experimental evaluations on three benchmark datasets. Ablation studies show the effectiveness of variational prototype inference for few-shot semantic segmentation by probabilistic modeling. On all three benchmarks, our proposal achieves high segmentation accuracy and surpasses previous methods by considerable margins. |
2020
|
| Xiantong Zhen, Yingjun Du, Huan Xiong, Qiang Qiu, Cees Snoek, Ling Shao: Learning to Learn Variational Semantic Memory. In: NeurIPS, Vancouver, Canada, 2020. @inproceedings{ZhenNeurIPS20,
title = {Learning to Learn Variational Semantic Memory},
author = {Xiantong Zhen and Yingjun Du and Huan Xiong and Qiang Qiu and Cees Snoek and Ling Shao},
url = {https://arxiv.org/abs/2010.10341
https://github.com/YDU-uva/VSM},
year = {2020},
date = {2020-12-01},
booktitle = {NeurIPS},
address = {Vancouver, Canada},
abstract = {In this paper, we introduce variational semantic memory into meta-learning to acquire long-term knowledge for few-shot learning. The variational semantic memory accrues and stores semantic information for the probabilistic inference of class prototypes in a hierarchical Bayesian framework. The semantic memory is grown from scratch and gradually consolidated by absorbing information from tasks it experiences. By doing so, it is able to accumulate long-term, general knowledge that enables it to learn new concepts of objects. We formulate memory recall as the variational inference of a latent memory variable from addressed contents, which offers a principled way to adapt the knowledge to individual tasks. Our variational semantic memory, as a new long-term memory module, confers principled recall and update mechanisms that enable semantic information to be efficiently accrued and adapted for few-shot learning. Experiments demonstrate that the probabilistic modelling of prototypes achieves a more informative representation of object classes compared to deterministic vectors. The consistent new state-of-the-art performance on four benchmarks shows the benefit of variational semantic memory in boosting few-shot recognition.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we introduce variational semantic memory into meta-learning to acquire long-term knowledge for few-shot learning. The variational semantic memory accrues and stores semantic information for the probabilistic inference of class prototypes in a hierarchical Bayesian framework. The semantic memory is grown from scratch and gradually consolidated by absorbing information from tasks it experiences. By doing so, it is able to accumulate long-term, general knowledge that enables it to learn new concepts of objects. We formulate memory recall as the variational inference of a latent memory variable from addressed contents, which offers a principled way to adapt the knowledge to individual tasks. Our variational semantic memory, as a new long-term memory module, confers principled recall and update mechanisms that enable semantic information to be efficiently accrued and adapted for few-shot learning. Experiments demonstrate that the probabilistic modelling of prototypes achieves a more informative representation of object classes compared to deterministic vectors. The consistent new state-of-the-art performance on four benchmarks shows the benefit of variational semantic memory in boosting few-shot recognition. |
| William Thong, Pascal Mettes, Cees G M Snoek: Open Cross-Domain Visual Search. In: Computer Vision and Image Understanding, vol. 200, 2020. @article{ThongCVIU20,
title = {Open Cross-Domain Visual Search},
author = {William Thong and Pascal Mettes and Cees G M Snoek},
url = {https://doi.org/10.1016/j.cviu.2020.103045
https://github.com/twuilliam/open-search},
year = {2020},
date = {2020-11-01},
urldate = {2020-11-01},
journal = {Computer Vision and Image Understanding},
volume = {200},
abstract = {This paper addresses cross-domain visual search, where visual queries retrieve category samples from a different domain. For example, we may want to sketch an airplane and retrieve photographs of airplanes. Despite considerable progress, the search occurs in a closed setting between two pre-defined domains. In this paper, we make the step towards an open setting where multiple visual domains are available. This notably translates into a search between any pair of domains, from a combination of domains or within multiple domains. We introduce a simple, yet effective, approach. We formulate the search as a mapping from every visual domain to a common semantic space, where categories are represented by hyperspherical prototypes. Open cross-domain visual search is then performed by searching in the common semantic space, regardless of which domains are used as source or target. Domains are combined in the common space to search from or within multiple domains simultaneously. A separate training of every domain-specific mapping function enables an efficient scaling to any number of domains without affecting the search performance. We empirically illustrate our capability to perform open cross-domain visual search in three different scenarios. Our approach is competitive with respect to existing closed settings, where we obtain state-of-the-art results on several benchmarks for three sketch-based search tasks.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This paper addresses cross-domain visual search, where visual queries retrieve category samples from a different domain. For example, we may want to sketch an airplane and retrieve photographs of airplanes. Despite considerable progress, the search occurs in a closed setting between two pre-defined domains. In this paper, we make the step towards an open setting where multiple visual domains are available. This notably translates into a search between any pair of domains, from a combination of domains or within multiple domains. We introduce a simple yet effective approach. We formulate the search as a mapping from every visual domain to a common semantic space, where categories are represented by hyperspherical prototypes. Open cross-domain visual search is then performed by searching in the common semantic space, regardless of which domains are used as source or target. Domains are combined in the common space to search from or within multiple domains simultaneously. A separate training of every domain-specific mapping function enables an efficient scaling to any number of domains without affecting the search performance. We empirically illustrate our capability to perform open cross-domain visual search in three different scenarios. Our approach is competitive with respect to existing closed settings, where we obtain state-of-the-art results on several benchmarks for three sketch-based search tasks. |
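The retrieval step above reduces to nearest-prototype search on the hypersphere. The snippet below is a minimal sketch of that step only, assuming every domain-specific encoder already produces d-dimensional embeddings; the function and variable names are illustrative and not taken from the open-search repository.

```python
# Hedged sketch of search in a shared hyperspherical space (not the authors' code).
# Assumes each domain has its own encoder mapping inputs to d-dimensional embeddings.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_gallery(query_emb, gallery_embs):
    """Rank gallery items by cosine similarity to the query, regardless of domain."""
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    scores = g @ q               # cosine similarity on the hypersphere
    return np.argsort(-scores), scores

# Toy usage: a sketch query and a photo gallery, both already embedded.
rng = np.random.default_rng(0)
query = rng.normal(size=64)            # e.g. output of the sketch-domain encoder
gallery = rng.normal(size=(1000, 64))  # e.g. outputs of the photo-domain encoder
order, scores = rank_gallery(query, gallery)
print(order[:5])                       # indices of the five best matches
```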
| William Thong, Cees G M Snoek: Bias-Awareness for Zero-Shot Learning the Seen and Unseen. In: BMVC, Manchester, UK, 2020. @inproceedings{ThongBMVC20,
title = {Bias-Awareness for Zero-Shot Learning the Seen and Unseen},
author = {William Thong and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/thong-bias-bmvc2020.pdf
https://www.bmvc2020-conference.com/conference/papers/paper_0261.html
https://github.com/twuilliam/bias-gzsl},
year = {2020},
date = {2020-09-01},
booktitle = {BMVC},
address = {Manchester, UK},
abstract = {Generalized zero-shot learning recognizes inputs from both seen and unseen classes. Yet, existing methods tend to be biased towards the classes seen during training. In this paper, we strive to mitigate this bias. We propose a bias-aware learner to map inputs to a semantic embedding space for generalized zero-shot learning. During training, the model learns to regress to real-valued class prototypes in the embedding space with temperature scaling, while a margin-based bidirectional entropy term regularizes seen and unseen probabilities. Relying on a real-valued semantic embedding space provides a versatile approach, as the model can operate on different types of semantic information for both seen and unseen classes. Experiments are carried out on four benchmarks for generalized zero-shot learning and demonstrate the benefits of the proposed bias-aware classifier, both as a stand-alone method and in combination with generated features.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Generalized zero-shot learning recognizes inputs from both seen and unseen classes. Yet, existing methods tend to be biased towards the classes seen during training. In this paper, we strive to mitigate this bias. We propose a bias-aware learner to map inputs to a semantic embedding space for generalized zero-shot learning. During training, the model learns to regress to real-valued class prototypes in the embedding space with temperature scaling, while a margin-based bidirectional entropy term regularizes seen and unseen probabilities. Relying on a real-valued semantic embedding space provides a versatile approach, as the model can operate on different types of semantic information for both seen and unseen classes. Experiments are carried out on four benchmarks for generalized zero-shot learning and demonstrate the benefits of the proposed bias-aware classifier, both as a stand-alone method and in combination with generated features. |
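As a rough illustration of the training signal described above, the sketch below combines temperature-scaled classification against real-valued class prototypes with a simple entropy term on the seen-versus-unseen probability split. It is an assumption-laden stand-in, not the loss from the bias-gzsl repository; the names and the exact form of the regularizer are illustrative.

```python
# Minimal PyTorch sketch (assumptions, not the authors' implementation): classify by
# temperature-scaled similarity to real-valued class prototypes, and add an entropy
# term on the seen-vs-unseen probability split to counter seen-class bias.
import torch
import torch.nn.functional as F

def bias_aware_loss(feats, labels, prototypes, seen_mask, temperature=0.05, lam=0.1):
    """feats: (B, d); prototypes: (C, d); seen_mask: (C,) bool; labels index seen classes."""
    logits = feats @ prototypes.t() / temperature           # (B, C)
    ce = F.cross_entropy(logits, labels)                    # regression-to-prototype surrogate
    probs = logits.softmax(dim=1)
    p_seen = probs[:, seen_mask].sum(dim=1).clamp(1e-6, 1 - 1e-6)
    p_unseen = 1.0 - p_seen
    # Two-way entropy; subtracting it encourages spreading mass to the unseen partition.
    split_entropy = -(p_seen * p_seen.log() + p_unseen * p_unseen.log()).mean()
    return ce - lam * split_entropy

# Toy usage
feats = torch.randn(8, 32)
prototypes = torch.randn(10, 32)
seen_mask = torch.tensor([True] * 6 + [False] * 4)
labels = torch.randint(0, 6, (8,))
print(bias_aware_loss(feats, labels, prototypes, seen_mask))
```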
| Yunlu Chen, Vincent Tao Hu, Efstratios Gavves, Thomas Mensink, Pascal Mettes, Pengwan Yang, Cees G M Snoek: PointMixup: Augmentation for Point Clouds. In: ECCV, Glasgow, UK, 2020, (Spotlight presentation, top 5%). @inproceedings{ChenECCV20,
title = {PointMixup: Augmentation for Point Clouds},
author = {Yunlu Chen and Vincent Tao Hu and Efstratios Gavves and Thomas Mensink and Pascal Mettes and Pengwan Yang and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/chen-pointmixup-eccv2020.pdf
https://github.com/yunlu-chen/PointMixup/},
year = {2020},
date = {2020-08-01},
booktitle = {ECCV},
address = {Glasgow, UK},
abstract = {This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has been shown to be a simple and effective approach in the image domain. Such a mixup is, however, not directly transferable to point clouds, as we do not have a one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation between point clouds as a shortest path linear interpolation. To that end, we introduce PointMixup, an interpolation method that generates new examples through an optimal assignment of the path function between two point clouds. We prove that our PointMixup finds the shortest path between two point clouds and that the interpolation is assignment invariant and linear. With this definition of interpolation, PointMixup allows us to introduce strong interpolation-based regularizers such as mixup and manifold mixup to the point cloud domain. Experimentally, we show the potential of PointMixup for point cloud classification, especially when examples are scarce, as well as increased robustness to noise and geometric transformations of the points. The code for PointMixup and the experimental details are publicly available.},
note = {Spotlight presentation, top 5%},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has been shown to be a simple and effective approach in the image domain. Such a mixup is, however, not directly transferable to point clouds, as we do not have a one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation between point clouds as a shortest path linear interpolation. To that end, we introduce PointMixup, an interpolation method that generates new examples through an optimal assignment of the path function between two point clouds. We prove that our PointMixup finds the shortest path between two point clouds and that the interpolation is assignment invariant and linear. With this definition of interpolation, PointMixup allows us to introduce strong interpolation-based regularizers such as mixup and manifold mixup to the point cloud domain. Experimentally, we show the potential of PointMixup for point cloud classification, especially when examples are scarce, as well as increased robustness to noise and geometric transformations of the points. The code for PointMixup and the experimental details are publicly available. |
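The core operation, interpolating two point clouds under an optimal one-to-one assignment, can be sketched as follows. This is a hedged illustration rather than the official PointMixup code linked above; it uses the Hungarian solver from SciPy as the assignment step and mixes the aligned clouds linearly.

```python
# Hedged sketch of interpolation under an optimal point assignment (assumptions; the
# official PointMixup repository is linked above). The Hungarian algorithm yields the
# one-to-one assignment minimising total pairwise distance; mixing is then linear.
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_mixup(pc_a, pc_b, lam):
    """pc_a, pc_b: (N, 3) point clouds with the same number of points; lam in [0, 1]."""
    cost = np.linalg.norm(pc_a[:, None, :] - pc_b[None, :, :], axis=-1)  # (N, N) distances
    rows, cols = linear_sum_assignment(cost)        # optimal one-to-one assignment
    pc_b_aligned = pc_b[cols]                       # reorder B to match A's points
    return lam * pc_a + (1.0 - lam) * pc_b_aligned  # linear interpolation along the path

# Toy usage with two random clouds of 256 points each
rng = np.random.default_rng(0)
a, b = rng.normal(size=(256, 3)), rng.normal(size=(256, 3))
mixed = point_mixup(a, b, lam=float(rng.beta(1.0, 1.0)))
print(mixed.shape)
```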
| Yingjun Du, Jun Xu, Huan Xiong, Qiang Qiu, Xiantong Zhen, Cees G M Snoek, Ling Shao: Learning to Learn with Variational Information Bottleneck for Domain Generalization. In: ECCV, Glasgow, UK, 2020. @inproceedings{DuECCV20,
title = {Learning to Learn with Variational Information Bottleneck for Domain Generalization},
author = {Yingjun Du and Jun Xu and Huan Xiong and Qiang Qiu and Xiantong Zhen and Cees G M Snoek and Ling Shao},
url = {https://arxiv.org/abs/2007.07645},
year = {2020},
date = {2020-08-01},
booktitle = {ECCV},
address = {Glasgow, UK},
abstract = {Domain generalization models learn to generalize to previously unseen domains, but suffer from prediction uncertainty and domain shift. In this paper, we address both problems. We introduce a probabilistic meta-learning model for domain generalization, in which classifier parameters shared across domains are modeled as distributions. This enables better handling of prediction uncertainty on unseen domains. To deal with domain shift, we learn domain-invariant representations by the proposed principle of meta variational information bottleneck, which we call MetaVIB. MetaVIB is derived from novel variational bounds of mutual information, by leveraging the meta-learning setting of domain generalization. Through episodic training, MetaVIB learns to gradually narrow domain gaps to establish domain-invariant representations, while simultaneously maximizing prediction accuracy. We conduct experiments on three benchmarks for cross-domain visual recognition. Comprehensive ablation studies validate the benefits of MetaVIB for domain generalization. The comparison results demonstrate that our method consistently outperforms previous approaches.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Domain generalization models learn to generalize to previously unseen domains, but suffer from prediction uncertainty and domain shift. In this paper, we address both problems. We introduce a probabilistic meta-learning model for domain generalization, in which classifier parameters shared across domains are modeled as distributions. This enables better handling of prediction uncertainty on unseen domains. To deal with domain shift, we learn domain-invariant representations by the proposed principle of meta variational information bottleneck, which we call MetaVIB. MetaVIB is derived from novel variational bounds of mutual information, by leveraging the meta-learning setting of domain generalization. Through episodic training, MetaVIB learns to gradually narrow domain gaps to establish domain-invariant representations, while simultaneously maximizing prediction accuracy. We conduct experiments on three benchmarks for cross-domain visual recognition. Comprehensive ablation studies validate the benefits of MetaVIB for domain generalization. The comparison results demonstrate that our method consistently outperforms previous approaches. |
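For readers unfamiliar with the building block, the sketch below shows a plain variational information bottleneck objective: a stochastic encoder, a reparameterized sample, and a KL penalty toward a standard normal prior. MetaVIB's meta-learning bounds and episodic training are not reproduced here; this is only the generic VIB idea, with illustrative names.

```python
# Minimal sketch of the standard variational information bottleneck that MetaVIB builds on
# (assumption: generic VIB, not the paper's meta-learning variational bounds).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    def __init__(self, in_dim, z_dim, num_classes):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * z_dim)     # outputs mean and log-variance
        self.classifier = nn.Linear(z_dim, num_classes)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.classifier(z), mu, logvar

def vib_loss(logits, labels, mu, logvar, beta=1e-3):
    ce = F.cross_entropy(logits, labels)
    # KL(q(z|x) || N(0, I)): the bottleneck term that discourages nuisance information.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()
    return ce + beta * kl

# Toy usage
model = VIBClassifier(in_dim=128, z_dim=32, num_classes=7)
x, y = torch.randn(16, 128), torch.randint(0, 7, (16,))
logits, mu, logvar = model(x)
print(vib_loss(logits, y, mu, logvar))
```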
| Sanath Narayan, Akshita Gupta, Fahad Khan, Cees G M Snoek, Ling Shao: Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification. In: ECCV, Glasgow, UK, 2020. @inproceedings{NarayanECCV20,
title = {Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification},
author = {Sanath Narayan and Akshita Gupta and Fahad Khan and Cees G M Snoek and Ling Shao},
url = {https://arxiv.org/abs/2003.07833
https://akshitac8.github.io/tfvaegan/
https://github.com/akshitac8/tfvaegan},
year = {2020},
date = {2020-08-01},
booktitle = {ECCV},
address = {Glasgow, UK},
abstract = {Zero-shot learning strives to classify unseen categories for which no data is available during training. In the generalized variant, the test samples can further belong to seen or unseen categories. The state-of-the-art relies on Generative Adversarial Networks that synthesize unseen class features by leveraging class-specific semantic embeddings. During training, they generate semantically consistent features, but discard this constraint during feature synthesis and classification. We propose to enforce semantic consistency at all stages of (generalized) zero-shot learning: training, feature synthesis and classification. We further introduce a feedback loop, from a semantic embedding decoder, that iteratively refines the generated features during both the training and feature synthesis stages. The synthesized features together with their corresponding latent embeddings from the decoder are transformed into discriminative features and utilized during classification to reduce ambiguities among categories. Experiments on (generalized) zero-shot learning for object and action classification reveal the benefit of semantic consistency and iterative feedback, outperforming existing methods on six zero-shot learning benchmarks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Zero-shot learning strives to classify unseen categories for which no data is available during training. In the generalized variant, the test samples can further belong to seen or unseen categories. The state-of-the-art relies on Generative Adversarial Networks that synthesize unseen class features by leveraging class-specific semantic embeddings. During training, they generate semantically consistent features, but discard this constraint during feature synthesis and classification. We propose to enforce semantic consistency at all stages of (generalized) zero-shot learning: training, feature synthesis and classification. We further introduce a feedback loop, from a semantic embedding decoder, that iteratively refines the generated features during both the training and feature synthesis stages. The synthesized features together with their corresponding latent embeddings from the decoder are transformed into discriminative features and utilized during classification to reduce ambiguities among categories. Experiments on (generalized) zero-shot learning for object and action classification reveal the benefit of semantic consistency and iterative feedback, outperforming existing methods on six zero-shot learning benchmarks. |
| Pengwan Yang, Vincent Tao Hu, Pascal Mettes, Cees G M Snoek: Localizing the Common Action Among a Few Videos. In: ECCV, Glasgow, UK, 2020. @inproceedings{YangECCV20,
title = {Localizing the Common Action Among a Few Videos},
author = {Pengwan Yang and Vincent Tao Hu and Pascal Mettes and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/yang-common-action-eccv2020.pdf
https://github.com/PengWan-Yang/commonLocalization},
year = {2020},
date = {2020-08-01},
booktitle = {ECCV},
address = {Glasgow, UK},
abstract = {This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many examples with their start, their ending, and/or the class of the action at training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many examples with their start, their ending, and/or the class of the action at training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal. |
| Xiantong Zhen, Haoliang Sun, Yingjun Du, Jun Xu, Yilong Yin, Ling Shao, Cees G M Snoek: Learning to Learn Kernels with Variational Random Features. In: ICML, Vienna, Austria, 2020. @inproceedings{ZhengICML20,
title = {Learning to Learn Kernels with Variational Random Features},
author = {Xiantong Zhen and Haoliang Sun and Yingjun Du and Jun Xu and Yilong Yin and Ling Shao and Cees G M Snoek},
url = {https://arxiv.org/abs/2006.06707
https://github.com/Yingjun-Du/MetaVRF},
year = {2020},
date = {2020-07-01},
booktitle = {ICML},
address = {Vienna, Austria},
abstract = {In this work, we introduce kernels with random Fourier features in the meta-learning framework to leverage their strong few-shot learning ability. We propose meta variational random features (MetaVRF) to learn adaptive kernels for the base-learner, which is developed in a latent variable model by treating the random feature basis as the latent variable. We formulate the optimization of MetaVRF as a variational inference problem by deriving an evidence lower bound under the meta-learning framework. To incorporate shared knowledge from related tasks, we propose a context inference of the posterior, which is established by an LSTM architecture. The LSTM-based inference network can effectively integrate the context information of previous tasks with task-specific information, generating informative and adaptive features. The learned MetaVRF can produce kernels of high representational power with a relatively low spectral sampling rate and also enables fast adaptation to new tasks. Experimental results on a variety of few-shot regression and classification tasks demonstrate that MetaVRF delivers much better, or at least competitive, performance compared to existing meta-learning alternatives.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this work, we introduce kernels with random Fourier features in the meta-learning framework to leverage their strong few-shot learning ability. We propose meta variational random features (MetaVRF) to learn adaptive kernels for the base-learner, which is developed in a latent variable model by treating the random feature basis as the latent variable. We formulate the optimization of MetaVRF as a variational inference problem by deriving an evidence lower bound under the meta-learning framework. To incorporate shared knowledge from related tasks, we propose a context inference of the posterior, which is established by an LSTM architecture. The LSTM-based inference network can effectively integrate the context information of previous tasks with task-specific information, generating informative and adaptive features. The learned MetaVRF can produce kernels of high representational power with a relatively low spectral sampling rate and also enables fast adaptation to new tasks. Experimental results on a variety of few-shot regression and classification tasks demonstrate that MetaVRF delivers much better, or at least competitive, performance compared to existing meta-learning alternatives. |
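The random Fourier feature basis at the heart of MetaVRF can be illustrated with the classic RFF approximation of an RBF kernel, shown below. The paper instead infers the bases per task with an LSTM-based variational posterior; that part is not reproduced, and all names here are illustrative.

```python
# Hedged sketch of the random Fourier feature mapping underlying MetaVRF (assumption:
# the classic RFF approximation of an RBF kernel; the per-task variational inference of
# the bases is not reproduced here).
import numpy as np

def random_fourier_features(x, num_features=256, lengthscale=1.0, rng=None):
    """x: (N, d) inputs -> (N, D) features whose inner products approximate an RBF kernel."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ W + b)

# Kernel approximation check on toy data
x = np.random.default_rng(0).normal(size=(5, 8))
z = random_fourier_features(x, num_features=4096, rng=1)
approx = z @ z.T                                                   # approximate Gram matrix
exact = np.exp(-0.5 * np.square(x[:, None] - x[None, :]).sum(-1))  # exact RBF, lengthscale 1
print(np.abs(approx - exact).max())
```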
| Yonghong Tian, Cees G M Snoek, Jingdong Wang, Zhu Liu, Rainer Lienhart, Susanne Boll: Guest Editorial Multimedia Computing With Interpretable Machine Learning. In: IEEE Transactions on Multimedia, vol. 22, no. 7, pp. 1661-1666, 2020. @article{TianTMM20,
title = {Guest Editorial Multimedia Computing With Interpretable Machine Learning},
author = {Yonghong Tian and Cees G M Snoek and Jingdong Wang and Zhu Liu and Rainer Lienhart and Susanne Boll},
year = {2020},
date = {2020-07-01},
journal = {IEEE Transactions on Multimedia},
volume = {22},
number = {7},
pages = {1661-1666},
abstract = {The papers in this special section broadly engage the machine learning and multimedia communities on the emerging yet challenging topic of interpretable machine learning. Multimedia is increasingly becoming the “biggest big data,” among the most important and valuable sources of insight and information. Many powerful machine learning algorithms, especially deep learning models such as convolutional neural networks (CNNs), have recently achieved outstanding predictive performance in a wide range of multimedia applications, including visual object classification, scene understanding, speech recognition, and activity prediction. Nevertheless, most deep learning algorithms are generally conceived as black-box methods, and it is difficult to intuitively and quantitatively understand the results of their prediction and inference. Since this lack of interpretability is a major bottleneck in designing more successful predictive models and exploring a wider range of useful applications, there has been an explosion of interest in interpreting the representations learned by these models, with profound implications for research into interpretable machine learning in the multimedia community.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
The papers in this special section broadly engage the machine learning and multimedia communities on the emerging yet challenging topic of interpretable machine learning. Multimedia is increasingly becoming the “biggest big data,” among the most important and valuable sources of insight and information. Many powerful machine learning algorithms, especially deep learning models such as convolutional neural networks (CNNs), have recently achieved outstanding predictive performance in a wide range of multimedia applications, including visual object classification, scene understanding, speech recognition, and activity prediction. Nevertheless, most deep learning algorithms are generally conceived as black-box methods, and it is difficult to intuitively and quantitatively understand the results of their prediction and inference. Since this lack of interpretability is a major bottleneck in designing more successful predictive models and exploring a wider range of useful applications, there has been an explosion of interest in interpreting the representations learned by these models, with profound implications for research into interpretable machine learning in the multimedia community. |
| Kirill Gavrilyuk, Ryan Sanford, Mehrsan Javan, Cees G M Snoek: Actor-Transformers for Group Activity Recognition. In: CVPR, Seattle, USA, 2020. @inproceedings{GavrilyukCVPR20,
title = {Actor-Transformers for Group Activity Recognition},
author = {Kirill Gavrilyuk and Ryan Sanford and Mehrsan Javan and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavrilyuk-transformers-cvpr2020.pdf},
year = {2020},
date = {2020-06-01},
booktitle = {CVPR},
address = {Seattle, USA},
abstract = {This paper strives to recognize individual actions and group activities from videos. While existing solutions for this challenging problem explicitly model spatial and temporal relationships based on location of individual actors, we propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition. We feed the transformer with rich actor-specific static and dynamic representations expressed by features from a 2D pose network and 3D CNN, respectively. We empirically study different ways to combine these representations and show their complementary benefits. Experiments show what is important to transform and how it should be transformed. What is more, actor-transformers achieve state-of-the-art results on two publicly available benchmarks for group activity recognition, outperforming the previous best published results by a considerable margin.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper strives to recognize individual actions and group activities from videos. While existing solutions for this challenging problem explicitly model spatial and temporal relationships based on location of individual actors, we propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition. We feed the transformer with rich actor-specific static and dynamic representations expressed by features from a 2D pose network and 3D CNN, respectively. We empirically study different ways to combine these representations and show their complementary benefits. Experiments show what is important to transform and how it should be transformed. What is more, actor-transformers achieve state-of-the-art results on two publicly available benchmarks for group activity recognition, outperforming the previous best published results by a considerable margin. |
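A minimal sketch of the central idea, letting detected actors attend to each other with a standard transformer encoder before predicting the group activity, is given below. It assumes per-actor features are already extracted (the paper obtains them from a 2D pose network and a 3D CNN); the module and parameter names are illustrative.

```python
# Hedged sketch of applying a transformer encoder over per-actor features (assumption: a
# vanilla nn.TransformerEncoder; the actor-specific feature extractors are not included).
import torch
import torch.nn as nn

class ActorTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=2, num_activities=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.group_head = nn.Linear(feat_dim, num_activities)

    def forward(self, actor_feats):
        """actor_feats: (B, num_actors, feat_dim) -> group-activity logits (B, num_activities)."""
        refined = self.encoder(actor_feats)      # actors attend to each other
        pooled = refined.max(dim=1).values       # pool over actors for the group label
        return self.group_head(pooled)

# Toy usage: a batch of 4 clips with 12 detected actors each
model = ActorTransformer()
logits = model(torch.randn(4, 12, 256))
print(logits.shape)   # torch.Size([4, 8])
```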
| Mihir Jain, Amir Ghodrati, Cees G M Snoek: ActionBytes: Learning from Trimmed Videos to Localize Actions. In: CVPR, Seattle, USA, 2020. @inproceedings{JainCVPR20,
title = {ActionBytes: Learning from Trimmed Videos to Localize Actions},
author = {Mihir Jain and Amir Ghodrati and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-actionbytes-cvpr2020.pdf},
year = {2020},
date = {2020-06-01},
booktitle = {CVPR},
address = {Seattle, USA},
abstract = {This paper tackles the problem of localizing actions in long untrimmed videos. Different from existing works, which all use annotated untrimmed videos during training, we learn only from short trimmed videos. This enables learning from large-scale datasets originally designed for action classification. We propose a method to train an action localization network that segments a video into interpretable fragments, which we call ActionBytes. Our method jointly learns to cluster ActionBytes and trains the localization network using the cluster assignments as pseudo-labels. By doing so, we train on short trimmed videos that become untrimmed for ActionBytes. In isolation, or when merged, the ActionBytes also serve as effective action proposals. Experiments demonstrate that our boundary-guided training generalizes to unknown action classes and localizes actions in long videos of Thumos14, MultiThumos, and ActivityNet1.2. Furthermore, we show the advantage of ActionBytes for zero-shot localization as well as traditional weakly supervised localization methods that train on long videos, achieving state-of-the-art results.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper tackles the problem of localizing actions in long untrimmed videos. Different from existing works, which all use annotated untrimmed videos during training, we learn only from short trimmed videos. This enables learning from large-scale datasets originally designed for action classification. We propose a method to train an action localization network that segments a video into interpretable fragments, which we call ActionBytes. Our method jointly learns to cluster ActionBytes and trains the localization network using the cluster assignments as pseudo-labels. By doing so, we train on short trimmed videos that become untrimmed for ActionBytes. In isolation, or when merged, the ActionBytes also serve as effective action proposals. Experiments demonstrate that our boundary-guided training generalizes to unknown action classes and localizes actions in long videos of Thumos14, MultiThumos, and ActivityNet1.2. Furthermore, we show the advantage of ActionBytes for zero-shot localization as well as traditional weakly supervised localization methods that train on long videos, achieving state-of-the-art results. |
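The pseudo-labelling idea can be illustrated with plain k-means over fragment features, as below. This is only a stand-in for the paper's joint clustering and localization training; the function name and feature dimensions are assumptions.

```python
# Hedged sketch of the pseudo-labelling step (assumption: plain k-means over segment
# features, standing in for the paper's joint clustering and localization training).
import numpy as np
from sklearn.cluster import KMeans

def actionbyte_pseudo_labels(segment_features, num_actionbytes=64, seed=0):
    """segment_features: (N, d) features of short video fragments -> (N,) cluster pseudo-labels."""
    kmeans = KMeans(n_clusters=num_actionbytes, random_state=seed, n_init=10)
    return kmeans.fit_predict(segment_features)

# Toy usage: 2,000 fragment features of dimension 512
feats = np.random.default_rng(0).normal(size=(2000, 512))
pseudo = actionbyte_pseudo_labels(feats, num_actionbytes=32)
print(np.bincount(pseudo)[:5])   # fragments assigned to the first few clusters
```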
| Teng Long, Pascal Mettes, Heng Tao Shen, Cees G M Snoek: Searching for Actions on the Hyperbole. In: CVPR, Seattle, USA, 2020. @inproceedings{LongCVPR20,
title = {Searching for Actions on the Hyperbole},
author = {Teng Long and Pascal Mettes and Heng Tao Shen and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/long-hyperbole-cvpr2020.pdf
https://github.com/Tenglon/hyperbolic_action},
year = {2020},
date = {2020-06-01},
booktitle = {CVPR},
address = {Seattle, USA},
abstract = {In this paper, we introduce hierarchical action search. Starting from the observation that hierarchies are mostly ignored in the action literature, we retrieve not only individual actions but also relevant and related actions, given an action name or video example as input. We propose a hyperbolic action network, which is centered around a hyperbolic space shared by action hierarchies and videos. Our discriminative hyperbolic embedding projects actions on the shared space while jointly optimizing hypernym-hyponym relations between action pairs and a large margin separation between all actions. The projected actions serve as hyperbolic prototypes that we match with projected video representations. The result is a learned space where videos are positioned in entailment cones formed by different subtrees. To perform search in this space, we start from a query and increasingly enlarge its entailment cone to retrieve hierarchically relevant action videos. Experiments on three action datasets with new hierarchy annotations show the effectiveness of our approach for hierarchical action search by name and by video example, regardless of whether queried actions have been seen or not during training. Our implementation is available at https://github.com/Tenglon/hyperbolic_action},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper, we introduce hierarchical action search. Starting from the observation that hierarchies are mostly ignored in the action literature, we retrieve not only individual actions but also relevant and related actions, given an action name or video example as input. We propose a hyperbolic action network, which is centered around a hyperbolic space shared by action hierarchies and videos. Our discriminative hyperbolic embedding projects actions on the shared space while jointly optimizing hypernym-hyponym relations between action pairs and a large margin separation between all actions. The projected actions serve as hyperbolic prototypes that we match with projected video representations. The result is a learned space where videos are positioned in entailment cones formed by different subtrees. To perform search in this space, we start from a query and increasingly enlarge its entailment cone to retrieve hierarchically relevant action videos. Experiments on three action datasets with new hierarchy annotations show the effectiveness of our approach for hierarchical action search by name and by video example, regardless of whether queried actions have been seen or not during training. Our implementation is available at https://github.com/Tenglon/hyperbolic_action |
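Matching a video embedding against hyperbolic action prototypes boils down to computing Poincaré-ball distances. The sketch below implements that distance and a toy ranking; the entailment cones and hierarchy-aware training of the paper are not reproduced, and all names are illustrative.

```python
# Hedged sketch of matching video embeddings to hyperbolic action prototypes via the
# Poincare distance (assumption: a generic Poincare-ball distance; the paper's entailment
# cones and hierarchy-aware objective are not reproduced).
import torch
import torch.nn.functional as F

def poincare_distance(u, v, eps=1e-6):
    """Distance in the Poincare ball between u: (..., d) and v: (..., d) with norms < 1."""
    sq_u = u.pow(2).sum(-1).clamp(max=1 - eps)
    sq_v = v.pow(2).sum(-1).clamp(max=1 - eps)
    sq_diff = (u - v).pow(2).sum(-1)
    x = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return torch.acosh(x.clamp(min=1.0 + eps))

# Toy retrieval: rank 20 action prototypes by hyperbolic distance to a video embedding.
video = 0.3 * F.normalize(torch.randn(128), dim=0)          # inside the unit ball
prototypes = 0.5 * F.normalize(torch.randn(20, 128), dim=-1)
d = poincare_distance(video.unsqueeze(0).expand_as(prototypes), prototypes)
print(d.argsort()[:5])   # closest actions first
```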
| Tom Runia, Kirill Gavrilyuk, Cees G M Snoek, Arnold W M Smeulders: Cloth in the Wind: A Case Study of Physical Measurement through Simulation. In: CVPR, Seattle, USA, 2020. @inproceedings{RuniaCVPR20,
title = {Cloth in the Wind: A Case Study of Physical Measurement through Simulation},
author = {Tom Runia and Kirill Gavrilyuk and Cees G M Snoek and Arnold W M Smeulders},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/runia-cloth-cvpr2020.pdf
https://tomrunia.github.io/projects/cloth/},
year = {2020},
date = {2020-06-01},
booktitle = {CVPR},
address = {Seattle, USA},
abstract = {For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, measuring physical properties from visual observations is challenging due to the high number of causally underlying physical parameters - including material properties and external forces. In this paper, we propose to measure latent physical properties for cloth in the wind without ever having seen a real example before. Our solution is an iterative refinement procedure with simulation at its core. The algorithm gradually updates the physical model parameters by running a simulation of the observed phenomenon and comparing the current simulation to a real-world observation. The correspondence is measured using an embedding function that maps physically similar examples to nearby points. We consider a case study of cloth in the wind, with curling flags as our leading example - a seemingly simple phenomenon but physically highly involved. Based on the physics of cloth and its visual manifestation, we propose an instantiation of the embedding function. For this mapping, modeled as a deep network, we introduce a spectral layer that decomposes a video volume into its temporal spectral power and corresponding frequencies. Our experiments demonstrate that the proposed method compares favorably to prior work on the task of measuring cloth material properties and external wind force from a real-world video.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, measuring physical properties from visual observations is challenging due to the high number of causally underlying physical parameters - including material properties and external forces. In this paper, we propose to measure latent physical properties for cloth in the wind without ever having seen a real example before. Our solution is an iterative refinement procedure with simulation at its core. The algorithm gradually updates the physical model parameters by running a simulation of the observed phenomenon and comparing the current simulation to a real-world observation. The correspondence is measured using an embedding function that maps physically similar examples to nearby points. We consider a case study of cloth in the wind, with curling flags as our leading example - a seemingly simple phenomenon but physically highly involved. Based on the physics of cloth and its visual manifestation, we propose an instantiation of the embedding function. For this mapping, modeled as a deep network, we introduce a spectral layer that decomposes a video volume into its temporal spectral power and corresponding frequencies. Our experiments demonstrate that the proposed method compares favorably to prior work on the task of measuring cloth material properties and external wind force from a real-world video. |
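The spectral layer described above can be approximated, in spirit, by a real FFT over the temporal axis of a video volume, which yields the per-frequency power used for comparison. The sketch below is an assumption-based illustration, not the paper's exact layer.

```python
# Hedged sketch of a temporal spectral decomposition (assumption: a plain rFFT power
# spectrum over the time axis, in the spirit of the paper's spectral layer).
import torch

def temporal_power_spectrum(video_volume, fps=25.0):
    """video_volume: (B, C, T, H, W) -> power per frequency (B, C, T//2+1, H, W) and the frequencies."""
    spectrum = torch.fft.rfft(video_volume, dim=2)   # FFT along the temporal axis
    power = spectrum.abs() ** 2
    freqs = torch.fft.rfftfreq(video_volume.shape[2], d=1.0 / fps)
    return power, freqs

# Toy usage: a 2-second clip at 25 fps
clip = torch.randn(1, 3, 50, 32, 32)
power, freqs = temporal_power_spectrum(clip)
print(power.shape, freqs[:5])
```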
| Shuo Chen, Pascal Mettes, Tao Hu, Cees G M Snoek: Interactivity Proposals for Surveillance Videos. In: ICMR, Dublin, Ireland, 2020. @inproceedings{ChenICMR20,
title = {Interactivity Proposals for Surveillance Videos},
author = {Shuo Chen and Pascal Mettes and Tao Hu and Cees G M Snoek},
url = {http://isis-data.science.uva.nl/cgmsnoek/pub/chen-interactivity-icmr2020.pdf
https://github.com/shanshuo/Interactivity_Proposals},
year = {2020},
date = {2020-06-01},
booktitle = {ICMR},
address = {Dublin, Ireland},
abstract = {This paper introduces spatio-temporal interactivity proposals for video surveillance. Rather than focusing solely on actions performed by subjects, we explicitly include the objects that the subjects interact with. To enable interactivity proposals, we introduce the notion of interactivityness, a score that reflects the likelihood that a subject and object have an interplay. For its estimation, we propose a network containing an interactivity block and geometric encoding between subjects and objects. The network computes local interactivity likelihoods from subject and object trajectories, which we use to link intervals of high scores into spatio-temporal proposals. Experiments on an interactivity dataset with new evaluation metrics show the general benefit of interactivity proposals as well as its favorable performance compared to traditional temporal and spatio-temporal action proposals.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
This paper introduces spatio-temporal interactivity proposals for video surveillance. Rather than focusing solely on actions performed by subjects, we explicitly include the objects that the subjects interact with. To enable interactivity proposals, we introduce the notion of interactivityness, a score that reflects the likelihood that a subject and object have an interplay. For its estimation, we propose a network containing an interactivity block and geometric encoding between subjects and objects. The network computes local interactivity likelihoods from subject and object trajectories, which we use to link intervals of high scores into spatio-temporal proposals. Experiments on an interactivity dataset with new evaluation metrics show the general benefit of interactivity proposals as well as its favorable performance compared to traditional temporal and spatio-temporal action proposals. |
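Linking local interactivity likelihoods into temporal proposals can be sketched as thresholding a score sequence and keeping contiguous high-score intervals, as below. The scoring network itself is not reproduced; the threshold and minimum length are illustrative parameters.

```python
# Hedged sketch of linking per-frame interactivity scores into temporal proposals
# (assumption about the linking step only; not the authors' implementation).
import numpy as np

def link_intervals(scores, threshold=0.5, min_length=3):
    """scores: (T,) per-frame interactivity likelihoods -> list of (start, end) proposals."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                              # interval opens
        elif s < threshold and start is not None:
            if t - start >= min_length:
                proposals.append((start, t))       # interval closes
            start = None
    if start is not None and len(scores) - start >= min_length:
        proposals.append((start, len(scores)))     # interval runs to the end
    return proposals

# Toy usage on a synthetic score sequence
scores = np.concatenate([np.full(10, 0.2), np.full(8, 0.9), np.full(5, 0.1), np.full(6, 0.8)])
print(link_intervals(scores))   # [(10, 18), (23, 29)]
```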
| Pascal Mettes, Dennis C Koelma, Cees G M Snoek: Shuffled ImageNet-Banks for Video Event Detection and Search. In: ACM Transactions on Multimedia Computing, Communications and Applications, vol. 16, no. 2, 2020. @article{MettesTOMCCAP20,
title = {Shuffled ImageNet-Banks for Video Event Detection and Search},
author = {Pascal Mettes and Dennis C Koelma and Cees G M Snoek},
url = {https://dl.acm.org/doi/pdf/10.1145/3377875
https://github.com/psmmettes/shuffled-imagenet-bank},
year = {2020},
date = {2020-05-01},
journal = {ACM Transactions on Multimedia Computing, Communications and Applications},
volume = {16},
number = {2},
abstract = {This paper aims for the detection and search of events in videos, where video examples are either scarce or even absent during training. To enable such event detection and search, ImageNet concept banks have been shown to be effective. Rather than employing the standard concept bank of 1,000 ImageNet classes, we leverage the full 21,841-class dataset. We identify two problems with using the full dataset: (i) there is an imbalance between the number of examples per concept and (ii) not all concepts are equally relevant for events. In this paper, we propose to balance large-scale image hierarchies for pre-training. We shuffle concepts based on bottom-up and top-down operations to overcome the problems of example imbalance and concept relevance. Using this strategy, we arrive at the shuffled ImageNet-bank, a concept bank with an order of magnitude more concepts compared to standard ImageNet banks. For event detection, this results in more discriminative representations to train event models from the limited video event examples provided during training. For event search, the broad range of concepts enables a closer match between textual queries of events and concept detections in videos. Experimentally, we show the benefit of the proposed bank for event detection and event search, with state-of-the-art performance for both tasks on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
This paper aims for the detection and search of events in videos, where video examples are either scarce or even absent during training. To enable such event detection and search, ImageNet concept banks have been shown to be effective. Rather than employing the standard concept bank of 1,000 ImageNet classes, we leverage the full 21,841-class dataset. We identify two problems with using the full dataset: (i) there is an imbalance between the number of examples per concept and (ii) not all concepts are equally relevant for events. In this paper, we propose to balance large-scale image hierarchies for pre-training. We shuffle concepts based on bottom-up and top-down operations to overcome the problems of example imbalance and concept relevance. Using this strategy, we arrive at the shuffled ImageNet-bank, a concept bank with an order of magnitude more concepts compared to standard ImageNet banks. For event detection, this results in more discriminative representations to train event models from the limited video event examples provided during training. For event search, the broad range of concepts enables a closer match between textual queries of events and concept detections in videos. Experimentally, we show the benefit of the proposed bank for event detection and event search, with state-of-the-art performance for both tasks on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks. |
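For event search with a concept bank, a video can be ranked by a weighted sum of its concept detection scores, with weights derived from the similarity between the textual query and the concept names. The sketch below assumes those query-to-concept similarities are computed elsewhere (for instance from word embeddings); all names are illustrative.

```python
# Hedged sketch of concept-bank video search (assumptions throughout): rank videos by a
# weighted sum of concept detection scores, weighted by query-to-concept similarities.
import numpy as np

def rank_videos(concept_scores, query_concept_sim, top_k_concepts=50):
    """concept_scores: (V, C) per-video concept detections; query_concept_sim: (C,) similarities."""
    weights = np.zeros_like(query_concept_sim)
    top = np.argsort(-query_concept_sim)[:top_k_concepts]   # keep only the most relevant concepts
    weights[top] = query_concept_sim[top]
    video_scores = concept_scores @ weights
    return np.argsort(-video_scores), video_scores

# Toy usage: a full 21,841-concept bank would work the same way; a small random
# example keeps this runnable.
rng = np.random.default_rng(0)
scores = rng.random((100, 500))   # 100 videos, 500 concepts
sim = rng.random(500)             # query-to-concept similarities
order, _ = rank_videos(scores, sim)
print(order[:5])
```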