The ICMR2016 paper “The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection” by Pascal Mettes, Dennis Koelma and Cees Snoek is now available. This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, who all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pre-trained with our reorganized hierarchy i) improves over standard pre-training, ii) is complementary among different reorganizations, iii) maintains the benefits of fusion with other modalities, and iv) leads to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/imagenetshuffle.
The ICCV 2015 paper Objects2action: Classifying and localizing actions without any video example by Mihir Jain, Jan van Gemert, Thomas Mensink and Cees Snoek is now available. The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.
The ICCV 2015 paper Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks by Efstratios Gavves, Thomas Mensink, Tatiana Tommasi, Cees Snoek and Tinne Tuytelaars is now available. How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to the existing classifiers. By revisiting the dual formulation of adaptive SVM, we reveal two basic conditions to choose greedily only the most relevant samples to be annotated. On this basis we propose an effective active learning algorithm which learns the best possible target classification model with minimum human labeling effort. Extensive experiments on two challenging datasets show the value of our approach compared to the state-of-the-art active learning methodologies, as well as its potential to reuse past datasets with minimal effort for future tasks.
The ACM Multimedia paper Image2Emoji: Zero-shot Emoji Prediction for Visual Media by Spencer Cappallo, Thomas Mensink, and Cees Snoek is now available. We present Image2Emoji, a multi-modal approach for generating emoji labels for an image in a zero-shot manner. Different from existing zero-shot image-to-text approaches, we exploit both image and textual media to learn a semantic embedding for the new task of emoji prediction. We propose that the widespread adoption of emoji suggests a semantic universality which is well-suited for interaction with visual media. We quantify the efficacy of our proposed model on the MSCOCO dataset, and demonstrate the value of visual, textual and multi-modal prediction of emoji. We conclude the paper with three examples of the application potential of emoji in the context of multimedia retrieval.
The BMVC2015 paper entitled APT: Action localization proposals from dense trajectories by Jan van Gemert, Mihir Jain, Ella Gati and Cees Snoek is now available. This paper is on action localization in video with the aid of spatio-temporal proposals. To alleviate the computational expensive segmentation step of existing proposals, we propose bypassing the segmentations completely by generating proposals directly from the dense trajectories used to represent videos during classification. Our Action localization Proposals from dense Trajectories (APT) use an efficient proposal generation algorithm to handle the high number of trajectories in a video. Our spatio-temporal proposals are faster than current methods and outperform the localization and classification accuracy of current proposals on the UCF Sports, UCF 101, and MSR-II video datasets.
The BMVC2015 paper entitled Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams by Markus Nagel, Thomas Mensink and Cees Snoek is now available. In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student?s-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets.
This summer Qualcomm, the world-leader in mobile chip-design, and the University of Amsterdam, a world-leading computer science department, have started a joint research lab in Amsterdam, the Netherlands, as a great opportunity to join the best of academic and industrial research. Leading the lab are profs. Max Welling (machine learning), Arnold Smeulders (computer vision analysis), and Cees Snoek (image categorization). The lab will pursue world-class research on computer vision and machine learning. We are looking for 3 postdoctoral researchers and 8 PhD candidates in Computer Vision and Deep Learning.
The ICMR2015 paper entitled Latent Factors of Visual Popularity Prediction by Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek is now available. Predicting the popularity of an image on social networks based solely on its visual content is a difficult problem. One image may become widely distributed and repeatedly shared, while another similar image may be totally overlooked. We aim to gain insight into how visual content affects image popularity. We propose a latent ranking approach that takes into account not only the distinctive visual cues in popular images, but also those in unpopular images. This method is evaluated on two existing datasets collected from photo-sharing websites, as well as a new proposed dataset of images from the microblogging website Twitter. Our experiments investigate factors of the ranking model, the level of user engagement in scoring popularity, and whether the discovered senses are meaningful. The proposed approach yields state of the art results, and allows for insight into the semantics of image popularity on social networks.