# Publications

## Conference Papers

1. Zenglin Shi, Pascal Mettes, and Cees G. M. Snoek, "Counting with Focus for Free," arXiv preprint, 2019.
@INPROCEEDINGS{ShiARXIVE19,   author = {Zenglin Shi and Pascal Mettes and Cees G. M. Snoek},   title = {Counting with Focus for Free},   booktitle = {arXiv},   month = {},   year = {2019},   address = {},   pdf = {https://arxiv.org/abs/1903.12206},   abstract = { This paper aims to count arbitrary objects in images. The leading counting approaches start from point annotations per object from which they construct density maps. Then, their training objective transforms input images to density maps through deep convolutional networks. We posit that the point annotations serve more supervision purposes than just constructing density maps. We introduce ways to repurpose the points for free. First, we propose supervised focus from segmentation, where points are converted into binary maps. The binary maps are combined with a network branch and accompanying loss function to focus on areas of interest. Second, we propose supervised focus from global density, where the ratio of point annotations to image pixels is used in another branch to regularize the overall density estimation. To assist both the density estimation and the focus from segmentation, we also introduce an improved kernel size estimator for the point annotations. Experiments on four datasets show that all our contributions reduce the counting error, regardless of the base network, resulting in state-of-the-art accuracy using only a single network. Finally, we are the first to count on WIDER FACE, allowing us to show the benefits of our approach in handling varying object scales and crowding levels. } }
2. William Thong, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Cooperative Embeddings for Instance, Attribute and Category Retrieval," arXiv preprint, 2019.
@INPROCEEDINGS{ThongARXIVE19,   author = {William Thong and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Cooperative Embeddings for Instance, Attribute and Category Retrieval},   booktitle = {arXiv},   month = {},   year = {2019},   address = {},   pdf = {https://arxiv.org/abs/1904.01421},   abstract = { The goal of this paper is to retrieve an image based on instance, attribute and category similarity notions. Different from existing works, which usually address only one of these entities in isolation, we introduce a cooperative embedding to integrate them while preserving their specific level of semantic representation. An algebraic structure defines a superspace filled with instances. Attributes are axis-aligned to form subspaces, while categories influence the arrangement of similar instances. These relationships enable them to cooperate for their mutual benefits for image retrieval. We derive a proxy-based softmax embedding loss to learn simultaneously all similarity measures in both superspace and subspaces. We evaluate our model on datasets from two different domains. Experiments on image retrieval tasks show the benefits of the cooperative embeddings for modeling multiple image similarities, and for discovering style evolution of instances between- and within-categories. } }
3. Federico Landi, Cees G. M. Snoek, and Rita Cucchiara, "Anomaly Locality in Video Surveillance," arXiv preprint, 2019.
@INPROCEEDINGS{LandiARXIVE19,   author = {Federico Landi and Cees G. M. Snoek and Rita Cucchiara},   title = {Anomaly Locality in Video Surveillance},   booktitle = {arXiv},   month = {},   year = {2019},   address = {},   pdf = {https://arxiv.org/abs/1901.10364},   data = {http://imagelab.ing.unimore.it/UCFCrime2Local},   abstract = { This paper strives for the detection of real-world anomalies such as burglaries and assaults in surveillance videos. Although anomalies are generally local, as they happen in a limited portion of the frame, none of the previous works on the subject has ever studied the contribution of locality. In this work, we explore the impact of considering spatiotemporal tubes instead of whole-frame video segments. For this purpose, we enrich existing surveillance videos with spatial and temporal annotations: it is the first dataset for anomaly detection with bounding box supervision in both its train and test set. Our experiments show that a network trained with spatiotemporal tubes performs better than its analogous model trained with whole-frame videos. In addition, we discover that the locality is robust to different kinds of errors in the tube extraction phase at test time. Finally, we demonstrate that our network can provide spatiotemporal proposals for unseen surveillance videos leveraging only video-level labels. By doing so, we enlarge our spatiotemporal anomaly dataset without the need for further human labeling. } }
4. Shuai Liao, Efstratios Gavves, and Cees G. M. Snoek, "Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on n-Spheres," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019.
@INPROCEEDINGS{LiaoCVPR19,   author = {Shuai Liao and Efstratios Gavves and Cees G. M. Snoek},   title = {Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on n-Spheres},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2019},   address = {Long Beach, USA},   pdf = {https://arxiv.org/abs/1904.05404},   abstract = { Many computer vision challenges require continuous outputs, but tend to be solved by discrete classification. The reason is classification's natural containment within a probability n-simplex, as defined by the popular softmax activation function. Regular regression lacks such a closed geometry, leading to unstable training and convergence to suboptimal local minima. Starting from this insight we revisit regression in convolutional neural networks. We observe many continuous output problems in computer vision are naturally contained in closed geometrical manifolds, like the Euler angles in viewpoint estimation or the normals in surface normal estimation. A natural framework for posing such continuous output problems are n-spheres, which are naturally closed geometric manifolds defined in the R^{(n+1)} space. By introducing a spherical exponential mapping on n-spheres at the regression output, we obtain well-behaved gradients, leading to stable training. We show how our spherical regression can be utilized for several computer vision challenges, specifically viewpoint estimation, surface normal estimation and 3D rotation estimation. For all these problems our experiments demonstrate the benefit of spherical regression. All paper resources are available at https://github.com/leoshine/Spherical_Regression. } }
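The abstract above describes a spherical exponential mapping that constrains regression outputs to the unit n-sphere so that gradients stay well behaved. As an illustration only (the paper defines the exact mapping; `spherical_exp` is a hypothetical name, and the sketch assumes exponentiation followed by L2 normalization), a minimal NumPy version:

```python
import numpy as np

def spherical_exp(o):
    """Map raw regression outputs onto the unit n-sphere.

    Exponentiation keeps every coordinate positive and the L2
    normalization places the result on the sphere, so the output
    lives on a closed geometric manifold, as the abstract describes.
    """
    e = np.exp(o - o.max())  # subtract max for numerical stability
    return e / np.linalg.norm(e)

p = spherical_exp(np.array([0.5, -1.2, 2.0]))
print(np.linalg.norm(p))  # unit norm: the output lies on the sphere
```

Because every coordinate stays positive, this sketch covers only one orthant of the sphere; the paper addresses the full sphere, so treat this purely as a sketch of the closed-geometry idea, not the authors' implementation.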
5. Jiaojiao Zhao and Cees G. M. Snoek, "Dance with Flow: Two-in-One Stream Action Detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019.
@INPROCEEDINGS{ZhaoCVPR19,   author = {Jiaojiao Zhao and Cees G. M. Snoek},   title = {Dance with Flow: Two-in-One Stream Action Detection},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2019},   address = {Long Beach, USA},   pdf = {https://arxiv.org/abs/1904.00696},   abstract = { The goal of this paper is to detect the spatio-temporal extent of an action. The two-stream detection network based on RGB and flow provides state-of-the-art accuracy at the expense of a large model-size and heavy computation. We propose to embed RGB and optical-flow into a single two-in-one stream network with new layers. A motion condition layer extracts motion information from flow images, which is leveraged by the motion modulation layer to generate transformation parameters for modulating the low-level RGB features. The method is easily embedded in existing appearance- or two-stream action detection networks, and trained end-to-end. Experiments demonstrate that leveraging the motion condition to modulate RGB features improves detection accuracy. With only half the computation and parameters of the state-of-the-art two-stream methods, our two-in-one stream still achieves impressive results on UCF101-24, UCFSports and J-HMDB. } }
6. Pascal Mettes, Elise van der Pol, and Cees G. M. Snoek, "Hyperspherical Prototype Networks," arXiv preprint, 2019.
@INPROCEEDINGS{MettesARXIVE19,   author = {Pascal Mettes and Elise van der Pol and Cees G. M. Snoek},   title = {Hyperspherical Prototype Networks},   booktitle = {arXiv},   month = {},   year = {2019},   address = {},   pdf = {https://arxiv.org/abs/1901.10514},   abstract = { This paper introduces hyperspherical prototype networks, which unify regression and classification by prototypes on hyperspherical output spaces. Rather than defining prototypes as the mean output vector over training examples per class, we propose hyperspheres as output spaces to define class prototypes a priori with large margin separation. By doing so, we do not require any prototype updating, we can handle any training size, and the output dimensionality is no longer constrained to the number of classes. Furthermore, hyperspherical prototype networks generalize to regression, by optimizing outputs as an interpolation between two prototypes on the hypersphere. Since both tasks are now defined by the same loss function, they can be jointly optimized for multi-task problems. Experimental evaluation shows the benefits of hyperspherical prototype networks for classification, regression, and their combination. } }
7. Tao Hu, Pengwan Yang, Chiliang Zhang, Gang Yu, Yadong Mu, and Cees Snoek, "Attention-based Multi-Context Guiding for Few-Shot Semantic Segmentation," in AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, 2019.
@INPROCEEDINGS{HuAAAI19,   author = {Tao Hu and Pengwan Yang and Chiliang Zhang and Gang Yu and Yadong Mu and Cees Snoek},   title = {Attention-based Multi-Context Guiding for Few-Shot Semantic Segmentation},   booktitle = {AAAI Conference on Artificial Intelligence},   month = {January},   year = {2019},   address = {Honolulu, Hawaii, USA},   pdf = {},   abstract = { } }
8. Amir Ghodrati, Efstratios Gavves, and Cees G. M. Snoek, "Video Time: Properties, Encoders and Evaluation," in Proceedings of the British Machine Vision Conference, Newcastle upon Tyne, UK, 2018.
Spotlight presentation, top 6.6%
@INPROCEEDINGS{GhodratiBMVC18,   author = {Amir Ghodrati and Efstratios Gavves and Cees G. M. Snoek},   title = {Video Time: Properties, Encoders and Evaluation},   booktitle = {Proceedings of the British Machine Vision Conference},   month = {September},   year = {2018},   pages = {},   address = {Newcastle upon Tyne, UK},   pdf = {https://arxiv.org/abs/1807.06980},   note = {Spotlight presentation, top 6.6%},   abstract = { Time-aware encoding of frame sequences in a video is a fundamental problem in video understanding. While many have attempted to model time in videos, an explicit study on quantifying video time is missing. To fill this lacuna, we aim to evaluate video time explicitly. We describe three properties of video time, namely a) temporal asymmetry, b) temporal continuity and c) temporal causality. Based on each we formulate a task able to quantify the associated property. This allows assessing the effectiveness of modern video encoders, like C3D and LSTM, in their ability to model time. Our analysis provides insights about existing encoders while also leading us to propose a new video time encoder, which is better suited for the video time recognition tasks than C3D and LSTM. We believe the proposed meta-analysis can provide a reasonable baseline to assess video time encoders on equal grounds on a set of temporal-aware tasks. } }
9. Jiaojiao Zhao, Li Liu, Cees G. M. Snoek, Jungong Han, and Ling Shao, "Pixel-level Semantics Guided Image Colorization," in Proceedings of the British Machine Vision Conference, Newcastle upon Tyne, UK, 2018.
Oral presentation, top 4.3%
@INPROCEEDINGS{ZhaoBMVC18,   author = {Jiaojiao Zhao and Li Liu and Cees G. M. Snoek and Jungong Han and Ling Shao},   title = {Pixel-level Semantics Guided Image Colorization},   booktitle = {Proceedings of the British Machine Vision Conference},   month = {September},   year = {2018},   pages = {},   address = {Newcastle upon Tyne, UK},   pdf = {https://arxiv.org/abs/1808.01597},   note = {Oral presentation, top 4.3%},   abstract = { While many image colorization algorithms have recently shown the capability of producing plausible color versions from gray-scale photographs, they still suffer from the problems of context confusion and edge color bleeding. To address context confusion, we propose to incorporate the pixel-level object semantics to guide the image colorization. The rationale is that human beings perceive and distinguish colors based on the object's semantic categories. We propose a hierarchical neural network with two branches. One branch learns what the object is while the other branch learns the object's colors. The network jointly optimizes a semantic segmentation loss and a colorization loss. To attack edge color bleeding we generate more continuous color maps with sharp edges by adopting a joint bilateral upsampling layer at inference. Our network is trained on PASCAL VOC2012 and COCO-stuff with semantic segmentation labels and it produces more realistic and finer results compared to the colorization state-of-the-art. } }
10. Gosia Migut, Dennis Koelma, Cees G. M. Snoek, and Natasa Brouwer-Zupancic, "Cheat me not: automated proctoring of digital exams on Bring-Your-Own-Device," in ACM Conference on Innovation and Technology in Computer Science Education, Larnaca, Cyprus, 2018.
@INPROCEEDINGS{MigutITiCSE18,   author = {Gosia Migut and Dennis Koelma and Cees G. M. Snoek and Natasa Brouwer-Zupancic},   title = {Cheat me not: automated proctoring of digital exams on Bring-Your-Own-Device},   booktitle = {ACM Conference on Innovation and Technology in Computer Science Education},   month = {July},   year = {2018},   address = {Larnaca, Cyprus},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/migut-proctoring-itcse2018.pdf},   abstract = { Detecting fraud in digital assessment is currently done by a human proctor who observes recordings of the exam. This is a costly, tedious and time-consuming process. In this paper we present preliminary results on automated video proctoring, which has the potential to significantly reduce manual effort and scale up digital assessment, while retaining good fraud detection. } }
11. Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees G. M. Snoek, "Actor and Action Video Segmentation from a Sentence," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018.
Oral presentation, top 2.1%
@INPROCEEDINGS{GavrilyukCVPR18,   author = {Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},   title = {Actor and Action Video Segmentation from a Sentence},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2018},   address = {Salt Lake City, USA},   pdf = {https://arxiv.org/abs/1803.07485},   software = {https://kgavrilyuk.github.io/publication/actor_action/},   data = {https://kgavrilyuk.github.io/publication/actor_action/},   note = {Oral presentation, top 2.1%},   abstract = { This paper strives for pixel-level segmentation of actors and their actions in video content. Different from existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This allows distinguishing between fine-grained actors in the same super-category, identifying actor and action instances, and segmenting pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage for traditional actor and action segmentation compared to the state-of-the-art. } }
12. Tom Runia, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Real-World Repetition Estimation by Div, Grad and Curl," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018.
Spotlight presentation, top 6.6%
@INPROCEEDINGS{RuniaCVPR18,   author = {Tom Runia and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Real-World Repetition Estimation by Div, Grad and Curl},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2018},   address = {Salt Lake City, USA},   pdf = {https://arxiv.org/abs/1802.09971},   software = {http://tomrunia.github.io/projects/repetition/},   data = {http://tomrunia.github.io/projects/repetition/},   demo = {https://www.youtube.com/watch?v=CSrai1_KOxE},   note = {Spotlight presentation, top 6.6%},   abstract = { We consider the problem of estimating repetition in video, such as performing push-ups, cutting a melon or playing violin. Existing work shows good results under the assumption of static and stationary periodicity. As realistic video is rarely perfectly static and stationary, the often preferred Fourier-based measurement is inapt. Instead, we adopt the wavelet transform to better handle non-static and non-stationary video dynamics. From the flow field and its differentials, we derive three fundamental motion types and three motion continuities of intrinsic periodicity in 3D. On top of this, the 2D perception of 3D periodicity considers two extreme viewpoints. What follows are 18 fundamental cases of recurrent perception in 2D. In practice, to deal with the variety of repetitive appearance, our theory implies measuring time-varying flow and its differentials (gradient, divergence and curl) over segmented foreground motion. For experiments, we introduce the new QUVA Repetition dataset, reflecting reality by including non-static and non-stationary videos. On the task of counting repetitions in video, we obtain favorable results compared to a deep learning alternative. } }
13. Shuai Liao, Efstratios Gavves, and Cees G. M. Snoek, "Searching and Matching Texture-free 3D Shapes in Images," in Proceedings of the ACM International Conference on Multimedia Retrieval, Yokohama, Japan, 2018.
@INPROCEEDINGS{LiaoICMR18,   author = {Shuai Liao and Efstratios Gavves and Cees G. M. Snoek},   title = {Searching and Matching Texture-free 3D Shapes in Images},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2018},   pages = {},   address = {Yokohama, Japan},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/liao-searching-3Dshapes-icmr2018.pdf},   abstract = { The goal of this paper is to search and match the best rendered view of a texture-free 3D shape to an object of interest in a 2D query image. Matching rendered views of 3D shapes to RGB images is challenging because, 1) 3D shapes are not always a perfect match for the image queries, 2) there is great domain difference between rendered and RGB images, and 3) estimating the object scale versus distance is inherently ambiguous in images from uncalibrated cameras. In this work we propose a deeply learned matching function that attacks these challenges and can be used for a search engine that finds the appropriate 3D shape and matches it to objects in 2D query images. We evaluate the proposed matching function and search engine with a series of controlled experiments on the 24 most populated vehicle categories in PASCAL3D+. We test the capability of the learned matching function in transferring to unseen 3D shapes and study overall search engine sensitivity w.r.t. available 3D shapes and object localization accuracy, showing promising results in retrieving 3D shapes given 2D image queries. } }
14. Pascal Mettes and Cees G. M. Snoek, "Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions," in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017.
Oral presentation, top 2.1%
@INPROCEEDINGS{MettesICCV17,   author = {Pascal Mettes and Cees G. M. Snoek},   title = {Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions},   booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},   month = {October},   year = {2017},   address = {Venice, Italy},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-spatial-aware-iccv2017.pdf},   software = {https://github.com/psmmettes/spatial-aware-object-embeddings},   note = {Oral presentation, top 2.1%},   abstract = { We aim for zero-shot localization and classification of human actions in video. Where traditional approaches rely on global attribute or object classification scores for their zero-shot knowledge transfer, our main contribution is a spatial-aware object embedding. To arrive at spatial awareness, we build our embedding on top of freely available actor and object detectors. Relevance of objects is determined in a word embedding space and further enforced with estimated spatial preferences. Besides local object awareness, we also embed global object awareness into our embedding to maximize actor and object interaction. Finally, we exploit the object positions and sizes in the spatial-aware embedding to demonstrate a new spatio-temporal action retrieval scenario with composite queries. Action localization and classification experiments on four contemporary action video datasets support our proposal. Apart from state-of-the-art results in the zero-shot localization and classification settings, our spatial-aware embedding is even competitive with recent supervised action localization alternatives. } }
15. Spencer Cappallo and Cees G. M. Snoek, "Future-Supervised Retrieval of Unseen Queries for Live Video," in Proceedings of the ACM International Conference on Multimedia, Mountain View, USA, 2017.
@INPROCEEDINGS{CappalloMM17,   author = {Spencer Cappallo and Cees G. M. Snoek},   title = {Future-Supervised Retrieval of Unseen Queries for Live Video},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {October},   year = {2017},   pages = {},   address = {Mountain View, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-future-supervised-mm2017.pdf},   abstract = { Live streaming video presents new challenges for retrieval and content understanding. Its live nature means that video representations should be relevant to current content, and not necessarily to past content. We investigate retrieval of previously unseen queries for live video content. Drawing from existing whole-video techniques, we focus on adapting image-trained semantic models to the video domain. We introduce the use of future frame representations as a supervision signal for learning temporally aware semantic representations on unlabeled video data. Additionally, we introduce an approach for broadening a query's representation within a pre-constructed semantic space, with the aim of increasing overlap between embedded visual semantics and the query semantics. We demonstrate the efficacy of these contributions for unseen query retrieval on live videos. We further explore their applicability to tasks such as no example, whole-video action classification and no-example live video action prediction, and demonstrate state-of-the-art results. } }
16. Pascal Mettes, Cees G. M. Snoek, and Shih-Fu Chang, "Localizing Actions from Video Labels and Pseudo-Annotations," in Proceedings of the British Machine Vision Conference, London, UK, 2017.
@INPROCEEDINGS{MettesBMVC17,   author = {Pascal Mettes and Cees G. M. Snoek and Shih-Fu Chang},   title = {Localizing Actions from Video Labels and Pseudo-Annotations},   booktitle = {Proceedings of the British Machine Vision Conference},   month = {September},   year = {2017},   pages = {},   address = {London, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-pseudo-annotations-bmvc2017.pdf},   abstract = { The goal of this paper is to determine the spatio-temporal location of actions in video. Where training from hard to obtain box annotations is the norm, we propose an intuitive and effective algorithm that localizes actions from their class label only. We are inspired by recent work showing that unsupervised action proposals selected with human point-supervision perform as well as using expensive box annotations. Rather than asking users to provide point supervision, we propose fully automatic visual cues that replace manual point annotations. We call the cues pseudo-annotations, introduce five of them, and propose a correlation metric for automatically selecting and combining them. Thorough evaluation on challenging action localization datasets shows that we reach results comparable to results with full box supervision. We also show that pseudo-annotations can be leveraged during testing to improve weakly- and strongly-supervised localizers. } }
17. Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Tracking by Natural Language Specification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017.
@INPROCEEDINGS{LiCVPR17,   author = {Zhenyang Li and Ran Tao and Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Tracking by Natural Language Specification},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {July},   year = {2017},   address = {Honolulu, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-tracking-language-cvpr2017.pdf},   abstract = { This paper strives to track a target object in a video. Rather than specifying the target in the first frame of a video by a bounding box, we propose to track the object based on a natural language specification of the target, which provides a more natural human-machine interaction as well as a means to improve tracking results. We define three variants of tracking by language specification: one relying on lingual target specification only, one relying on visual target specification based on language, and one leveraging their joint capacity. To show the potential of tracking by natural language specification we extend two popular tracking datasets with lingual descriptions and report experiments. Finally, we also sketch new tracking scenarios in surveillance and other live video streams that become feasible with a lingual specification of the target. } }
18. Thomas Mensink, Thomas Jongstra, Pascal Mettes, and Cees G. M. Snoek, "Music-Guided Video Summarization using Quadratic Assignments," in Proceedings of the ACM International Conference on Multimedia Retrieval, Bucharest, Romania, 2017.
@INPROCEEDINGS{MensinkICMR17,   author = {Thomas Mensink and Thomas Jongstra and Pascal Mettes and Cees G. M. Snoek},   title = {Music-Guided Video Summarization using Quadratic Assignments},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2017},   pages = {},   address = {Bucharest, Romania},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-music-video-summarization-icmr2017.pdf},   demo = {http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-music-video-summarization-icmr2017.mp4},   abstract = { This paper aims to automatically generate a summary of an unedited video, guided by an externally provided music-track. The tempo, energy and beats in the music determine the choices and cuts in the video summarization. To solve this challenging task, we model video summarization as a quadratic assignment problem. We assign frames to the summary, using rewards based on frame interestingness, plot coherency, audio-visual match, and cut properties. Experimentally we validate our approach on the SumMe dataset. The results show that our music guided summaries are more appealing, and even outperform the current state-of-the-art summarization methods when evaluated on the F1 measure of precision and recall. } }
19. Rama Kovvuri, Ram Nevatia, and Cees G. M. Snoek, "Segment-based Models for Event Detection and Recounting," in Proceedings of the International Conference on Pattern Recognition, Cancun, Mexico, 2016.
@INPROCEEDINGS{KovvuriICPR16,   author = {Rama Kovvuri and Ram Nevatia and Cees G. M. Snoek},   title = {Segment-based Models for Event Detection and Recounting},   booktitle = {Proceedings of the International Conference on Pattern Recognition},   month = {December},   year = {2016},   pages = {},   address = {Cancun, Mexico},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kovvuri-segment-models-icpr2016.pdf},   abstract = { We present a novel approach towards web video classification and recounting that uses video segments to model an event. This approach overcomes the limitations faced by the classical video-level models such as modeling semantics, identifying informative segments in a video and background segment suppression. We posit that segment-based models are able to identify both the frequently-occurring and rarer patterns in an event effectively, despite being trained on only a fraction of the training data. Our framework employs a discriminative approach to optimize our models in a distributed and data-driven fashion while maintaining semantic interpretability. We evaluate the effectiveness of our approach on the challenging TRECVID MEDTest 2014 dataset. We demonstrate improvements in recounting and classification, particularly in events characterized by inherent intra-class variations. } }
20. Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, and Cees G. M. Snoek, "Early Embedding and Late Reranking for Video Captioning," in Proceedings of the ACM International Conference on Multimedia, Amsterdam, The Netherlands, 2016.
Multimedia Grand Challenge winner
@INPROCEEDINGS{DongMM16,   author = {Jianfeng Dong and Xirong Li and Weiyu Lan and Yujia Huo and Cees G. M. Snoek},   title = {Early Embedding and Late Reranking for Video Captioning},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {October},   year = {2016},   pages = {},   address = {Amsterdam, The Netherlands},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/dong-captioning-mm2016.pdf},   note = {Multimedia Grand Challenge winner},   abstract = { This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for re-scoring generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules add a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the blind test by the organizers. Our system is ranked at the 4th place in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions. } }
21. Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees G. M. Snoek, and Tinne Tuytelaars, "Online Action Detection," in European Conference on Computer Vision, Amsterdam, The Netherlands, 2016.
@INPROCEEDINGS{GeestECCV16,   author = {Roeland De Geest and Efstratios Gavves and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek and Tinne Tuytelaars},   title = {Online Action Detection},   booktitle = {European Conference on Computer Vision},   month = {October},   year = {2016},   pages = {},   address = {Amsterdam, The Netherlands},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/geest-online-action-eccv2016.pdf},   data = {http://homes.esat.kuleuven.be/~rdegeest/TVSeries.html},   abstract = { In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data. } }
22. Pascal Mettes, Jan C. van Gemert, and Cees G. M. Snoek, "Spot On: Action Localization from Pointly-Supervised Proposals," in European Conference on Computer Vision, Amsterdam, The Netherlands, 2016.
Oral presentation, top 1.8%
@INPROCEEDINGS{MettesECCV16,   author = {Pascal Mettes and Jan C. van Gemert and Cees G. M. Snoek},   title = {Spot On: Action Localization from Pointly-Supervised Proposals},   booktitle = {European Conference on Computer Vision},   month = {October},   year = {2016},   pages = {},   address = {Amsterdam, The Netherlands},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-pointly-eccv2016.pdf},   data = {http://isis-data.science.uva.nl/mettes/hollywood2tubes.tar.gz},   note = {Oral presentation, top 1.8%},   abstract = { We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier demanding carefully annotated box annotations at train time. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at tinyurl.com/hollywood2tubes. } }
23. Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Video Stream Retrieval of Unseen Queries using Semantic Memory," in Proceedings of the British Machine Vision Conference, York, UK, 2016.
@INPROCEEDINGS{CappalloBMVC16,   author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},   title = {Video Stream Retrieval of Unseen Queries using Semantic Memory},   booktitle = {Proceedings of the British Machine Vision Conference},   month = {September},   year = {2016},   pages = {},   address = {York, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-videostream-bmvc2016.pdf},   abstract = { Retrieval of live, user-broadcast video streams is an under-addressed and increasingly relevant challenge. The on-line nature of the problem requires temporal evaluation and the unforeseeable scope of potential queries motivates an approach which can accommodate arbitrary search queries. To account for the breadth of possible queries, we adopt a no-example approach to query retrieval, which uses a query's semantic relatedness to pre-trained concept classifiers. To adapt to shifting video content, we propose memory pooling and memory welling methods that favor recent information over long past content. We identify two stream retrieval tasks, instantaneous retrieval at any particular time and continuous retrieval over a prolonged duration, and propose means for evaluating them. Three large scale video datasets are adapted to the challenge of stream retrieval. We report results for our search methods on the new stream retrieval tasks, as well as demonstrate their efficacy in a traditional, non-streaming video task. } }
24. Svetlana Kordumova, Thomas Mensink, and Cees G. M. Snoek, "Pooling Objects for Recognizing Scenes without Examples," in Proceedings of the ACM International Conference on Multimedia Retrieval, New York, USA, 2016.
Best paper award
@INPROCEEDINGS{KordumovaICMR16,   author = {Svetlana Kordumova and Thomas Mensink and Cees G. M. Snoek},   title = {Pooling Objects for Recognizing Scenes without Examples},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2016},   pages = {},   address = {New York, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-pooling-objects-icmr2016.pdf},   note = {Best paper award},   abstract = { In this paper we aim to recognize scenes in images without using any scene images as training data. Different from attribute based approaches, we do not carefully select the training classes to match the unseen scene classes. Instead, we propose a pooling over ten thousand off-the-shelf object classifiers. To steer the knowledge transfer between objects and scenes we learn a semantic embedding with the aid of a large social multimedia corpus. Our key contributions are: we are the first to investigate pooling over ten thousand object classifiers to recognize scenes without examples; we explore the ontological hierarchy of objects and analyze the influence of object classifiers from different hierarchy levels; we exploit object positions in scene images and we demonstrate a new scene retrieval scenario with complex queries. Finally, we outperform attribute representations on two challenging scene datasets, SUNAttributes and Places2. } }
25. Pascal Mettes, Dennis Koelma, and Cees G. M. Snoek, "The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection," in Proceedings of the ACM International Conference on Multimedia Retrieval, New York, USA, 2016.
@INPROCEEDINGS{MettesICMR16,   author = {Pascal Mettes and Dennis Koelma and Cees G. M. Snoek},   title = {The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2016},   pages = {},   address = {New York, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettest-imagenetshuffle-icmr2016.pdf},   data = {http://tinyurl.com/imagenetshuffle},   abstract = { This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, who all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pre-trained with our reorganized hierarchy i) improves over standard pre-training, ii) is complementary among different reorganizations, iii) maintains the benefits of fusion with other modalities, and iv) leads to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/imagenetshuffle. } }
26. Arnav Agharwal, Rama Kovvuri, Ram Nevatia, and Cees G. M. Snoek, "Tag-based Video Retrieval by Embedding Semantic Content in a Continuous Word Space," in IEEE Winter Conference on Applications of Computer Vision, Lake Placid, USA, 2016, pp. 1-8.
@INPROCEEDINGS{AgharwalWACV16,   author = {Arnav Agharwal and Rama Kovvuri and Ram Nevatia and Cees G. M. Snoek},   title = {Tag-based Video Retrieval by Embedding Semantic Content in a Continuous Word Space},   booktitle = {IEEE Winter Conference on Applications of Computer Vision},   month = {March},   year = {2016},   pages = {1--8},   address = {Lake Placid, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/agharwal-continuous-wacv2016.pdf},   abstract = { Content-based event retrieval in unconstrained web videos, based on query tags, is a hard problem due to large intra-class variances, and limited vocabulary and accuracy of the video concept detectors, creating a "semantic query gap". We present a technique to overcome this gap by using continuous word space representations to explicitly compute query and detector concept similarity. This not only allows for fast query-video similarity computation with implicit query expansion, but leads to a compact video representation, which allows implementation of a real-time retrieval system that can fit several thousand videos in a few hundred megabytes of memory. We evaluate the effectiveness of our representation on the challenging NIST MEDTest 2014 dataset. } }
27. Svetlana Kordumova, Jan C. van Gemert, and Cees G. M. Snoek, "Exploring the Long Tail of Social Media Tags," in International Conference on Multimedia Modelling, Miami, USA, 2016.
@INPROCEEDINGS{KordumovaMMM16,   author = {Svetlana Kordumova and Jan C. van Gemert and Cees G. M. Snoek},   title = {Exploring the Long Tail of Social Media Tags},   booktitle = {International Conference on Multimedia Modelling},   month = {January},   year = {2016},   pages = {},   address = {Miami, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-longtail-mmm2016.pdf},   abstract = { There are millions of users who tag multimedia content, generating a large vocabulary of tags. Some tags are frequent, while other tags are rarely used following a long tail distribution. For frequent tags, most of the multimedia methods that aim to automatically understand audio-visual content, give excellent results. It is not clear, however, how these methods will perform on rare tags. In this paper we investigate what social tags constitute the long tail and how they perform on two multimedia retrieval scenarios, tag relevance and detector learning. We show common valuable tags within the long tail, and by augmenting them with semantic knowledge, the performance of tag relevance and detector learning improves substantially. } }
28. Efstratios Gavves, Thomas Mensink, Tatiana Tommasi, Cees G. M. Snoek, and Tinne Tuytelaars, "Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
@INPROCEEDINGS{GavvesICCV15,   author = {Efstratios Gavves and Thomas Mensink and Tatiana Tommasi and Cees G. M. Snoek and Tinne Tuytelaars},   title = {Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks},   booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},   pages = {},   month = {December},   year = {2015},   address = {Santiago, Chile},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-zero-shot-priors-iccv2015.pdf},   abstract = { How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to the existing classifiers. By revisiting the dual formulation of adaptive SVM, we reveal two basic conditions to choose greedily only the most relevant samples to be annotated. On this basis we propose an effective active learning algorithm which learns the best possible target classification model with minimum human labeling effort. Extensive experiments on two challenging datasets show the value of our approach compared to the state-of-the-art active learning methodologies, as well as its potential to reuse past datasets with minimal effort for future tasks. } }
29. Mihir Jain, Jan C. van Gemert, Thomas Mensink, and Cees G. M. Snoek, "Objects2action: Classifying and localizing actions without any video example," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
@INPROCEEDINGS{JainICCV15,   author = {Mihir Jain and Jan C. van Gemert and Thomas Mensink and Cees G. M. Snoek},   title = {Objects2action: Classifying and localizing actions without any video example},   booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},   month = {December},   year = {2015},   pages = {},   address = {Santiago, Chile},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-objects2action-iccv2015.pdf},   data = {https://staff.fnwi.uva.nl/m.jain/projects/Objects2action.html},   abstract = { The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach. } }
30. Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Image2Emoji: Zero-shot Emoji Prediction for Visual Media," in Proceedings of the ACM International Conference on Multimedia, Brisbane, Australia, 2015.
@INPROCEEDINGS{CappalloMM15,   author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},   title = {Image2Emoji: Zero-shot Emoji Prediction for Visual Media},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {},   month = {October},   year = {2015},   address = {Brisbane, Australia},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-image2emoji-mm2015.pdf},   demo = {http://www.emoji2video.com},   abstract = { We present Image2Emoji, a multi-modal approach for generating emoji labels for an image in a zero-shot manner. Different from existing zero-shot image-to-text approaches, we exploit both image and textual media to learn a semantic embedding for the new task of emoji prediction. We propose that the widespread adoption of emoji suggests a semantic universality which is well-suited for interaction with visual media. We quantify the efficacy of our proposed model on the MSCOCO dataset, and demonstrate the value of visual, textual and multi-modal prediction of emoji. We conclude the paper with three examples of the application potential of emoji in the context of multimedia retrieval. } }
31. Jan van Gemert, Mihir Jain, Ella Gati, and Cees G. M. Snoek, "APT: Action localization proposals from dense trajectories," in Proceedings of the British Machine Vision Conference, Swansea, UK, 2015.
@INPROCEEDINGS{GemertBMVC15,   author = {Jan van Gemert and Mihir Jain and Ella Gati and Cees G. M. Snoek},   title = {{APT}: Action localization proposals from dense trajectories},   booktitle = {Proceedings of the British Machine Vision Conference},   month = {September},   year = {2015},   pages = {},   address = {Swansea, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-apt-proposals-bmvc2015-corrected.pdf},   software = {https://github.com/jvgemert/apt},   abstract = { This paper is on action localization in video with the aid of spatio-temporal proposals. To alleviate the computationally expensive segmentation step of existing proposals, we propose bypassing the segmentations completely by generating proposals directly from the dense trajectories used to represent videos during classification. Our Action localization Proposals from dense Trajectories (APT) use an efficient proposal generation algorithm to handle the high number of trajectories in a video. Our spatio-temporal proposals are faster than current methods and outperform the localization and classification accuracy of current proposals on the UCF Sports, UCF 101, and MSR-II video datasets. Corrected version: we fixed a mistake in our UCF-101 ground truth. Numbers are different; conclusions are unchanged. } }
32. Markus Nagel, Thomas Mensink, and Cees G. M. Snoek, "Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams," in Proceedings of the British Machine Vision Conference, Swansea, UK, 2015.
@INPROCEEDINGS{NagelBMVC15,   author = {Markus Nagel and Thomas Mensink and Cees G. M. Snoek},   title = {Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams},   booktitle = {Proceedings of the British Machine Vision Conference},   month = {September},   year = {2015},   pages = {},   address = {Swansea, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/nagel-event-fisher-bmvc2015.pdf},   abstract = { In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student's-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets. } }
33. Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Latent Factors of Visual Popularity Prediction," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
@INPROCEEDINGS{CappalloICMR15,   author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},   title = {Latent Factors of Visual Popularity Prediction},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2015},   pages = {},   address = {Shanghai, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-visual-popularity-icmr2015.pdf},   abstract = { Predicting the popularity of an image on social networks based solely on its visual content is a difficult problem. One image may become widely distributed and repeatedly shared, while another similar image may be totally overlooked. We aim to gain insight into how visual content affects image popularity. We propose a latent ranking approach that takes into account not only the distinctive visual cues in popular images, but also those in unpopular images. This method is evaluated on two existing datasets collected from photo-sharing websites, as well as a new proposed dataset of images from the microblogging website Twitter. Our experiments investigate factors of the ranking model, the level of user engagement in scoring popularity, and whether the discovered senses are meaningful. The proposed approach yields state of the art results, and allows for insight into the semantics of image popularity on social networks. } }
34. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "Discovering Semantic Vocabularies for Cross-Media Retrieval," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
@INPROCEEDINGS{HabibianICMR15,   author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},   title = {Discovering Semantic Vocabularies for Cross-Media Retrieval},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2015},   pages = {},   address = {Shanghai, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-semantic-vocabularies-icmr2015.pdf},   abstract = { This paper proposes a data-driven approach for cross-media retrieval by automatically learning its underlying semantic vocabulary. Different from the existing semantic vocabularies, which are manually pre-defined and annotated, we automatically discover the vocabulary concepts and their annotations from multimedia collections. To this end, we apply a probabilistic topic model on the text available in the collection to extract its semantic structure. Moreover, we propose a learning to rank framework, to effectively learn the concept classifiers from the extracted annotations. We evaluate the discovered semantic vocabulary for cross-media retrieval on three datasets of image/text and video/text pairs. Our experiments demonstrate that the discovered vocabulary does not require \emph{any} manual labeling to outperform three recent alternatives for cross-media retrieval. } }
35. Masoud Mazloom, Amirhossein Habibian, Dong Liu, Cees G. M. Snoek, and Shih-Fu Chang, "Encoding Concept Prototypes for Video Event Detection and Summarization," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
@INPROCEEDINGS{MazloomICMR15,   author = {Masoud Mazloom and Amirhossein Habibian and Dong Liu and Cees G. M. Snoek and Shih-Fu Chang},   title = {Encoding Concept Prototypes for Video Event Detection and Summarization},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2015},   pages = {},   address = {Shanghai, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-concept-prototypes-icmr2015.pdf},   abstract = { This paper proposes a new semantic video representation for few and zero example event detection and unsupervised video event summarization. Different from existing works, which obtain a semantic representation by training concepts over images or entire video clips, we propose an algorithm that learns a set of relevant frames as the concept prototypes from web video examples, without the need for frame-level annotations, and use them for representing an event video. We formulate the problem of learning the concept prototypes as seeking the frames closest to the densest region in the feature space of video frames from both positive and negative training videos of a target concept. We study the behavior of our video event representation based on concept prototypes by performing three experiments on challenging web videos from the TRECVID 2013 multimedia event detection task and the MED-summaries dataset. Our experiments establish that i) Event detection accuracy increases when mapping each video into concept prototype space. ii) Zero-example event detection increases by analyzing each frame of a video individually in concept prototype space, rather than considering the holistic videos. iii) Unsupervised video event summarization using concept prototypes is more accurate than using video-level concept detectors. } }
36. Pascal Mettes, Jan C. van Gemert, Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Bag-of-Fragments: Selecting and encoding video fragments for event detection and recounting," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
@INPROCEEDINGS{MettesICMR15,   author = {Pascal Mettes and Jan C. van Gemert and Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},   title = {Bag-of-Fragments: Selecting and encoding video fragments for event detection and recounting},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2015},   pages = {},   address = {Shanghai, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-bag-of-fragments-icmr2015.pdf},   abstract = { The goal of this paper is event detection and recounting using a representation of concept detector scores. Different from existing work, which encodes videos by averaging concept scores over all frames, we propose to encode videos using fragments that are discriminatively learned per event. Our bag-of-fragments split a video into semantically coherent fragment proposals. From training video proposals we show how to select the most discriminative fragment for an event. An encoding of a video is in turn generated by matching and pooling these discriminative fragments to the fragment proposals of the video. The bag-of-fragments forms an effective encoding for event detection and is able to provide a precise temporally localized event recounting. Furthermore, we show how bag-of-fragments can be extended to deal with irrelevant concepts in the event recounting. Experiments on challenging web videos show that i) our modest number of fragment proposals give a high sub-event recall, ii) bag-of-fragments is complementary to global averaging and provides better event detection, iii) bag-of-fragments with concept filtering yields a desirable event recounting. We conclude that fragments matter for video event detection and recounting. } }
37. Mihir Jain, Jan C. van Gemert, and Cees G. M. Snoek, "What do 15,000 object categories tell us about classifying and localizing actions?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015.
@INPROCEEDINGS{JainCVPR15,   author = {Mihir Jain and Jan C. van Gemert and Cees G. M. Snoek},   title = {What do 15,000 object categories tell us about classifying and localizing actions?},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2015},   pages = {},   address = {Boston, MA, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-objects-actions-cvpr2015.pdf},   data = {https://staff.fnwi.uva.nl/m.jain/projects/15kObjectsForAction.html},   abstract = { This paper contributes to automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are i) the first in-depth study of encoding objects for actions, ii) we show that objects matter for actions, and are often semantically relevant as well. iii) We establish that actions have object preferences. Rather than using all objects, selection is advantageous for action recognition. iv) We reveal that object-action relations are generic, which allows transferring these relationships from one domain to the other. And, v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization. } }
38. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events," in Proceedings of the ACM International Conference on Multimedia, Orlando, Florida, USA, 2014, pp. 17-26.
Best paper award
@INPROCEEDINGS{HabibianMM14,   author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},   title = {VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {17--26},   month = {November},   year = {2014},   address = {Orlando, Florida, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-videostory-mm2014.pdf},   note = {Best paper award},   abstract = { This paper proposes a new video representation for few-example event recognition and translation. Different from existing representations, which rely on either low-level features, or pre-specified attributes, we propose to learn an embedding from videos and their descriptions. In our embedding, which we call VideoStory, correlated term labels are combined if their combination improves the video classifier prediction. Our proposed algorithm prevents the combination of correlated terms which are visually dissimilar by optimizing a joint-objective balancing descriptiveness and predictability. The algorithm learns from textual descriptions of video content, which we obtain for free from the web by a simple spidering procedure. We use our VideoStory representation for few-example recognition of events on more than 65K challenging web videos from the NIST TRECVID event detection task and the Columbia Consumer Video collection. Our experiments establish that i) VideoStory outperforms an embedding without joint-objective and alternatives without any embedding, ii) The varying quality of input video descriptions from the web is compensated by harvesting more data, iii) VideoStory sets a new state-of-the-art for few-example event recognition, outperforming very recent attribute and low-level motion encodings. What is more, VideoStory translates a previously unseen video to its most likely description from visual content only. } }
39. Zhenyang Li, Efstratios Gavves, Thomas Mensink, and Cees G. M. Snoek, "Attributes Make Sense on Segmented Objects," in European Conference on Computer Vision, Zürich, Switzerland, 2014.
@INPROCEEDINGS{LiECCV14,   author = {Zhenyang Li and Efstratios Gavves and Thomas Mensink and Cees G. M. Snoek},   title = {Attributes Make Sense on Segmented Objects},   booktitle = {European Conference on Computer Vision},   pages = {},   month = {September},   year = {2014},   address = {Z\"urich, Switzerland},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-object-level-attributes-eccv2014.pdf},   abstract = { In this paper we aim for object classification and segmentation by attributes. Where existing work considers attributes either for the global image or for the parts of the object, we propose, as our first novelty, to learn and extract attributes on segments containing the entire object. Object-level attributes suffer less from accidental content around the object and accidental image conditions such as partial occlusions, scale changes and viewpoint changes. As our second novelty, we propose joint learning for simultaneous object classification and segment proposal ranking, solely on the basis of attributes. This naturally brings us to our third novelty: object-level attributes for zero-shot, where we use attribute descriptions of unseen classes for localizing their instances in new images and classifying them accordingly. Results on the Caltech UCSD Birds, Leeds Butterflies, and an a-Pascal subset demonstrate that i) extracting attributes on oracle object-level brings substantial benefits ii) our joint learning model leads to accurate attribute-based classification and segmentation, approaching the oracle results and iii) object-level attributes also allow for zero-shot classification and segmentation. We conclude that attributes make sense on segmented objects. } }
40. Mihir Jain, Jan C. van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees G. M. Snoek, "Action Localization by Tubelets from Motion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
@INPROCEEDINGS{JainCVPR14,   author = {Mihir Jain and Jan C. van Gemert and Herv\'e J\'egou and Patrick Bouthemy and Cees G. M. Snoek},   title = {Action Localization by Tubelets from Motion},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2014},   pages = {},   address = {Columbus, Ohio, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-tubelets-cvpr2014.pdf},   abstract = { This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in the context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-the-art on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences. } }
41. Thomas Mensink, Efstratios Gavves, and Cees G. M. Snoek, "COSTA: Co-Occurrence Statistics for Zero-Shot Classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
@INPROCEEDINGS{MensinkCVPR14,   author = {Thomas Mensink and Efstratios Gavves and Cees G. M. Snoek},   title = {COSTA: Co-Occurrence Statistics for Zero-Shot Classification},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2014},   pages = {},   address = {Columbus, Ohio, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-co-occurrence-cvpr2014.pdf},   abstract = { In this paper we aim for zero-shot classification, that is, visual recognition of an unseen class by using knowledge transfer from known classes. Our main contribution is COSTA, which exploits co-occurrences of visual concepts in images for knowledge transfer. These inter-dependencies arise naturally between concepts, and are easy to obtain from existing annotations or web-search hit counts. We estimate a classifier for a new label, as a weighted combination of related classes, using the co-occurrences to define the weight. We propose various metrics to leverage these co-occurrences, and a regression model for learning a weight for each related class. We also show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three multi-labeled datasets reveal that our proposed zero-shot methods are approaching and occasionally outperforming fully supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification. } }
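The core idea of COSTA, building a classifier for an unseen label as a co-occurrence-weighted combination of classifiers for known labels, can be sketched in a few lines. This is an illustrative toy sketch, not the authors' code: the function names, the toy labels and the co-occurrence counts are invented here; only the weighted-combination form follows the abstract.

```python
import numpy as np

def cooccurrence_weights(counts, unseen, seen):
    """Normalized co-occurrence of the unseen label with each seen label."""
    c = np.array([counts[(unseen, s)] for s in seen], dtype=float)
    total = c.sum()
    return c / total if total > 0 else c

def zero_shot_classifier(seen_weights, co_weights):
    """Weighted combination of the seen classes' weight vectors."""
    return co_weights @ seen_weights  # (k,) @ (k, d) -> (d,)

# Toy example: two seen classes with 2-D linear classifiers.
seen = ["car", "road"]
W = np.array([[1.0, 0.0], [0.0, 1.0]])  # one weight vector per seen class
counts = {("traffic", "car"): 30, ("traffic", "road"): 10}

w_new = zero_shot_classifier(W, cooccurrence_weights(counts, "traffic", seen))
score = w_new @ np.array([0.8, 0.2])  # score a feature vector for "traffic"
```

The paper additionally learns the weights with a regression model; the normalized counts above correspond to the simplest co-occurrence metric.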
42. Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Fisher and VLAD with FLAIR," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
@INPROCEEDINGS{SandeCVPR14,   author = {Koen E. A. van de Sande and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Fisher and VLAD with FLAIR},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2014},   pages = {},   address = {Columbus, Ohio, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-flair-cvpr2014.pdf},   abstract = { A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLADs difference coding, even with l2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The results are an 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge. } }
43. Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Locality in Generic Instance Search from One Example," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
@INPROCEEDINGS{TaoCVPR14,   author = {Ran Tao and Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Locality in Generic Instance Search from One Example},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   month = {June},   year = {2014},   pages = {},   address = {Columbus, Ohio, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/tao-locality-cvpr2014.pdf},   abstract = { This paper aims for generic instance search from a single example. Where the state-of-the-art relies on global image representation for the search, we proceed by including locality at all steps of the method. As the first novelty, we consider many boxes per database image as candidate targets to search locally in the picture using an efficient point-indexed representation. The same representation allows, as the second novelty, the application of very large vocabularies in the powerful Fisher vector and VLAD to search locally in the feature space. As the third novelty we propose an exponential similarity function to further emphasize locality in the feature space. Locality is advantageous in instance search as it will rest on the matching unique details. We demonstrate a substantial increase in generic instance search performance from one example on three standard datasets with buildings, logos, and scenes from 0.443 to 0.620 in mAP. } }
44. Julien van Hout, Eric Yeh, Dennis Koelma, Cees G. M. Snoek, Chen Sun, Ramakant Nevatia, Julie Wong, and Gregory Myers, "Late Fusion and Calibration for Multimedia Event Detection Using Few Examples," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, 2014.
@INPROCEEDINGS{vanHoutICASSP14,   author = {Julien van Hout and Eric Yeh and Dennis Koelma and Cees G. M. Snoek and Chen Sun and Ramakant Nevatia and Julie Wong and Gregory Myers},   title = {Late Fusion and Calibration for Multimedia Event Detection Using Few Examples},   booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing},   month = {May},   year = {2014},   pages = {},   address = {Florence, Italy},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/hout-fusion-calibration-icassp2014.pdf},   abstract = { The state-of-the-art in example-based multimedia event detection (MED) rests on heterogeneous classifiers whose scores are typically combined in a late-fusion scheme. Recent studies on this topic have failed to reach a clear consensus as to whether machine learning techniques can outperform rule-based fusion schemes with varying amount of training data. In this paper, we present two parametric approaches to late fusion: a normalization scheme for arithmetic mean fusion (logistic averaging) and a fusion scheme based on logistic regression, and compare them to widely used rule-based fusion schemes. We also describe how logistic regression can be used to calibrate the fused detection scores to predict an optimal threshold given a detection prior and costs on errors. We discuss the advantages and shortcomings of each approach when the amount of positives available for training varies from 10 positives (10Ex) to 100 positives (100Ex). Experiments were run using video data from the NIST TRECVID MED 2013 evaluation and results were reported in terms of a ranking metric: the mean average precision (mAP) and R0, a cost-based metric introduced in TRECVID MED 2013. } }
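The logistic-averaging scheme named in this abstract can be illustrated with a minimal sketch: per-system detection scores are passed through a logistic function before arithmetic-mean fusion, so systems with different score scales contribute comparably. This is a hedged toy sketch, not the authors' implementation; the function names and the fixed logistic parameters are invented here (in the paper such parameters would be fit on training data).

```python
import math

def logistic(x, a=1.0, b=0.0):
    """Map a raw score to (0, 1); a and b are placeholder parameters."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def fuse(scores_per_system):
    """Arithmetic mean of logistic-normalized scores for one video clip."""
    normalized = [logistic(s) for s in scores_per_system]
    return sum(normalized) / len(normalized)

# Three detectors' raw scores for one clip, on different scales.
fused = fuse([2.0, -1.0, 0.5])
```

The alternative in the paper, logistic-regression fusion, would instead learn a weight per system rather than averaging them uniformly.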
45. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "Composite Concept Discovery for Zero-Shot Video Event Detection," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
@INPROCEEDINGS{HabibianICMR14long,   author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},   title = {Composite Concept Discovery for Zero-Shot Video Event Detection},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2014},   pages = {},   address = {Glasgow, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-composite-icmr14.pdf},   abstract = { We consider automated detection of events in video without the use of any visual training examples. A common approach is to represent videos as classification scores obtained from a vocabulary of pre-trained concept classifiers. Where others construct the vocabulary by training individual concept classifiers, we propose to train classifiers for combination of concepts composed by Boolean logic operators. We call these concept combinations composite concepts and contribute an algorithm that automatically discovers them from existing video-level concept annotations. We discover composite concepts by jointly optimizing the accuracy of concept classifiers and their effectiveness for detecting events. We demonstrate that by combining concepts into composite concepts, we can train more accurate classifiers for the concept vocabulary, which leads to improved zero-shot event detection. Moreover, we demonstrate that by using different logic operators, namely "AND" and "OR", we discover different types of composite concepts, which are complementary for zero-shot event detection. We perform a search for 20 events in 41K web videos from two test sets of the challenging TRECVID Multimedia Event Detection 2013 corpus. The experiments demonstrate the superior performance of the discovered composite concepts, compared to present-day alternatives, for zero-shot event detection. } }
46. Amirhossein Habibian and Cees G. M. Snoek, "Stop-Frame Removal Improves Web Video Classification," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
@INPROCEEDINGS{HabibianICMR14short,   author = {Amirhossein Habibian and Cees G. M. Snoek},   title = {Stop-Frame Removal Improves Web Video Classification},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2014},   pages = {},   address = {Glasgow, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-stopframe-icmr14.pdf},   abstract = { Web videos available in sharing sites like YouTube are becoming an alternative to manually annotated training data, which are necessary for creating video classifiers. However, when looking into web videos, we observe they contain several irrelevant frames that may randomly appear in any video, i.e., blank and overexposed frames. We call these irrelevant frames stop-frames and propose a simple algorithm to identify and exclude them during classifier training. Stop-frames might appear in any video, so it is hard to recognize their category. Therefore we identify stop-frames as those frames which are commonly misclassified by any concept classifier. Our experiments demonstrate that using our algorithm improves classification accuracy by 60% and 24% in terms of mean average precision for an event and concept detection benchmark. } }
47. Masoud Mazloom, Xirong Li, and Cees G. M. Snoek, "Few-Example Video Event Retrieval Using Tag Propagation," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
@INPROCEEDINGS{MazloomICMR14,   author = {Masoud Mazloom and Xirong Li and Cees G. M. Snoek},   title = {Few-Example Video Event Retrieval Using Tag Propagation},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2014},   pages = {},   address = {Glasgow, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-tagpropagation-icmr14.pdf},   abstract = { An emerging topic in multimedia retrieval is to detect a complex event in video using only a handful of video examples. Different from existing work which learns a ranker from positive video examples and hundreds of negative examples, we aim to query web video for events using zero or only a few visual examples. To that end, we propose in this paper a tag-based video retrieval system which propagates tags from a tagged video source to an unlabeled video collection without the need of any training examples. Our algorithm is based on weighted frequency neighbor voting using concept vector similarity. Once tags are propagated to unlabeled video we can rely on off-the-shelf language models to rank these videos by the tag similarity. We study the behavior of our tag-based video event retrieval system by performing three experiments on web videos from the TRECVID multimedia event detection corpus, with zero, one and multiple query examples, and show that it beats a recent alternative. } }
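The weighted neighbor-voting step described in this abstract can be sketched as follows. This is an illustrative toy sketch under assumptions, not the authors' system: an unlabeled video receives tag scores from its k most similar tagged videos, with similarity computed between concept-detector score vectors; the function name, the cosine choice and the toy data are invented here.

```python
import numpy as np

def propagate_tags(query_vec, source_vecs, source_tags, k=2):
    """Return {tag: score} by cosine-similarity-weighted voting over k neighbors."""
    qn = query_vec / np.linalg.norm(query_vec)
    sims = [float(qn @ (v / np.linalg.norm(v))) for v in source_vecs]
    order = np.argsort(sims)[::-1][:k]  # indices of the k nearest tagged videos
    scores = {}
    for i in order:
        for tag in source_tags[i]:
            scores[tag] = scores.get(tag, 0.0) + sims[i]
    return scores

# Toy source collection: three tagged videos, represented by concept vectors.
vecs = [np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.1, 0.9])]
tags = [{"parade"}, {"parade", "marching"}, {"cooking"}]
result = propagate_tags(np.array([1.0, 0.0]), vecs, tags, k=2)
```

Here "parade" accumulates votes from both near neighbors, so it outscores "marching"; ranking the unlabeled videos by these tag scores is then straightforward.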
48. Chen Sun, Brian Burns, Ram Nevatia, Cees G. M. Snoek, Bob Bolles, Greg Myers, Wen Wang, and Eric Yeh, "ISOMER: Informative Segment Observations for Multimedia Event Recounting," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
@INPROCEEDINGS{SunICMR14,   author = {Chen Sun and Brian Burns and Ram Nevatia and Cees G. M. Snoek and Bob Bolles and Greg Myers and Wen Wang and Eric Yeh},   title = {ISOMER: Informative Segment Observations for Multimedia Event Recounting},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2014},   pages = {},   address = {Glasgow, UK},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sun-informative-segment-icmr14.pdf},   abstract = { This paper describes a system for multimedia event detection and recounting. The goal is to detect a high level event class in unconstrained web videos and generate event oriented summarization for display to users. For this purpose, we detect informative segments and collect observations for them, leading to our ISOMER system. We combine a large collection of both low level and semantic level visual and audio features for event detection. For event recounting, we propose a novel approach to identify event oriented discriminative video segments and their descriptions with a linear SVM event classifier. User friendly concepts including objects, actions, scenes, speech and optical character recognition are used in generating descriptions. We also develop several mapping and filtering strategies to cope with noisy concept detectors. Our system performed competitively in the TRECVID 2013 Multimedia Event Detection task with nearly 100,000 videos and was the highest performer in the TRECVID 2013 Multimedia Event Recounting task. } }
49. Efstratios Gavves, Basura Fernando, Cees G. M. Snoek, Arnold W. M. Smeulders, and Tinne Tuytelaars, "Fine-Grained Categorization by Alignments," in Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
@INPROCEEDINGS{GavvesICCV13,   author = {Efstratios Gavves and Basura Fernando and Cees G. M. Snoek and Arnold W. M. Smeulders and Tinne Tuytelaars},   title = {Fine-Grained Categorization by Alignments},   booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},   pages = {},   month = {December},   year = {2013},   address = {Sydney, Australia},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-fine-grained-alignment-iccv13.pdf},   abstract = { The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art. } }
50. Zhenyang Li, Efstratios Gavves, Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Codemaps Segment, Classify and Search Objects Locally," in Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
@INPROCEEDINGS{LiICCV13,   author = {Zhenyang Li and Efstratios Gavves and Koen E. A. van de Sande and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Codemaps Segment, Classify and Search Objects Locally},   booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},   pages = {},   month = {December},   year = {2013},   address = {Sydney, Australia},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-codemaps-iccv13.pdf},   abstract = { In this paper we aim for segmentation and classification of objects. We propose codemaps that are a joint formulation of the classification score and the local neighborhood it belongs to in the image. We obtain the codemap by reordering the encoding, pooling and classification steps over lattice elements. Other than existing linear decompositions, which emphasize only the efficiency benefits for localized search, we make three novel contributions. As a preliminary, we provide a theoretical generalization of the sufficient mathematical conditions under which image encodings and classification becomes locally decomposable. As first novelty we introduce l2 normalization for arbitrarily shaped image regions, which is fast enough for semantic segmentation using our Fisher codemaps. Second, using the same lattice across images, we propose kernel pooling which embeds nonlinearities into codemaps for object classification by explicit or approximate feature mappings. Results demonstrate that l2 normalized Fisher codemaps improve the state-of-the-art in semantic segmentation for PASCAL VOC. For object classification the addition of nonlinearities brings us on par with the state-of-the-art, but is 3x faster. Because of the codemaps' inherent efficiency, we can reach significant speed-ups for localized search as well. We exploit the efficiency gain for our third novelty: object segment retrieval using a single query image only. } }
51. Xirong Li and Cees G. M. Snoek, "Classifying Tag Relevance with Relevant Positive and Negative Examples," in Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 2013.
@INPROCEEDINGS{LiACM13,   author = {Xirong Li and Cees G. M. Snoek},   title = {Classifying Tag Relevance with Relevant Positive and Negative Examples},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {October},   year = {2013},   pages = {},   address = {Barcelona, Spain},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-classifying-tag-relevance-mm2013.pdf},   abstract = { Image tag relevance estimation aims to automatically determine what people label about images is factually present in the pictorial content. Different from previous works, which either use only positive examples of a given tag or use positive and random negative examples, we argue the importance of relevant positive and relevant negative examples for tag relevance estimation. We propose a system that selects positive and negative examples, deemed most relevant with respect to the given tag from crowd-annotated images. While applying models for many tags could be cumbersome, our system trains efficient ensembles of Support Vector Machines per tag, enabling fast classification. Experiments on two benchmark sets show that the proposed system compares favorably against five present day methods. Given extracted visual features, for each image our system can process up to 3,787 tags per second. The new system is both effective and efficient for tag relevance estimation. } }
52. Masoud Mazloom, Amirhossein Habibian, and Cees G. M. Snoek, "Querying for Video Events by Semantic Signatures from Few Examples," in Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 2013.
@INPROCEEDINGS{MazloomACM13,   author = {Masoud Mazloom and Amirhossein Habibian and Cees G. M. Snoek},   title = {Querying for Video Events by Semantic Signatures from Few Examples},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {October},   year = {2013},   pages = {},   address = {Barcelona, Spain},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-query-by-semantic-mm13.pdf},   abstract = { We aim to query web video for complex events using only a handful of video query examples, where the standard approach learns a ranker from hundreds of examples. We consider a semantic signature representation, consisting of off-the-shelf concept detectors, to capture the variance in semantic appearance of events. Since it is unknown what similarity metric and query fusion to use in such an event retrieval setting, we perform three experiments on unconstrained web videos from the TRECVID event detection task. It reveals that: retrieval with semantic signatures using normalized correlation as similarity metric outperforms a low-level bag-of-words alternative, multiple queries are best combined using late fusion with an average operator, and event retrieval is preferred over event classification when less than eight positive video examples are available. } }
53. Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek, "Evaluating Sources and Strategies for Learning Video Concepts from Social Media," in International Workshop on Content-Based Multimedia Indexing, Veszprém, Hungary, 2013.
@INPROCEEDINGS{KordumovaCBMI13,   author = {Svetlana Kordumova and Xirong Li and Cees G. M. Snoek},   title = {Evaluating Sources and Strategies for Learning Video Concepts from Social Media},   booktitle = {International Workshop on Content-Based Multimedia Indexing},   month = {June},   year = {2013},   pages = {},   address = {Veszpr\'em, Hungary},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-sources-strategies-cbmi2013.pdf},   abstract = { Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that two key questions, i.e., What visual tagging source is most suited for selecting positive training examples to learn video concepts? and What strategy should be used for selecting positive examples from tagged sources?, remain open. As an initial attempt to answer the two questions, we conduct an experimental study using a video search engine which is capable of learning concept detectors from social media, be it socially tagged videos or socially tagged images. Within the video search engine we investigate six strategies of positive examples selection. The performance is evaluated on the challenging TRECVID benchmark 2011 with 400 hours of Internet videos. The new experiments lead to novel and nontrivial findings: (1) tagged images are a better source for learning video concepts from the web, (2) selecting tag relevant examples as positives for learning video concepts is always beneficial and it can be done automatically and (3) the best source and strategy compare favorably against several present-day methods. } }
54. Amirhossein Habibian, Koen E. A. van de Sande, and Cees G. M. Snoek, "Recommendations for Video Event Recognition Using Concept Vocabularies," in Proceedings of the ACM International Conference on Multimedia Retrieval, Dallas, Texas, USA, 2013, pp. 89-96.
@INPROCEEDINGS{HabibianICMR13,   author = {Amirhossein Habibian and Koen E. A. van de Sande and Cees G. M. Snoek},   title = {Recommendations for Video Event Recognition Using Concept Vocabularies},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2013},   pages = {89--96},   address = {Dallas, Texas, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-vocabulary-recommendations-events-icmr2013.pdf},   data = {http://www.science.uva.nl/research/mediamill/datasets/index.php},   abstract = { Representing videos using vocabularies composed of concept detectors appears promising for event recognition. While many have recently shown the benefits of concept vocabularies for recognition, the important question what concepts to include in the vocabulary is ignored. In this paper, we study how to create an effective vocabulary for arbitrary event recognition in web video. We consider four research questions related to the number, the type, the specificity and the quality of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1,346 concept detectors trained on publicly available annotations, a dataset containing 13,274 web videos from the Multimedia Event Detection benchmark, 25 event groundtruth definitions, and a state-of-the-art event recognition pipeline allow us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should i) contain more than 200 concepts, ii) be diverse by covering object, action, scene, people, animal and attribute concepts, iii) include both general and specific concepts, and iv) increase the number of concepts rather than improve the quality of the individual detectors. We consider the recommendations for video event recognition using concept vocabularies the most important contribution of the paper, as they provide guidelines for future work. } }
55. Masoud Mazloom, Efstratios Gavves, Koen E. A. van de Sande, and Cees G. M. Snoek, "Searching Informative Concept Banks for Video Event Detection," in Proceedings of the ACM International Conference on Multimedia Retrieval, Dallas, Texas, USA, 2013, pp. 255-262.
@INPROCEEDINGS{MazloomICMR13,   author = {Masoud Mazloom and Efstratios Gavves and Koen E. A. van de Sande and Cees G. M. Snoek},   title = {Searching Informative Concept Banks for Video Event Detection},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2013},   pages = {255--262},   address = {Dallas, Texas, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-concept-banks-icmr2013.pdf},   abstract = { An emerging trend in video event detection is to learn an event from a bank of concept detector scores. Different from existing work, which simply relies on a bank containing all available detectors, we propose in this paper an algorithm that learns from examples what concepts in a bank are most informative per event. We model finding this bank of informative concepts out of a large set of concept detectors as a rare event search. Our proposed approximate solution finds the optimal concept bank using a cross-entropy optimization. We study the behavior of video event detection based on a bank of informative concepts by performing three experiments on more than 1,000 hours of arbitrary internet video from the TRECVID multimedia event detection task. Starting from a concept bank of 1,346 detectors we show that 1.) some concept banks are more informative than others for specific events, 2.) event detection using an automatically obtained informative concept bank is more robust than using all available concepts, 3.) even for small amounts of training examples an informative concept bank outperforms a full bank and a bag-of-word event representation, and 4.) we show qualitatively that the informative concept banks make sense for the events of interest, without being programmed to do so. We conclude that for concept banks it pays to be informative. } }
56. Davide Modolo and Cees G. M. Snoek, "Can Object Detectors Aid Internet Video Event Retrieval?," in Proceedings of the IS&T/SPIE Symposium on Electronic Imaging, San Francisco, CA, USA, 2013.
@INPROCEEDINGS{ModoloSPIE13,   author = {Davide Modolo and Cees G. M. Snoek},   title = {Can Object Detectors Aid Internet Video Event Retrieval?},   booktitle = {Proceedings of the IS\&T/SPIE Symposium on Electronic Imaging},   pages = {},   month = {February},   year = {2013},   address = {San Francisco, CA, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/modolo-object-event-spie2013.pdf},   abstract = { The problem of event representation for automatic event detection in Internet videos is acquiring increasing importance, due to its applicability to a large number of applications. Existing methods focus on representing events in terms of either low-level descriptors or domain-specific models suited for a limited class of video only, ignoring the high-level meaning of the events. Ultimately aiming for a more robust and meaningful representation, in this paper we question whether object detectors can aid video event retrieval. We propose an experimental study that investigates the utility of present-day local and global object detectors for video event search. By evaluating object detectors optimized for high-quality photographs on low-quality Internet video, we establish that present-day detectors can successfully be used for recognizing objects in web videos. We use an object-based representation to re-rank the results of an appearance-based event detector. Results on the challenging TRECVID multimedia event detection corpus demonstrate that objects can indeed aid event retrieval. While much remains to be studied, we believe that our experimental study is a first step towards revealing the potential of object-based event representations. } }
57. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Convex Reduction of High-Dimensional Kernels for Visual Classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, Rhode Island, USA, 2012.
@INPROCEEDINGS{GavvesCVPR12,   author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Convex Reduction of High-Dimensional Kernels for Visual Classification},   booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},   pages = {},   month = {June},   year = {2012},   address = {Providence, Rhode Island, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-convex-kernel-cvpr2012.pdf},   abstract = { Limiting factors of fast and effective classifiers for large sets of images are their dependence on the number of images analyzed and the dimensionality of the image representation. Considering the growing number of images as a given, we aim to reduce the image feature dimensionality in this paper. We propose reduced linear kernels that use only a portion of the dimensions to reconstruct a linear kernel. We formulate the search for these dimensions as a convex optimization problem, which can be solved efficiently. Different from existing kernel reduction methods, our reduced kernels are faster and maintain the accuracy benefits from non-linear embedding methods that mimic non-linear SVMs. We show these properties on both the Scenes and PASCAL VOC 2007 datasets. In addition, we demonstrate how our reduced kernels allow us to compress Fisher vectors for use with non-linear embeddings, leading to high accuracy. What is more, without using any labeled examples the selected and weighted kernel dimensions appear to correspond to visually meaningful patches in the images. } }
58. Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Fusing Concept Detection and Geo Context for Visual Search," in Proceedings of the ACM International Conference on Multimedia Retrieval, Hong Kong, China, 2012.
Best paper runner-up
@INPROCEEDINGS{LiICMR12,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},   title = {Fusing Concept Detection and Geo Context for Visual Search},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2012},   pages = {},   address = {Hong Kong, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-geo-context-icmr2012.pdf},   note = {Best paper runner-up},   abstract = { Given the proliferation of geo-tagged images, the question of how to exploit geo tags and the underlying geo context for visual search is emerging. Based on the observation that the importance of geo context varies over concepts, we propose a concept-based image search engine which fuses visual concept detection and geo context in a concept-dependent manner. Compared to individual content-based and geo-based concept detectors and their uniform combination, concept-dependent fusion shows improvements. Moreover, since the proposed search engine is trained on social-tagged images alone without the need of human interaction, it is flexible to cope with many concepts. Search experiments on 101 popular visual concepts justify the viability of the proposed solution. In particular, for 79 out of the 101 concepts, the learned weights yield improvements over the uniform weights, with a relative gain of at least 5\% in terms of average precision. } }
59. Daan T. J. Vreeswijk, Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders, "All Vehicles are Cars: Subclass Preferences in Container Concepts," in Proceedings of the ACM International Conference on Multimedia Retrieval, Hong Kong, China, 2012.
@INPROCEEDINGS{VreeswijkICMR12,   author = {Daan T. J. Vreeswijk and Koen E. A. van de Sande and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {All Vehicles are Cars: Subclass Preferences in Container Concepts},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {June},   year = {2012},   pages = {},   address = {Hong Kong, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/vreeswijk-vehicles-are-cars-icmr2012.pdf},   abstract = { This paper investigates the natural bias humans display when labeling images with a container label like vehicle or carnivore. Using three container concepts as subtree root nodes, and all available concepts between these roots and the images from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, we analyze the differences between the images labeled at these varying levels of abstraction and the union of their constituting leaf nodes. We find that for many container concepts, a strong preference for one or a few different constituting leaf nodes occurs. These results indicate that care is needed when using hierarchical knowledge in image classification: if the aim is to classify vehicles the way humans do, then cars and buses may be the only correct results. } }
60. Bauke Freiburg, Jaap Kamps, and Cees G. M. Snoek, "Crowdsourcing Visual Detectors for Video Search," in Proceedings of the ACM International Conference on Multimedia, Scottsdale, AZ, USA, 2011.
@INPROCEEDINGS{FreiburgACM11,   author = {Bauke Freiburg and Jaap Kamps and Cees G. M. Snoek},   title = {Crowdsourcing Visual Detectors for Video Search},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {December},   year = {2011},   pages = {},   address = {Scottsdale, AZ, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/freiburg-crowdsourcing-acm2011.pdf},   abstract = { In this paper, we study social tagging at the video fragment-level using a combination of automated content understanding and the wisdom of the crowds. We are interested in the question whether crowdsourcing can be beneficial to a video search engine that automatically recognizes video fragments on a semantic level. To answer this question, we perform a 3-month online field study with a concert video search engine targeted at a dedicated user-community of pop concert enthusiasts. We harvest the feedback of more than 500 active users and perform two experiments. In experiment 1 we measure user incentive to provide feedback, in experiment 2 we determine the tradeoff between feedback quality and quantity when aggregated over multiple users. Results show that users provide sufficient feedback, which becomes highly reliable when a crowd agreement of 67\% is enforced. } }
61. Xirong Li, Efstratios Gavves, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Personalizing Automated Image Annotation using Cross-Entropy," in Proceedings of the ACM International Conference on Multimedia, Scottsdale, AZ, USA, 2011.
@INPROCEEDINGS{LiACM11,   author = {Xirong Li and Efstratios Gavves and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},   title = {Personalizing Automated Image Annotation using Cross-Entropy},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {December},   year = {2011},   pages = {},   address = {Scottsdale, AZ, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-personalized-acm2011.pdf},   abstract = { Annotating the increasing amounts of user-contributed images in a personalized manner is in great demand. However, this demand is largely ignored by the mainstream of automated image annotation research. In this paper we aim for personalizing automated image annotation by jointly exploiting personalized tag statistics and content-based image annotation. We propose a cross-entropy based learning algorithm which personalizes a generic annotation model by learning from a user's multimedia tagging history. Using cross-entropy-minimization based Monte Carlo sampling, the proposed algorithm optimizes the personalization process in terms of a performance measurement which can be flexibly chosen. Automatic image annotation experiments with 5,315 realistic users in the social web show that the proposed method compares favorably to a generic image annotation method and a method using personalized tag statistics only. For 4,442 users the performance improves, where for 1,088 users the absolute performance gain is at least 0.05 in terms of average precision. The results show the value of the proposed method. } }
62. Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Social Negative Bootstrapping for Visual Categorization," in Proceedings of the ACM International Conference on Multimedia Retrieval, Trento, Italy, 2011.
@INPROCEEDINGS{LiICMR11,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},   title = {Social Negative Bootstrapping for Visual Categorization},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},   month = {April},   year = {2011},   pages = {},   address = {Trento, Italy},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-social-negative-icmr2011.pdf},   abstract = { To learn classifiers for many visual categories, obtaining labeled training examples in an efficient way is crucial. Since a classifier tends to misclassify negative examples which are visually similar to positive examples, inclusion of such informative negatives should be stressed in the learning process. However, they are unlikely to be hit by random sampling, the de facto standard in literature. In this paper, we go beyond random sampling by introducing a novel social negative bootstrapping approach. Given a visual category and a few positive examples, the proposed approach adaptively and iteratively harvests informative negatives from a large amount of social-tagged images. To label negative examples without human interaction, we design an effective virtual labeling procedure based on simple tag reasoning. Virtual labeling, in combination with adaptive sampling, enables us to select the most misclassified negatives as the informative samples. Learning from the positive set and the informative negative sets results in visual classifiers with higher accuracy. Experiments on two present-day image benchmarks employing 650K virtually labeled negative examples show the viability of the proposed approach. On a popular visual categorization benchmark our precision at 20 increases by 34\%, compared to baselines trained on randomly sampled negatives. We achieve more accurate visual categorization without the need of manually labeling any negatives. } }
63. Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin, "Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval," in International Conference on MultiMedia Modeling, Taipei, Taiwan, 2011.
@INPROCEEDINGS{HurstMMM11,   author = {Wolfgang H\"urst and Cees G. M. Snoek and Willem-Jan Spoel and Mate Tomin},   title = {Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval},   booktitle = {International Conference on MultiMedia Modeling},   month = {January},   year = {2011},   pages = {},   address = {Taipei, Taiwan},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huerst-size-matters-mmm2011.pdf},   demo = {http://vimeo.com/19595895},   abstract = { Various interfaces for video browsing and retrieval have been proposed that provide improved usability, better retrieval performance, and richer user experience compared to simple result lists that are just sorted by relevance. These browsing interfaces take advantage of the rather large screen estate on desktop and laptop PCs to visualize advanced configurations of thumbnails summarizing the video content. Naturally, the usefulness of such screen-intensive visual browsers can be called into question when applied on small mobile handheld devices, such as smart phones. In this paper, we address the usefulness of thumbnail images for mobile video retrieval interfaces. In particular, we investigate how thumbnail number, size, and motion influence the performance of humans in common recognition tasks. Contrary to the widespread belief that screens of handheld devices are unsuited for visualizing multiple (small) thumbnails simultaneously, our study shows that users are quite able to handle and assess multiple small thumbnails at the same time, especially when they show moving images. Our results give suggestions for appropriate video retrieval interface designs on handheld devices. } }
64. Efstratios Gavves and Cees G. M. Snoek, "Landmark Image Retrieval Using Visual Synonyms," in Proceedings of the ACM International Conference on Multimedia, Firenze, Italy, 2010.
@INPROCEEDINGS{GavvesACM10,   author = {Efstratios Gavves and Cees G. M. Snoek},   title = {Landmark Image Retrieval Using Visual Synonyms},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {October},   year = {2010},   pages = {},   address = {Firenze, Italy},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-synonyms-acm10.pdf},   abstract = { In this paper, we consider the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which performs assignment of words based solely on closeness in descriptor space, we focus on identifying pairs of independent, distant words -- the visual synonyms -- that are still likely to host image patches with similar appearance. To study this problem, we focus on landmark images, where we can examine whether image geometry is an appropriate vehicle for detecting visual synonyms. We propose an algorithm for the extraction of visual synonyms in landmark images. To show the merit of visual synonyms, we perform two experiments. We examine closeness of synonyms in descriptor space and we show a first application of visual synonyms in a landmark image retrieval setting. Using visual synonyms, we perform on par with the state-of-the-art, but with six times less visual words. } }
65. Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin, "Keep Moving! Revisiting Thumbnails for Mobile Video Retrieval," in Proceedings of the ACM International Conference on Multimedia, Firenze, Italy, 2010.
@INPROCEEDINGS{HurstACM10,   author = {Wolfgang H\"urst and Cees G. M. Snoek and Willem-Jan Spoel and Mate Tomin},   title = {Keep Moving! Revisiting Thumbnails for Mobile Video Retrieval},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   month = {October},   year = {2010},   pages = {},   address = {Firenze, Italy},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huerst-keep-moving-acm2010.pdf},   demo = {http://vimeo.com/19595895},   abstract = { Motivated by the increasing popularity of video on handheld devices and the resulting importance for effective video retrieval, this paper revisits the relevance of thumbnails in a mobile video retrieval setting. Our study indicates that users are quite able to handle and assess small thumbnails on a mobile's screen -- especially with moving images -- suggesting promising avenues for future research in design of mobile video retrieval interfaces. } }
66. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Accelerating Visual Categorization with the GPU," in ECCV Workshop on Computer Vision on GPU, Crete, Greece, 2010.
@INPROCEEDINGS{SandeCVGPU10,   author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},   title = {Accelerating Visual Categorization with the {GPU}},   booktitle = {{ECCV} Workshop on Computer Vision on {GPU}},   pages = {},   month = {September},   year = {2010},   address = {Crete, Greece},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-accelerating-categorization-CVGPU2010.pdf},   abstract = { Visual categorization is important to manage large collections of digital images and video, where textual meta-data is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to keep categorization accuracy intact and give the same numerical results. In the experiments on large scale datasets it is shown that, by using a parallel implementation on the GPU, quantization is 28 times faster and classification is 35 times faster than a single-threaded CPU version, while giving the exact same numerical results. The GPU accelerations are applicable to both the learning phase and the testing phase of visual categorization systems. For software visit http://www.colordescriptors.com/. } }
67. Bouke Huurnink, Cees G. M. Snoek, Maarten de Rijke, and Arnold W. M. Smeulders, "Today’s and Tomorrow’s Retrieval Practice in the Audiovisual Archive," in Proceedings of the ACM International Conference on Image and Video Retrieval, Xi’an, China, 2010, pp. 18-25.
Best paper runner-up
@INPROCEEDINGS{HuurninkCIVR10,   author = {Bouke Huurnink and Cees G. M. Snoek and Maarten {de Rijke} and Arnold W. M. Smeulders},   title = {Today's and Tomorrow's Retrieval Practice in the Audiovisual Archive},   booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},   pages = {18--25},   month = {July},   year = {2010},   address = {Xi'an, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huurnink-archive-civr2010.pdf},   data = {http://ilps.science.uva.nl/resources/avarchive},   note = {Best paper runner-up},   abstract = { Content-based video retrieval is maturing to the point where it can be used in real-world retrieval practices. One such practice is the audiovisual archive, whose users increasingly require fine-grained access to broadcast television content. We investigate to what extent content-based video retrieval methods can improve search in the audiovisual archive. In particular, we propose an evaluation methodology tailored to the specific needs and circumstances of the audiovisual archive, which are typically missed by existing evaluation initiatives. We utilize logged searches and content purchases from an existing audiovisual archive to create realistic query sets and relevance judgments. To reflect the retrieval practice of both the archive and the video retrieval community as closely as possible, our experiments with three video search engines incorporate archive-created catalog entries as well as state-of-the-art multimedia content analysis results. We find that incorporating content-based video retrieval into the archive's practice results in significant performance increases for shot retrieval and for retrieving entire television programs. Our experiments also indicate that individual content-based retrieval methods yield approximately equal performance gains. We conclude that the time has come for audiovisual archives to start accommodating content-based video retrieval methods into their daily practice. } }
68. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Unsupervised Multi-Feature Tag Relevance Learning for Social Image Retrieval," in Proceedings of the ACM International Conference on Image and Video Retrieval, Xi’an, China, 2010, pp. 10-17.
Best paper award
@INPROCEEDINGS{LiCIVR10,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},   title = {Unsupervised Multi-Feature Tag Relevance Learning for Social Image Retrieval},   booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},   pages = {10--17},   month = {July},   year = {2010},   address = {Xi'an, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-multifeature-civr10.pdf},   note = {Best paper award},   abstract = { Interpreting the relevance of a user-contributed tag with respect to the visual content of an image is an emerging problem in social image retrieval. In the literature this problem is tackled by analyzing the correlation between tags and images represented by specific visual features. Unfortunately, no single feature represents the visual content completely, e.g., global features are suitable for capturing the gist of scenes, while local features are better for depicting objects. To solve the problem of learning tag relevance given multiple features, we introduce in this paper two simple and effective methods: one is based on the classical Borda Count and the other is a method we name UniformTagger. Both methods combine the output of many tag relevance learners driven by diverse features in an unsupervised, rather than supervised, manner. Experiments on 3.5 million social-tagged images and two test sets verify our proposal. Using learned tag relevance as updated tag frequency for social image retrieval, both Borda Count and UniformTagger outperform retrieval without tag relevance learning and retrieval with single-feature tag relevance learning. Moreover, the two unsupervised methods are comparable to a state-of-the-art supervised alternative, but without the need of any training data. } }
69. Xirong Li and Cees G. M. Snoek, "Visual Categorization with Negative Examples for Free," in Proceedings of the ACM International Conference on Multimedia, Beijing, China, 2009.
@INPROCEEDINGS{LiACM09,   author = {Xirong Li and Cees G. M. Snoek},   title = {Visual Categorization with Negative Examples for Free},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {},   month = {October},   year = {2009},   address = {Beijing, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-negative-for-free-acm2009.pdf},   data = {http://staff.science.uva.nl/~xirong/neg4free/},   abstract = { Automatic visual categorization is critically dependent on labeled examples for supervised learning. As an alternative to traditional expert labeling, social-tagged multimedia is becoming a novel yet subjective and inaccurate source of learning examples. Different from existing work focusing on collecting positive examples, we study in this paper the potential of substituting social tagging for expert labeling for creating negative examples. We present an empirical study using 6.5 million Flickr photos as a source of social tagging. Our experiments on the PASCAL VOC challenge 2008 show that with a relative loss of only 4.3\% in terms of mean average precision, expert-labeled negative examples can be completely replaced by social-tagged negative examples for consumer photo categorization. } }
70. Arjan T. Setz and Cees G. M. Snoek, "Can Social Tagged Images Aid Concept-Based Video Search?," in Proceedings of the IEEE International Conference on Multimedia & Expo, New York, NY, USA, 2009, pp. 1460-1463.
@INPROCEEDINGS{SetzICME09,   author = {Arjan T. Setz and Cees G. M. Snoek},   title = {Can Social Tagged Images Aid Concept-Based Video Search?},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {1460--1463},   month = {June--July},   year = {2009},   address = {New York, NY, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/setz-social-tags-icme2009.pdf},   abstract = { This paper seeks to unravel whether commonly available social tagged images can be exploited as a training resource for concept-based video search. Since social tags are known to be ambiguous, overly personalized, and often error prone, we place special emphasis on the role of disambiguation. We present a systematic experimental study that evaluates concept detectors based on social tagged images, and their disambiguated versions, in three application scenarios: within-domain, cross-domain, and together with an interacting user. The results indicate that social tagged images can aid concept-based video search indeed, especially after disambiguation and when used in an interactive video retrieval setting. These results open-up interesting avenues for future research. } }
71. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Annotating Images by Harnessing Worldwide User-Tagged Photos," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, 2009.
@INPROCEEDINGS{LiICASSP09,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},   title = {Annotating Images by Harnessing Worldwide User-Tagged Photos},   booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing},   pages = {},   month = {April},   year = {2009},   address = {Taipei, Taiwan},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-annotating-images-icassp2009.pdf},   abstract = { Automatic image tagging is important yet challenging due to the semantic gap and the lack of learning examples to model a tag's visual diversity. Meanwhile, social user tagging is creating rich multimedia content on the web. In this paper, we propose to combine the two tagging approaches in a search-based framework. For an unlabeled image, we first retrieve its visual neighbors from a large user-tagged image database. We then select relevant tags from the result images to annotate the unlabeled image. To tackle the unreliability and sparsity of user tagging, we introduce a joint-modality tag relevance estimation method which efficiently addresses both textual and visual clues. Experiments on 1.5 million Flickr photos and 10 000 Corel images verify the proposed method. } }
72. Daragh Byrne, Aiden R. Doherty, Cees G. M. Snoek, Gareth J. F. Jones, and Alan F. Smeaton, "Validating the Detection of Everyday Concepts in Visual Lifelogs," in Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT 2008, Koblenz, Germany, December 3-5, 2008, pp. 15-30.
@INPROCEEDINGS{ByrneSAMT08,   author = {Daragh Byrne and Aiden R. Doherty and Cees G. M. Snoek and Gareth J. F. Jones and Alan F. Smeaton},   title = {Validating the Detection of Everyday Concepts in Visual Lifelogs},   booktitle = {Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT 2008, Koblenz, Germany, December 3-5, 2008},   editor = {David Duke and Lynda Hardman and Alex Hauptmann and Dietrich Paulus and Steffen Staab},   series = {LNCS},   volume = {5392},   pages = {15--30},   year = {2008},   publisher = {Springer-Verlag},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/byrne-everyday-concepts-samt2008.pdf},   address = {},   abstract = { The Microsoft SenseCam is a small lightweight wearable camera used to passively capture photos and other sensor readings from a user's day-to-day activities. It can capture up to 3,000 images per day, equating to almost 1 million images per year. It is used to aid memory by creating a personal multimedia lifelog, or visual recording of the wearer's life. However the sheer volume of image data captured within a visual lifelog creates a number of challenges, particularly for locating relevant content. Within this work, we explore the applicability of semantic concept detection, a method often used within video retrieval, on the novel domain of visual lifelogs. A concept detector models the correspondence between low-level visual features and high-level semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised machine learning. By doing so it determines the probability of a concept's presence. We apply detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users. The results were then evaluated on a subset of 95,907 images, to determine the precision for detection of each semantic concept and to draw some interesting inferences on the lifestyles of those 5 users. We additionally present future applications of concept detection within the domain of lifelogging. } }
73. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Learning Tag Relevance by Neighbor Voting for Social Image Retrieval," in Proceedings of the ACM International Conference on Multimedia Information Retrieval, Vancouver, Canada, 2008, pp. 180-187.
@INPROCEEDINGS{LiMIR08,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},   title = {Learning Tag Relevance by Neighbor Voting for Social Image Retrieval},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia Information Retrieval},   pages = {180--187},   month = {October},   year = {2008},   address = {Vancouver, Canada},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-tag-relevance-mir2008.pdf},   data = {http://staff.science.uva.nl/~xirong/tagrel/},   abstract = { Social image retrieval is important for exploiting the increasing amounts of amateur-tagged multimedia such as Flickr images. Since amateur tagging is known to be uncontrolled, ambiguous, and personalized, a fundamental problem is how to reliably interpret the relevance of a tag with respect to the visual content it is describing. Intuitively, if different persons label similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Starting from this intuition, we propose a novel algorithm that scalably and reliably learns tag relevance by accumulating votes from visually similar neighbors. Further, treated as tag frequency, learned tag relevance is seamlessly embedded into current tag-based social image retrieval paradigms. Preliminary experiments on one million Flickr images demonstrate the potential of the proposed algorithm. Overall comparisons for both single-word queries and multiple-word queries show substantial improvement over the baseline by learning and using tag relevance. Specifically, compared with the baseline using the original tags, on average, retrieval using improved tags increases mean average precision by 24\%, from 0.54 to 0.67. Moreover, simulated experiments indicate that performance can be improved further by scaling up the amount of images used in the proposed neighbor voting algorithm. } }
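The neighbor-voting idea summarized in the abstract above can be sketched compactly. The snippet below is a hedged illustration, not the authors' implementation: the function name `tag_relevance` and its arguments are ours. Each tag of an image collects one vote from every visually similar neighbor that also carries it, and a frequency prior is subtracted so that tags which are merely common in the database are not favored.

```python
from collections import Counter

def tag_relevance(image_tags, neighbor_tag_lists, num_db_images, tag_counts):
    """Sketch of neighbor-voting tag relevance: neighbor votes minus a prior.

    image_tags         -- tags attached to the query image
    neighbor_tag_lists -- tag lists of its k visually similar neighbors
    num_db_images      -- size of the tagged image database
    tag_counts         -- database-wide frequency of each tag
    """
    k = len(neighbor_tag_lists)
    votes = Counter()
    for tags in neighbor_tag_lists:
        votes.update(set(tags))  # one vote per neighbor, regardless of repeats
    # Subtract the number of votes a tag would receive by chance alone.
    return {t: votes[t] - k * tag_counts.get(t, 0) / num_db_images
            for t in image_tags}
```

A tag that many neighbors agree on, yet is rare in the database at large, ends up with a high score; such learned scores can then stand in for raw tag frequency in tag-based retrieval, as the abstract describes.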
74. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Balancing Thread Based Navigation for Targeted Video Search," in Proceedings of the ACM International Conference on Image and Video Retrieval, Niagara Falls, Canada, 2008, pp. 485-494.
@INPROCEEDINGS{RooijCIVR08,   author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},   title = {Balancing Thread Based Navigation for Targeted Video Search},   booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},   pages = {485--494},   month = {July},   year = {2008},   address = {Niagara Falls, Canada},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/rooij-thread-based-navigation-civr2008.pdf},   abstract = { Various query methods for video search exist. Because of the semantic gap each method has its limitations. We argue that for effective retrieval query methods need to be combined at retrieval time. However, switching query methods often involves a change in query and browsing interface, which puts a heavy burden on the user. In this paper, we propose a novel method for fast and effective search through large video collections by embedding multiple query methods into a single browsing environment. To that end we introduced the notion of query threads, which contain a shot-based ranking of the video collection according to some feature-based similarity measure. On top of these threads we define several thread-based visualizations, ranging from fast targeted search to very broad exploratory search, with the ForkBrowser as the balance between fast search and video space exploration. We compare the effectiveness and efficiency of the ForkBrowser with the CrossBrowser on the TRECVID 2007 interactive search task. Results show that different query methods are needed for different types of search topics, and that the ForkBrowser requires significantly less user interactions to achieve the same result as the CrossBrowser. In addition, both browsers rank among the best interactive retrieval systems currently available. } }
75. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "A Comparison of Color Features for Visual Concept Classification," in Proceedings of the ACM International Conference on Image and Video Retrieval, Niagara Falls, Canada, 2008, pp. 141-149.
@INPROCEEDINGS{SandeCIVR08,   author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},   title = {A Comparison of Color Features for Visual Concept Classification},   booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},   pages = {141--149},   month = {July},   year = {2008},   address = {Niagara Falls, Canada},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-colorfeatures-civr2008.pdf},   software = {http://staff.science.uva.nl/~ksande/research/colordescriptors/},   abstract = { Concept classification is important to access visual information on the level of objects and scene types. So far, intensity-based features have been widely used. To increase discriminative power, color features have been proposed only recently. As many features exist, a structured overview is required of color features in the context of concept classification. Therefore, this paper studies 1. the invariance properties and 2. the distinctiveness of color features in a structured way. The invariance properties of color features with respect to photometric changes are summarized. The distinctiveness of color features is assessed experimentally using an image and a video benchmark: the PASCAL VOC Challenge 2007 and the Mediamill Challenge. Because color features cannot be studied independently from the points at which they are extracted, different point sampling strategies based on Harris-Laplace salient points, dense sampling and the spatial pyramid are also studied. From the experimental results, it can be derived that invariance to light intensity changes and light color changes affects concept classification. The results reveal further that the usefulness of invariance is concept-specific. } }
76. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Evaluation of Color Descriptors for Object and Scene Recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, 2008.
@INPROCEEDINGS{SandeCVPR08,   author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},   title = {Evaluation of Color Descriptors for Object and Scene Recognition},   booktitle = {Proceedings of the {IEEE} Computer Society Conference on Computer Vision and Pattern Recognition},   pages = {},   month = {June},   year = {2008},   address = {Anchorage, Alaska, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-colordescriptors-cvpr2008.pdf},   software = {http://staff.science.uva.nl/~ksande/research/colordescriptors/},   abstract = { Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used. To increase illumination invariance and discriminative power, color descriptors have been proposed only recently. As many descriptors exist, a structured overview of color invariant descriptors in the context of image category recognition is required. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors in a structured way. The invariance properties of color descriptors are shown analytically using a taxonomy based on invariance properties with respect to photometric transformations. The distinctiveness of color descriptors is assessed experimentally using two benchmarks from the image domain and the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results reveal further that, for light intensity changes, the usefulness of invariance is category-specific. } }
77. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Color Descriptors for Object Category Recognition," in Proceedings of the IS&T European Conference on Colour in Graphics, Imaging, and Vision, Terrassa-Barcelona, Spain, 2008.
@INPROCEEDINGS{SandeCGIV08,   author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},   title = {Color Descriptors for Object Category Recognition},   booktitle = {Proceedings of the {IS\&T} European Conference on Colour in Graphics, Imaging, and Vision},   pages = {},   month = {June},   year = {2008},   address = {Terrassa-Barcelona, Spain},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-color-descriptors-cgiv2008.pdf},   abstract = { Category recognition is important to access visual information on the level of objects. A common approach is to compute image descriptors first and then to apply machine learning to achieve category recognition from annotated examples. As a consequence, the choice of image descriptors is of great influence on the recognition accuracy. So far, intensity-based (e.g. SIFT) descriptors computed at salient points have been used. However, color has been largely ignored. The question is, can color information improve the accuracy of category recognition? Therefore, in this paper, we will extend both salient point detection and region description with color information. The extension of color descriptors is integrated into the framework of category recognition, enabling the selection of both intensity and color variants. Our experiments on an image benchmark show that category recognition benefits from the use of color. Moreover, the combination of intensity and color descriptors yields a 30\% improvement over intensity features alone. } }
78. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Query on Demand Video Browsing," in Proceedings of the ACM International Conference on Multimedia, Augsburg, Germany, 2007, pp. 811-814.
@INPROCEEDINGS{RooijACM07,   author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},   title = {Query on Demand Video Browsing},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {811--814},   month = {September},   year = {2007},   address = {Augsburg, Germany},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/rooij-rotor-acm2007.pdf},   abstract = { This paper describes a novel method for browsing a large collection of news video by linking various forms of related video fragments together as threads. Each thread contains a sequence of shots with high feature-based similarity. Two interfaces are designed which use threads as the basis for browsing. One interface shows a minimal set of threads, and the other as many as possible. Both interfaces are evaluated in the TRECVID interactive retrieval task, where they ranked among the best interactive retrieval systems currently available. The results indicate that the use of threads in interactive video search is very beneficial. We have found that in general the query result and the timeline are the most important threads. However, having several additional threads allow a user to find unique results which cannot easily be found by using query results and time alone. } }
79. Arnold W. M. Smeulders, Jan C. van Gemert, Bouke Huurnink, Dennis C. Koelma, Ork de Rooij, Koen E. A. van de Sande, Cees G. M. Snoek, Cor J. Veenman, and Marcel Worring, "Semantic Video Search," in International Conference on Image Analysis and Processing, Modena, Italy, 2007.
@INPROCEEDINGS{SmeuldersICIAP07,   author = {Arnold W. M. Smeulders and Jan C. van Gemert and Bouke Huurnink and Dennis C. Koelma and Ork de Rooij and Koen E. A. van de Sande and Cees G. M. Snoek and Cor J. Veenman and Marcel Worring},   title = {Semantic Video Search},   booktitle = {International Conference on Image Analysis and Processing},   pages = {},   month = {September},   year = {2007},   address = {Modena, Italy},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/smeulders-search-iciap2007.pdf},   abstract = { In this paper we describe the current performance of our MediaMill system as presented in the TRECVID 2006 benchmark for video search engines. The MediaMill team participated in two tasks: concept detection and search. For concept detection we use the MediaMill Challenge as experimental platform. The MediaMill Challenge divides the generic video indexing problem into a visual-only, textual-only, early fusion, late fusion, and combined analysis experiment. We provide a baseline implementation for each experiment together with baseline results. We extract image features, on global, regional, and keypoint level, which we combine with various supervised learners. A late fusion approach of visual-only analysis methods using geometric mean was our most successful run. With this run we conquer the Challenge baseline by more than 50\%. Our concept detection experiments have resulted in the best score for three concepts: i.e. \emph{desert},   \emph{flag us},   and \emph{charts}. What is more, using LSCOM annotations, our visual-only approach generalizes well to a set of 491 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which allows users to select relevant concept detectors based on interactive browsing using advanced visualizations. Similar to previous years our best interactive search runs yield top performance, ranking 2nd and 6th overall. } }
80. Cees G. M. Snoek, Marcel Worring, Arnold W. M. Smeulders, and Bauke Freiburg, "The Role of Visual Content and Style for Concert Video Indexing," in Proceedings of the IEEE International Conference on Multimedia & Expo, Beijing, China, 2007, pp. 252-255.
@INPROCEEDINGS{SnoekICME07b,   author = {Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders and Bauke Freiburg},   title = {The Role of Visual Content and Style for Concert Video Indexing},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {252--255},   month = {July},   year = {2007},   address = {Beijing, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-fabchannel-icme2007.pdf},   abstract = { This paper contributes to the automatic indexing of concert video. In contrast to traditional methods, which rely primarily on audio information for summarization applications, we explore how a visual-only concept detection approach could be employed. We investigate how our recent method for news video indexing -- which takes into account the role of content and style -- generalizes to the concert domain. We analyze concert video on three levels of visual abstraction, namely: content, style, and their fusion. Experiments with 12 concept detectors, on 45 hours of visually challenging concert video, show that the automatically learned best approach is concept-dependent. Moreover, these results suggest that the visual modality provides ample opportunity for more effective indexing and retrieval of concert video when used in addition to the auditory modality. } }
81. Cees G. M. Snoek and Marcel Worring, "Are Concept Detector Lexicons Effective for Video Search?," in Proceedings of the IEEE International Conference on Multimedia & Expo, Beijing, China, 2007, pp. 1966-1969.
@INPROCEEDINGS{SnoekICME07a,   author = {Cees G. M. Snoek and Marcel Worring},   title = {Are Concept Detector Lexicons Effective for Video Search?},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {1966--1969},   month = {July},   year = {2007},   address = {Beijing, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-concept-icme2007.pdf},   abstract = { Until now, systematic studies on the effectiveness of concept detectors for video search have been carried out using less than 20 detectors, or in combination with other retrieval techniques. We investigate whether video search using just large concept detector lexicons is a viable alternative for present day approaches. We demonstrate that increasing the number of concept detectors in a lexicon yields improved video retrieval performance indeed. In addition, we show that combining concept detectors at query time has the potential to boost performance further. We obtain the experimental evidence on the automatic video search task of TRECVID 2005 using 363 machine learned concept detectors. } }
82. Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang P. Nguyen, and Arnold W. M. Smeulders, "The MediaMill Semantic Video Search Engine," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Honolulu, Hawaii, USA, 2007, pp. 1213-1216.
@INPROCEEDINGS{WorringICASSP07,   author = {Marcel Worring and Cees G. M. Snoek and Ork de Rooij and Giang P. Nguyen and Arnold W. M. Smeulders},   title = {The {MediaMill} Semantic Video Search Engine},   booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing},   volume = {4},   pages = {1213--1216},   month = {April},   year = {2007},   address = {Honolulu, Hawaii, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/worring-mediamill-icassp2007.pdf},   abstract = { In this paper we present the methods underlying the MediaMill semantic video search engine. The basis for the engine is a semantic indexing process which is currently based on a lexicon of 491 concept detectors. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. We compare the different browsers and their utility within the TRECVID benchmark. In 2005, we obtained a top-3 result for 19 out of 24 search topics; in 2006, for 14 out of 24. } }
83. Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W. M. Smeulders, "The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia," in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, 2006, pp. 421-430.
@INPROCEEDINGS{SnoekACM06,   author = {Cees G. M. Snoek and Marcel Worring and Jan C. van Gemert and Jan-Mark Geusebroek and Arnold W. M. Smeulders},   title = {The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {421--430},   month = {October},   year = {2006},   address = {Santa Barbara, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-challenge-acm2006.pdf},   data = {http://www.mediamill.nl/challenge/},   abstract = { We introduce the challenge problem for generic video indexing to gain insight in intermediate steps that affect performance of multimedia analysis methods, while at the same time fostering repeatability of experiments. To arrive at a challenge problem, we provide a general scheme for the systematic examination of automated concept detection methods, by decomposing the generic video indexing problem into 2 unimodal analysis experiments, 2 multimodal analysis experiments, and 1 combined analysis experiment. For each experiment, we evaluate generic video indexing performance on 85 hours of international broadcast news data, from the TRECVID 2005/2006 benchmark, using a lexicon of 101 semantic concepts. By establishing a minimum performance on each experiment, the challenge problem allows for component-based optimization of the generic indexing issue, while simultaneously offering other researchers a reference for comparison during indexing methodology development. To stimulate further investigations in intermediate analysis steps that influence video indexing performance, the challenge offers to the research community a manually annotated concept lexicon, pre-computed low-level multimedia features, trained classifier models, and five experiments together with baseline performance, which are all available at http://www.mediamill.nl/challenge/. } }
84. Jan C. van Gemert, Cees G. M. Snoek, Cor Veenman, and Arnold W. M. Smeulders, "The Influence of Cross-Validation on Video Classification Performance," in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, 2006, pp. 695-698.
@INPROCEEDINGS{GemertACM06,   author = {Jan C. van Gemert and Cees G. M. Snoek and Cor Veenman and Arnold W. M. Smeulders},   title = {The Influence of Cross-Validation on Video Classification Performance},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {695--698},   month = {October},   year = {2006},   address = {Santa Barbara, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-crossvalidation-acm2006.pdf},   abstract = { Digital video is sequential in nature. When video data is used in a semantic concept classification task, the episodes are usually summarized with shots. The shots are annotated as containing, or not containing, a certain concept resulting in a labeled dataset. These labeled shots can subsequently be used by supervised learning methods (classifiers) where they are trained to predict the absence or presence of the concept in unseen shots and episodes. The performance of such automatic classification systems is usually estimated with cross-validation. By taking random samples from the dataset for training and testing as such, part of the shots from an episode are in the training set and another part from the same episode is in the test set. Accordingly, data dependence between training and test set is introduced, resulting in too optimistic performance estimates. In this paper, we experimentally show this bias, and propose how this bias can be prevented using "episode-constrained" cross-validation. Moreover, we show that a 15\% higher classifier performance can be achieved by using episode constrained cross-validation for classifier parameter tuning. } }
85. Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang P. Nguyen, and Dennis C. Koelma, "Lexicon-based Browsers for Searching in News Video Archives," in Proceedings of the International Conference on Pattern Recognition, Hong Kong, China, 2006, pp. 1256-1259.
@INPROCEEDINGS{WorringICPR06,   author = {Marcel Worring and Cees G. M. Snoek and Ork de Rooij and Giang P. Nguyen and Dennis C. Koelma},   title = {Lexicon-based Browsers for Searching in News Video Archives},   booktitle = {Proceedings of the International Conference on Pattern Recognition},   pages = {1256--1259},   month = {August},   year = {2006},   address = {Hong Kong, China},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/worring-browsers-icpr2006.pdf},   abstract = { In this paper we present the methods and visualizations used in the MediaMill video search engine. The basis for the engine is a semantic indexing process which derives a lexicon of 101 concepts. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. The search system is evaluated within the TRECVID benchmark. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. } }
86. Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders, "Learned Lexicon-driven Interactive Video Retrieval," in Proceedings of the International Conference on Image and Video Retrieval (CIVR 2006), Tempe, Arizona, USA, 2006, pp. 11-20.
@INPROCEEDINGS{SnoekCIVR06,   author = {Cees G. M. Snoek and Marcel Worring and Dennis C. Koelma and Arnold W. M. Smeulders},   title = {Learned Lexicon-driven Interactive Video Retrieval},   booktitle = {Proceedings of the International Conference on Image and Video Retrieval, CIVR 2006, Tempe, Arizona, July 13-15, 2006},   editor = {H. Sundaram and others},   series = {LNCS},   volume = {4071},   pages = {11--20},   publisher = {Springer-Verlag},   address = {Heidelberg, Germany},   month = {July},   year = {2006},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-lexicon-civr2006.pdf},   demo = {http://isis-data.science.uva.nl/cgmsnoek/index.php/demonstrations/mediamill/},   abstract = { We combine in this paper automatic learning of a large lexicon of semantic concepts with traditional video retrieval methods into a novel approach to narrow the semantic gap. The core of the proposed solution is formed by the automatic detection of an unprecedented lexicon of 101 concepts. From there, we explore the combination of query-by-concept, query-by-example, query-by-keyword, and user interaction into the \emph{MediaMill} semantic video search engine. We evaluate the search engine against the 2005 NIST TRECVID video retrieval benchmark, using an international broadcast news archive of 85 hours. Top ranking results show that the lexicon-driven search engine is highly effective for interactive video retrieval. } }
87. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, and Arnold W. M. Smeulders, "The Semantic Pathfinder for Generic News Video Indexing," in Proceedings of the IEEE International Conference on Multimedia & Expo, Toronto, Canada, 2006, pp. 1469-1472.
@INPROCEEDINGS{SnoekICME06,   author = {Cees G. M. Snoek and Marcel Worring and Jan-Mark Geusebroek and Dennis C. Koelma and Frank J. Seinstra and Arnold W. M. Smeulders},   title = {The Semantic Pathfinder for Generic News Video Indexing},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {1469--1472},   month = {July},   year = {2006},   address = {Toronto, Canada},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-pathfinder-icme2006.pdf},   abstract = { This paper presents the semantic pathfinder architecture for generic indexing of video archives. The pathfinder automatically extracts semantic concepts from video based on the exploration of different paths through three consecutive analysis steps, closely linked to the video production process, namely: content analysis, style analysis, and context analysis. The virtue of the semantic pathfinder is its learned ability to find a best path of analysis steps on a per-concept basis. To show the generality of this indexing approach we develop detectors for a lexicon of 32 concepts and we evaluate the semantic pathfinder against the 2004 NIST TRECVID video retrieval benchmark, using a news archive of 64 hours. Top ranking performance indicates the merit of the semantic pathfinder. } }
88. Jan C. van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Robust Scene Categorization by Learning Image Statistics in Context," in Int’l Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR’06, New York, USA, 2006, pp. 105-112.
@INPROCEEDINGS{GemertSLAM06,   author = {Jan C. van Gemert and Jan-Mark Geusebroek and Cor J. Veenman and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Robust Scene Categorization by Learning Image Statistics in Context},   booktitle = {Int'l Workshop on Semantic Learning Applications in Multimedia, in conjunction with {CVPR'06}},   pages = {105--112},   month = {June},   year = {2006},   address = {New York, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-scene-slam2006.pdf},   abstract = { We present a generic and robust approach for scene categorization. A complex scene is described by proto-concepts like vegetation, water, fire, sky, etc. These proto-concepts are represented by low level features, where we use natural image statistics to compactly represent color invariant texture information by a Weibull distribution. We introduce the notion of contextures which preserve the context of textures in a visual scene with an occurrence histogram (context) of similarities to proto-concept descriptors (texture). In contrast to a codebook approach, we use the similarity to all vocabulary elements to generalize beyond the code words. Visual descriptors are attained by combining different types of contexts with different texture parameters. The visual scene descriptors are generalized to visual categories by training a support vector machine. We evaluate our approach on 3 different datasets: 1) 50 categories for the TRECVID video dataset; 2) the Caltech 101-object images; 3) 89 categories being the intersection of the Corel photo stock with the Art Explosion photo stock. Results show that our approach is robust over different datasets, while maintaining competitive performance. } }
89. Arnold W. M. Smeulders, Jan C. van Gemert, Jan-Mark Geusebroek, Cees G. M. Snoek, and Marcel Worring, "Browsing for the National Dutch Video Archive," in ISCCSP2006, Marrakech, Morocco, 2006.
@INPROCEEDINGS{SmeuldersISCCSP06,   author = {Arnold W. M. Smeulders and Jan C. van Gemert and Jan-Mark Geusebroek and Cees G. M. Snoek and Marcel Worring},   title = {Browsing for the National {Dutch} Video Archive},   booktitle = {ISCCSP2006},   pages = {},   month = {March},   year = {2006},   address = {Marrakech, Morocco},   pdf = {http://www.science.uva.nl/~smeulder/pubs/ISCCSP2006SmeuldersTEMP.pdf},   abstract = { Pictures have always been a prime carrier of Dutch culture. But pictures take a new form. We live in times of broad- and narrowcasting through the Internet, of passive and active viewers, of direct or delayed broadcast, and of digital pictures being delivered in the museum or at home. At the same time, the picture and television archives turn digital. Archives are going to be swamped with information requests unless they swiftly adapt to partially automatic annotation and digital retrieval. Our aim is to provide faster and more complete access to picture archives by digital analysis. Our approach consists of a multi-media analysis of features of pictures in tandem with the language that describes those pictures, under the guidance of a visual ontology. The general scientific paradigm we address is the detection of direct observables fused into semantic features learned from large repositories of digital video. We use invariant, natural-image statistics-based contextual feature sets for capturing the concepts of images and integrate that as early as possible with text. The system consists of a set of visual concepts, large for science yet small for practice, permitting the retrieval of semantically formulated queries. We will demonstrate a PC-based, off-line trained state-of-the-art system for browsing broadcast news archives. } }
90. Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Early versus Late Fusion in Semantic Video Analysis," in Proceedings of the ACM International Conference on Multimedia, Singapore, 2005, pp. 399-402.
@INPROCEEDINGS{SnoekACM05a,   author = {Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},   title = {Early versus Late Fusion in Semantic Video Analysis},   booktitle = {Proceedings of the {ACM} International Conference on Multimedia},   pages = {399--402},   month = {November},   year = {2005},   address = {Singapore},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-earlylate-acm2005.pdf},   abstract = { Semantic analysis of multimodal video aims to index segments of interest at a conceptual level. In reaching this goal, it requires an analysis of several information streams. At some point in the analysis these streams need to be fused. In this paper, we consider two classes of fusion schemes, namely early fusion and late fusion. The former fuses modalities in feature space, the latter fuses modalities in semantic space. We show by experiment on 184 hours of broadcast video data and for 20 semantic concepts, that late fusion tends to give slightly better performance for most concepts. However, for those concepts where early fusion performs better the difference is more significant. } }
91. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra, "On the Surplus Value of Semantic Video Analysis Beyond the Key Frame," in Proceedings of the IEEE International Conference on Multimedia & Expo, Amsterdam, The Netherlands, 2005.
@INPROCEEDINGS{SnoekICME05a,   author = {Cees G. M. Snoek and Marcel Worring and Jan-Mark Geusebroek and Dennis C. Koelma and Frank J. Seinstra},   title = {On the Surplus Value of Semantic Video Analysis Beyond the Key Frame},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {},   month = {July},   year = {2005},   address = {Amsterdam, The Netherlands},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-surplus-icme2005.pdf},   abstract = { Typical semantic video analysis methods aim for classification of camera shots based on extracted features from a single key frame only. In this paper, we sketch a video analysis scenario and evaluate the benefit of analysis beyond the key frame for semantic concept detection performance. We developed detectors for a lexicon of 26 concepts, and evaluated their performance on 120 hours of video data. Results show that, on average, detection performance can increase with almost 40\% when the analysis method takes more visual content into account. } }
92. Cees G. M. Snoek and Marcel Worring, "Multimedia Pattern Recognition in Soccer Video using Time Intervals," in Classification the Ubiquitous Challenge, Proceedings of the 28th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Dortmund, March 9-11, 2004, Berlin, Germany, 2005, pp. 97-108.
@INPROCEEDINGS{SnoekGFKL05,   author = {Cees G. M. Snoek and Marcel Worring},   title = {Multimedia Pattern Recognition in Soccer Video using Time Intervals},   booktitle = {Classification the Ubiquitous Challenge, Proceedings of the 28th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Dortmund, March 9-11, 2004},   publisher = {Springer-Verlag},   series = {Studies in Classification, Data Analysis, and Knowledge Organization},   editor = {C. Weihs and W. Gaul},   pages = {97--108},   year = {2005},   address = {Berlin, Germany},   pdf = {},   demo = {http://www.goalgle.com/},   abstract = { In this paper we propose the Time Interval Multimedia Event (TIME) framework as a robust approach for recognition of multimedia patterns, e.g. highlight events, in soccer video. The representation used in TIME extends the Allen temporal interval relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimedia pattern recognition. For automatic classification of highlights in soccer video, we compare three different machine learning techniques, i.e. C4.5 decision tree, Maximum Entropy, and Support Vector Machine. It was found that by using the TIME framework the amount of video a user has to watch in order to see almost all highlights can be reduced considerably, especially in combination with a Support Vector Machine. } }
93. Frank J. Seinstra, Cees G. M. Snoek, Dennis C. Koelma, Jan-Mark Geusebroek, and Marcel Worring, "User Transparent Parallel Processing of the 2004 NIST TRECVID Data Set," in Proceedings of the 19th IEEE International Parallel & Distributed Processing Symposium, Denver, USA, 2005, pp. 90-97.
@INPROCEEDINGS{SeinstraIPDPS05,   author = {Frank J. Seinstra and Cees G. M. Snoek and Dennis C. Koelma and Jan-Mark Geusebroek and Marcel Worring},   title = {User Transparent Parallel Processing of the 2004 {NIST} {TRECVID} Data Set},   booktitle = {Proceedings of the 19th IEEE International Parallel \& Distributed Processing Symposium},   pages = {90--97},   month = {April},   year = {2005},   address = {Denver, USA},   pdf = {http://staff.science.uva.nl/~fjseins/Papers/Conferences/ipdps2005.pdf},   abstract = { The Parallel-Horus framework, developed at the University of Amsterdam, is a unique software architecture that allows non-expert parallel programmers to develop fully sequential multimedia applications for efficient execution on homogeneous Beowulf-type commodity clusters. Previously obtained results for realistic, but relatively small-sized applications have shown the feasibility of the Parallel-Horus approach, with parallel performance consistently being found to be optimal with respect to the abstraction level of message passing programs. In this paper we discuss the most serious challenge Parallel-Horus has had to deal with so far: the processing of over 184 hours of video included in the 2004 NIST TRECVID evaluation, i.e. the de facto international standard benchmark for content-based video retrieval. Our results and experiences confirm that Parallel-Horus is a very powerful support tool for state-of-the-art research and applications in multimedia processing. } }
94. Cees G. M. Snoek, Marcel Worring, and Alexander G. Hauptmann, "Detection of TV News Monologues by Style Analysis," in Proceedings of the IEEE International Conference on Multimedia & Expo, Taipei, Taiwan, 2004.
@INPROCEEDINGS{SnoekICME04,   author = {Cees G. M. Snoek and Marcel Worring and Alexander G. Hauptmann},   title = {Detection of {TV} News Monologues by Style Analysis},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {},   month = {June},   year = {2004},   address = {Taipei, Taiwan},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-style-icme2004.pdf},   abstract = { We propose a method for detection of semantic concepts in produced video based on style analysis. Recognition of concepts is done by applying a classifier ensemble to the detected style elements. As a case study we present a method for detecting the concept of news subject monologues. Our approach had the best average precision performance amongst 26 submissions in the 2003 TRECVID benchmark. } }
95. Cees G. M. Snoek and Marcel Worring, "Time Interval Maximum Entropy based Event Indexing in Soccer Video," in Proceedings of the IEEE International Conference on Multimedia & Expo, Baltimore, USA, 2003, pp. 481-484.
@INPROCEEDINGS{SnoekICME03a,   author = {Cees G. M. Snoek and Marcel Worring},   title = {Time Interval Maximum Entropy based Event Indexing in Soccer Video},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   pages = {481--484},   month = {July},   year = {2003},   address = {Baltimore, USA},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/icme2003.pdf},   demo = {http://www.goalgle.com/},   abstract = { Multimodal indexing of events in video documents poses problems with respect to representation, inclusion of contextual information, and synchronization of the heterogeneous information sources involved. In this paper we present the Time Interval Maximum Entropy (TIME) framework that tackles aforementioned problems. To demonstrate the viability of TIME for event classification in multimodal video, an evaluation was performed on the domain of soccer broadcasts. It was found that by applying TIME, the amount of video a user has to watch in order to see almost all highlights can be reduced considerably. } }
96. Marcel Worring, Andrew Bagdanov, Jan C. van Gemert, Jan-Mark Geusebroek, Minh Hoang, Guus Schreiber, Cees G. M. Snoek, Jeroen Vendrig, Jan Wielemaker, and Arnold W. M. Smeulders, "Interactive Indexing and Retrieval of Multimedia Content," in Proceedings of the 29th Annual Conference on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic, 2002, pp. 135-148.
@INPROCEEDINGS{WorringSOFSEM02,   author = {Marcel Worring and Andrew Bagdanov and Jan C. van Gemert and Jan-Mark Geusebroek and Minh Hoang and Guus Schreiber and Cees G. M. Snoek and Jeroen Vendrig and Jan Wielemaker and Arnold W. M. Smeulders},   title = {Interactive Indexing and Retrieval of Multimedia Content},   booktitle = {Proceedings of the 29th Annual Conference on Current Trends in Theory and Practice of Informatics},   series = {Lecture Notes in Computer Science},   volume = {2540},   pages = {135--148},   publisher = {Springer-Verlag},   year = {2002},   address = {Milovy, Czech Republic},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sofsem2002.pdf},   abstract = { The indexing and retrieval of multimedia items is difficult due to the semantic gap between the user's perception of the data and the descriptions we can derive automatically from the data using computer vision, speech recognition, and natural language processing. In this contribution we consider the nature of the semantic gap in more detail and show examples of methods that help in limiting the gap. These methods can be automatic, but in general the indexing and retrieval of multimedia items should be a collaborative process between the system and the user. We show how to employ the user's interaction for limiting the semantic gap. } }
97. Cees G. M. Snoek and Marcel Worring, "A Review on Multimodal Video Indexing," in Proceedings of the IEEE International Conference on Multimedia & Expo, Lausanne, Switzerland, 2002, pp. 21-24.
@INPROCEEDINGS{SnoekICME02,   author = {Cees G. M. Snoek and Marcel Worring},   title = {A Review on Multimodal Video Indexing},   booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},   volume = {2},   pages = {21--24},   month = {August},   year = {2002},   address = {Lausanne, Switzerland},   pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/icme2002.pdf},   abstract = { Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is unfeasible for large video collections. Efficient, single modality based, video indexing methods have appeared in literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in collaborative fashion. In this paper we present a framework for multimodal video indexing, which views a video document from the perspective of its author. The framework serves as a blueprint for a generic and flexible multimodal video indexing system, and generalizes different state-of-the-art video indexing methods. It furthermore forms the basis for categorizing these different methods. } }

## National Meetings

1. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Convex Reduced Kernels for Visual Categorization," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
Best paper award
@INPROCEEDINGS{GavvesASCI12,   author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Convex Reduced Kernels for Visual Categorization},   booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Rotterdam, The Netherlands},   month = {October},   year = {2012},   note = {Best paper award},   pdf = {} }
2. Amirhossein Habibian and Cees G. M. Snoek, "Stop-Frame Removal Improves Web Video Classification," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
Best poster award
@INPROCEEDINGS{HabibianASCI12,   author = {Amirhossein Habibian and Cees G. M. Snoek},   title = {Stop-Frame Removal Improves Web Video Classification},   booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Rotterdam, The Netherlands},   month = {October},   year = {2012},   note = {Best poster award},   pdf = {} }
3. Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek, "Learning Concepts from the Web: Some Frames are More Important than Others," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
@INPROCEEDINGS{KordumovaASCI12,   author = {Svetlana Kordumova and Xirong Li and Cees G. M. Snoek},   title = {Learning Concepts from the Web: Some Frames are More Important than Others},   booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Rotterdam, The Netherlands},   month = {October},   year = {2012},   pdf = {} }
4. Masoud Mazloom, Efstratios Gavves, Koen E. A. van de Sande, and Cees G. M. Snoek, "Learning to Select Semantic Video Event Representations," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
@INPROCEEDINGS{MazloomASCI12,   author = {Masoud Mazloom and Efstratios Gavves and Koen E. A. van de Sande and Cees G. M. Snoek},   title = {Learning to Select Semantic Video Event Representations},   booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Rotterdam, The Netherlands},   month = {October},   year = {2012},   pdf = {} }
5. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Landmark Image Retrieval with Visual Synonyms," in Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging, Veldhoven, The Netherlands, 2010.
Best paper award
@INPROCEEDINGS{GavvesASCI10,   author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},   title = {Landmark Image Retrieval with Visual Synonyms},   booktitle = {Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Veldhoven, The Netherlands},   month = {November},   year = {2010},   note = {Best paper award},   pdf = {} }
6. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Combining Multi-feature Tag Relevance Learning for Social Image Retrieval," in Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging, Veldhoven, The Netherlands, 2010.
@INPROCEEDINGS{LiASCI10,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},   title = {Combining Multi-feature Tag Relevance Learning for Social Image Retrieval},   booktitle = {Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Veldhoven, The Netherlands},   month = {November},   year = {2010},   pdf = {} }
7. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Tag Relevance Learning for Social Image Retrieval and Labeling," in Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging, Zeewolde, The Netherlands, 2009.
@INPROCEEDINGS{LiASCI09,   author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},   title = {Tag Relevance Learning for Social Image Retrieval and Labeling},   booktitle = {Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Zeewolde, The Netherlands},   month = {June},   year = {2009},   pdf = {} }
8. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Empowering Visual Categorization with the GPU," in Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging, Zeewolde, The Netherlands, 2009.
@INPROCEEDINGS{SandeASCI09,   author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},   title = {Empowering Visual Categorization with the {GPU}},   booktitle = {Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Zeewolde, The Netherlands},   month = {June},   year = {2009},   pdf = {} }
9. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Consuming Videos with the ForkBrowser," in Proceedings of the 14th Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 2008.
@INPROCEEDINGS{RooijASCI08,   author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},   title = {Consuming Videos with the ForkBrowser},   booktitle = {Proceedings of the 14th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Heijen, The Netherlands},   month = {June},   year = {2008},   pdf = {} }
10. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Multi Thread Video Browsing," in Proceedings of the 13th Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 2007.
@INPROCEEDINGS{RooijASCI07,   author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},   title = {Multi Thread Video Browsing},   booktitle = {Proceedings of the 13th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Heijen, The Netherlands},   month = {June},   year = {2007},   pdf = {} }
11. Jan C. van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, and Cees G. M. Snoek, "Generic and Robust Scene Categorization by Learning Context," in Proceedings of the 12th Annual Conference of the Advanced School for Computing and Imaging, Lommel, Belgium, 2006.
@INPROCEEDINGS{GemertASCI06,   author = {Jan C. van Gemert and Jan-Mark Geusebroek and Cor J. Veenman and Cees G. M. Snoek},   title = {Generic and Robust Scene Categorization by Learning Context},   booktitle = {Proceedings of the 12th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Lommel, Belgium},   month = {June},   year = {2006},   pdf = {} }
12. Cees G. M. Snoek and Marcel Worring, "Time Interval based Modelling and Classification of Events in Soccer Video," in Proceedings of the 9th Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 2003.
@INPROCEEDINGS{SnoekASCI03,   author = {Cees G. M. Snoek and Marcel Worring},   title = {Time Interval based Modelling and Classification of Events in Soccer Video},   booktitle = {Proceedings of the 9th Annual Conference of the Advanced School for Computing and Imaging},   pages = {},   address = {Heijen, The Netherlands},   month = {June},   year = {2003},   pdf = {} }
13. Cees G. M. Snoek and Marcel Worring, "A State-of-the-art Review on Multimodal Video Indexing," in Proceedings of the 8th Annual Conference of the Advanced School for Computing and Imaging, Lochem, The Netherlands, 2002, pp. 194-202.
@INPROCEEDINGS{SnoekASCI02,   author = {Cees G. M. Snoek and Marcel Worring},   title = {A State-of-the-art Review on Multimodal Video Indexing},   booktitle = {Proceedings of the 8th Annual Conference of the Advanced School for Computing and Imaging},   pages = {194--202},   address = {Lochem, The Netherlands},   month = {June},   year = {2002},   pdf = {} }
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.