Publications

Conference Papers

  1. Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Tracking by Natural Language Specification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017.
    @INPROCEEDINGS{LiCVPR17,
      author = {Zhenyang Li and Ran Tao and Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Tracking by Natural Language Specification},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      month = {July},
      year = {2017},
      address = {Honolulu, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-tracking-language-cvpr2017.pdf},
      abstract = { This paper strives to track a target object in a video. Rather than specifying the target in the first frame of a video by a bounding box, we propose to track the object based on a natural language specification of the target, which provides a more natural human-machine interaction as well as a means to improve tracking results. We define three variants of tracking by language specification: one relying on lingual target specification only, one relying on visual target specification based on language, and one leveraging their joint capacity. To show the potential of tracking by natural language specification we extend two popular tracking datasets with lingual descriptions and report experiments. Finally, we also sketch new tracking scenarios in surveillance and other live video streams that become feasible with a lingual specification of the target. }
    }
  2. Thomas Mensink, Thomas Jongstra, Pascal Mettes, and Cees G. M. Snoek, "Music-Guided Video Summarization using Quadratic Assignments," in Proceedings of the ACM International Conference on Multimedia Retrieval, Bucharest, Romania, 2017.
    @INPROCEEDINGS{MensinkICMR17,
      author = {Thomas Mensink and Thomas Jongstra and Pascal Mettes and Cees G. M. Snoek},
      title = {Music-Guided Video Summarization using Quadratic Assignments},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2017},
      pages = {},
      address = {Bucharest, Romania},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-music-video-summarization-icmr2017.pdf},
      abstract = { This paper aims to automatically generate a summary of an unedited video, guided by an externally provided music-track. The tempo, energy and beats in the music determine the choices and cuts in the video summarization. To solve this challenging task, we model video summarization as a quadratic assignment problem. We assign frames to the summary, using rewards based on frame interestingness, plot coherency, audio-visual match, and cut properties. Experimentally we validate our approach on the SumMe dataset. The results show that our music guided summaries are more appealing, and even outperform the current state-of-the-art summarization methods when evaluated on the F1 measure of precision and recall. }
    }
  3. Jianfeng Dong, Xirong Li, and Cees G. M. Snoek, "Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction," arXiv preprint arXiv:1604.06838, 2016.
    @INPROCEEDINGS{DongTEMP16,
      author = {Jianfeng Dong and Xirong Li and Cees G. M. Snoek},
      title = {Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction},
      booktitle = {arXiv preprint arXiv:1604.06838},
      month = {},
      year = {2016},
      pages = {},
      pdf = {http://arxiv.org/abs/1604.06838}
    }
  4. Rama Kovvuri, Ram Nevatia, and Cees G. M. Snoek, "Segment-based Models for Event Detection and Recounting," in Proceedings of the International Conference on Pattern Recognition, Cancun, Mexico, 2016.
    @INPROCEEDINGS{KovvuriICPR16,
      author = {Rama Kovvuri and Ram Nevatia and Cees G. M. Snoek},
      title = {Segment-based Models for Event Detection and Recounting},
      booktitle = {Proceedings of the International Conference on Pattern Recognition},
      month = {December},
      year = {2016},
      pages = {},
      address = {Cancun, Mexico},
      pdf = {},
      abstract = { }
    }
  5. Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, and Cees G. M. Snoek, "Early Embedding and Late Reranking for Video Captioning," in Proceedings of the ACM International Conference on Multimedia, Amsterdam, The Netherlands, 2016.
    Multimedia Grand Challenge winner
    @INPROCEEDINGS{DongMM16,
      author = {Jianfeng Dong and Xirong Li and Weiyu Lan and Yujia Huo and Cees G. M. Snoek},
      title = {Early Embedding and Late Reranking for Video Captioning},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {October},
      year = {2016},
      pages = {},
      address = {Amsterdam, The Netherlands},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/dong-captioning-mm2016.pdf},
      note = {Multimedia Grand Challenge winner},
      abstract = { This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for re-scoring generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules add a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the blind test by the organizers. Our system is ranked at the 4th place in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions. }
    }
  6. Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees G. M. Snoek, and Tinne Tuytelaars, "Online Action Detection," in European Conference on Computer Vision, Amsterdam, The Netherlands, 2016.
    @INPROCEEDINGS{GeestECCV16,
      author = {Roeland De Geest and Efstratios Gavves and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek and Tinne Tuytelaars},
      title = {Online Action Detection},
      booktitle = {European Conference on Computer Vision},
      month = {October},
      year = {2016},
      pages = {},
      address = {Amsterdam, The Netherlands},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/geest-online-action-eccv2016.pdf},
      data = {http://homes.esat.kuleuven.be/~rdegeest/TVSeries.html},
      abstract = { In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data. }
    }
  7. Pascal Mettes, Jan C. van Gemert, and Cees G. M. Snoek, "Spot On: Action Localization from Pointly-Supervised Proposals," in European Conference on Computer Vision, Amsterdam, The Netherlands, 2016.
    Oral presentation, top 1.8%
    @INPROCEEDINGS{MettesECCV16,
      author = {Pascal Mettes and Jan C. van Gemert and Cees G. M. Snoek},
      title = {Spot On: Action Localization from Pointly-Supervised Proposals},
      booktitle = {European Conference on Computer Vision},
      month = {October},
      year = {2016},
      pages = {},
      address = {Amsterdam, The Netherlands},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-pointly-eccv2016.pdf},
      data = {http://isis-data.science.uva.nl/mettes/hollywood2tubes.tar.gz},
      note = {Oral presentation, top 1.8%},
      abstract = { We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier demanding carefully annotated box annotations at train time. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at tinyurl.com/hollywood2tubes. }
    }
  8. Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Video Stream Retrieval of Unseen Queries using Semantic Memory," in Proceedings of the British Machine Vision Conference, York, UK, 2016.
    @INPROCEEDINGS{CappalloBMVC16,
      author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},
      title = {Video Stream Retrieval of Unseen Queries using Semantic Memory},
      booktitle = {Proceedings of the British Machine Vision Conference},
      month = {September},
      year = {2016},
      pages = {},
      address = {York, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-videostream-bmvc2016.pdf},
      abstract = { Retrieval of live, user-broadcast video streams is an under-addressed and increasingly relevant challenge. The on-line nature of the problem requires temporal evaluation and the unforeseeable scope of potential queries motivates an approach which can accommodate arbitrary search queries. To account for the breadth of possible queries, we adopt a no-example approach to query retrieval, which uses a query's semantic relatedness to pre-trained concept classifiers. To adapt to shifting video content, we propose memory pooling and memory welling methods that favor recent information over long past content. We identify two stream retrieval tasks, instantaneous retrieval at any particular time and continuous retrieval over a prolonged duration, and propose means for evaluating them. Three large scale video datasets are adapted to the challenge of stream retrieval. We report results for our search methods on the new stream retrieval tasks, as well as demonstrate their efficacy in a traditional, non-streaming video task. }
    }
  9. Svetlana Kordumova, Thomas Mensink, and Cees G. M. Snoek, "Pooling Objects for Recognizing Scenes without Examples," in Proceedings of the ACM International Conference on Multimedia Retrieval, New York, USA, 2016.
    Best paper award
    @INPROCEEDINGS{KordumovaICMR16,
      author = {Svetlana Kordumova and Thomas Mensink and Cees G. M. Snoek},
      title = {Pooling Objects for Recognizing Scenes without Examples},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2016},
      pages = {},
      address = {New York, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-pooling-objects-icmr2016.pdf},
      note = {Best paper award},
      abstract = { In this paper we aim to recognize scenes in images without using any scene images as training data. Different from attribute based approaches, we do not carefully select the training classes to match the unseen scene classes. Instead, we propose a pooling over ten thousand of off-the-shelf object classifiers. To steer the knowledge transfer between objects and scenes we learn a semantic embedding with the aid of a large social multimedia corpus. Our key contributions are: we are the first to investigate pooling over ten thousand object classifiers to recognize scenes without examples; we explore the ontological hierarchy of objects and analyze the influence of object classifiers from different hierarchy levels; we exploit object positions in scene images and we demonstrate a new scene retrieval scenario with complex queries. Finally, we outperform attribute representations on two challenging scene datasets, SUNAttributes and Places2. }
    }
  10. Pascal Mettes, Dennis Koelma, and Cees G. M. Snoek, "The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection," in Proceedings of the ACM International Conference on Multimedia Retrieval, New York, USA, 2016.
    @INPROCEEDINGS{MettesICMR16,
      author = {Pascal Mettes and Dennis Koelma and Cees G. M. Snoek},
      title = {The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2016},
      pages = {},
      address = {New York, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettest-imagenetshuffle-icmr2016.pdf},
      data = {http://tinyurl.com/imagenetshuffle},
      abstract = { This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, who all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pre-trained with our reorganized hierarchy i) improves over standard pre-training, ii) is complementary among different reorganizations, iii) maintains the benefits of fusion with other modalities, and iv) leads to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/imagenetshuffle. }
    }
  11. Arnav Agharwal, Rama Kovvuri, Ram Nevatia, and Cees G. M. Snoek, "Tag-based Video Retrieval by Embedding Semantic Content in a Continuous Word Space," in IEEE Winter Conference on Applications of Computer Vision, Lake Placid, USA, 2016, pp. 1-8.
    @INPROCEEDINGS{AgharwalWACV16,
      author = {Arnav Agharwal and Rama Kovvuri and Ram Nevatia and Cees G. M. Snoek},
      title = {Tag-based Video Retrieval by Embedding Semantic Content in a Continuous Word Space},
      booktitle = {IEEE Winter Conference on Applications of Computer Vision},
      month = {March},
      year = {2016},
      pages = {1--8},
      address = {Lake Placid, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/agharwal-continuous-wacv2016.pdf},
      abstract = { Content-based event retrieval in unconstrained web videos, based on query tags, is a hard problem due to large intra-class variances, and limited vocabulary and accuracy of the video concept detectors, creating a "semantic query gap". We present a technique to overcome this gap by using continuous word space representations to explicitly compute query and detector concept similarity. This not only allows for fast query-video similarity computation with implicit query expansion, but leads to a compact video representation, which allows implementation of a real-time retrieval system that can fit several thousand videos in a few hundred megabytes of memory. We evaluate the effectiveness of our representation on the challenging NIST MEDTest 2014 dataset. }
    }
  12. Svetlana Kordumova, Jan C. van Gemert, and Cees G. M. Snoek, "Exploring the Long Tail of Social Media Tags," in International Conference on Multimedia Modelling, Miami, USA, 2016.
    @INPROCEEDINGS{KordumovaMMM16,
      author = {Svetlana Kordumova and Jan C. van Gemert and Cees G. M. Snoek},
      title = {Exploring the Long Tail of Social Media Tags},
      booktitle = {International Conference on Multimedia Modelling},
      month = {January},
      year = {2016},
      pages = {},
      address = {Miami, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-longtail-mmm2016.pdf},
      abstract = { There are millions of users who tag multimedia content, generating a large vocabulary of tags. Some tags are frequent, while other tags are rarely used following a long tail distribution. For frequent tags, most of the multimedia methods that aim to automatically understand audio-visual content, give excellent results. It is not clear, however, how these methods will perform on rare tags. In this paper we investigate what social tags constitute the long tail and how they perform on two multimedia retrieval scenarios, tag relevance and detector learning. We show common valuable tags within the long tail, and by augmenting them with semantic knowledge, the performance of tag relevance and detector learning improves substantially. }
    }
  13. Efstratios Gavves, Thomas Mensink, Tatiana Tommasi, Cees G. M. Snoek, and Tinne Tuytelaars, "Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
    @INPROCEEDINGS{GavvesICCV15,
      author = {Efstratios Gavves and Thomas Mensink and Tatiana Tommasi and Cees G. M. Snoek and Tinne Tuytelaars},
      title = {Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks},
      booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},
      pages = {},
      month = {December},
      year = {2015},
      address = {Santiago, Chile},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-zero-shot-priors-iccv2015.pdf},
      abstract = { How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to the existing classifiers. By revisiting the dual formulation of adaptive SVM, we reveal two basic conditions to choose greedily only the most relevant samples to be annotated. On this basis we propose an effective active learning algorithm which learns the best possible target classification model with minimum human labeling effort. Extensive experiments on two challenging datasets show the value of our approach compared to the state-of-the-art active learning methodologies, as well as its potential to reuse past datasets with minimal effort for future tasks. }
    }
  14. Mihir Jain, Jan C. van Gemert, Thomas Mensink, and Cees G. M. Snoek, "Objects2action: Classifying and localizing actions without any video example," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
    @INPROCEEDINGS{JainICCV15,
      author = {Mihir Jain and Jan C. van Gemert and Thomas Mensink and Cees G. M. Snoek},
      title = {Objects2action: Classifying and localizing actions without any video example},
      booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},
      month = {December},
      year = {2015},
      pages = {},
      address = {Santiago, Chile},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-objects2action-iccv2015.pdf},
      data = {https://staff.fnwi.uva.nl/m.jain/projects/Objects2action.html},
      abstract = { The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach. }
    }
  15. Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Image2Emoji: Zero-shot Emoji Prediction for Visual Media," in Proceedings of the ACM International Conference on Multimedia, Brisbane, Australia, 2015.
    @INPROCEEDINGS{CappalloMM15,
      author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},
      title = {Image2Emoji: Zero-shot Emoji Prediction for Visual Media},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {},
      month = {October},
      year = {2015},
      address = {Brisbane, Australia},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-image2emoji-mm2015.pdf},
      demo = {http://www.emoji2video.com},
      abstract = { We present Image2Emoji, a multi-modal approach for generating emoji labels for an image in a zero-shot manner. Different from existing zero-shot image-to-text approaches, we exploit both image and textual media to learn a semantic embedding for the new task of emoji prediction. We propose that the widespread adoption of emoji suggests a semantic universality which is well-suited for interaction with visual media. We quantify the efficacy of our proposed model on the MSCOCO dataset, and demonstrate the value of visual, textual and multi-modal prediction of emoji. We conclude the paper with three examples of the application potential of emoji in the context of multimedia retrieval. }
    }
  16. Jan van Gemert, Mihir Jain, Ella Gati, and Cees G. M. Snoek, "APT: Action localization proposals from dense trajectories," in Proceedings of the British Machine Vision Conference, Swansea, UK, 2015.
    @INPROCEEDINGS{GemertBMVC15,
      author = {Jan van Gemert and Mihir Jain and Ella Gati and Cees G. M. Snoek},
      title = {{APT}: Action localization proposals from dense trajectories},
      booktitle = {Proceedings of the British Machine Vision Conference},
      month = {September},
      year = {2015},
      pages = {},
      address = {Swansea, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-apt-proposals-bmvc2015-corrected.pdf},
      software = {https://github.com/jvgemert/apt},
      abstract = { This paper is on action localization in video with the aid of spatio-temporal proposals. To alleviate the computational expensive segmentation step of existing proposals, we propose bypassing the segmentations completely by generating proposals directly from the dense trajectories used to represent videos during classification. Our Action localization Proposals from dense Trajectories (APT) use an efficient proposal generation algorithm to handle the high number of trajectories in a video. Our spatio-temporal proposals are faster than current methods and outperform the localization and classification accuracy of current proposals on the UCF Sports, UCF 101, and MSR-II video datasets. Corrected version: we fixed a mistake in our UCF-101 ground truth. Numbers are different; conclusions are unchanged. }
    }
  17. Markus Nagel, Thomas Mensink, and Cees G. M. Snoek, "Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams," in Proceedings of the British Machine Vision Conference, Swansea, UK, 2015.
    @INPROCEEDINGS{NagelBMVC15,
      author = {Markus Nagel and Thomas Mensink and Cees G. M. Snoek},
      title = {Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams},
      booktitle = {Proceedings of the British Machine Vision Conference},
      month = {September},
      year = {2015},
      pages = {},
      address = {Swansea, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/nagel-event-fisher-bmvc2015.pdf},
      abstract = { In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student's-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets. }
    }
  18. Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Latent Factors of Visual Popularity Prediction," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
    @INPROCEEDINGS{CappalloICMR15,
      author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},
      title = {Latent Factors of Visual Popularity Prediction},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2015},
      pages = {},
      address = {Shanghai, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-visual-popularity-icmr2015.pdf},
      abstract = { Predicting the popularity of an image on social networks based solely on its visual content is a difficult problem. One image may become widely distributed and repeatedly shared, while another similar image may be totally overlooked. We aim to gain insight into how visual content affects image popularity. We propose a latent ranking approach that takes into account not only the distinctive visual cues in popular images, but also those in unpopular images. This method is evaluated on two existing datasets collected from photo-sharing websites, as well as a new proposed dataset of images from the microblogging website Twitter. Our experiments investigate factors of the ranking model, the level of user engagement in scoring popularity, and whether the discovered senses are meaningful. The proposed approach yields state of the art results, and allows for insight into the semantics of image popularity on social networks. }
    }
  19. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "Discovering Semantic Vocabularies for Cross-Media Retrieval," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
    @INPROCEEDINGS{HabibianICMR15,
      author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},
      title = {Discovering Semantic Vocabularies for Cross-Media Retrieval},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2015},
      pages = {},
      address = {Shanghai, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-semantic-vocabularies-icmr2015.pdf},
      abstract = { This paper proposes a data-driven approach for cross-media retrieval by automatically learning its underlying semantic vocabulary. Different from the existing semantic vocabularies, which are manually pre-defined and annotated, we automatically discover the vocabulary concepts and their annotations from multimedia collections. To this end, we apply a probabilistic topic model on the text available in the collection to extract its semantic structure. Moreover, we propose a learning to rank framework, to effectively learn the concept classifiers from the extracted annotations. We evaluate the discovered semantic vocabulary for cross-media retrieval on three datasets of image/text and video/text pairs. Our experiments demonstrate that the discovered vocabulary does not require \emph{any} manual labeling to outperform three recent alternatives for cross-media retrieval. }
    }
  20. Masoud Mazloom, Amirhossein Habibian, Dong Liu, Cees G. M. Snoek, and Shih-Fu Chang, "Encoding Concept Prototypes for Video Event Detection and Summarization," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
    @INPROCEEDINGS{MazloomICMR15,
      author = {Masoud Mazloom and Amirhossein Habibian and Dong Liu and Cees G. M. Snoek and Shih-Fu Chang},
      title = {Encoding Concept Prototypes for Video Event Detection and Summarization},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2015},
      pages = {},
      address = {Shanghai, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-concept-prototypes-icmr2015.pdf},
      abstract = { This paper proposes a new semantic video representation for few and zero example event detection and unsupervised video event summarization. Different from existing works, which obtain a semantic representation by training concepts over images or entire video clips, we propose an algorithm that learns a set of relevant frames as the concept prototypes from web video examples, without the need for frame-level annotations, and use them for representing an event video. We formulate the problem of learning the concept prototypes as seeking the frames closest to the densest region in the feature space of video frames from both positive and negative training videos of a target concept. We study the behavior of our video event representation based on concept prototypes by performing three experiments on challenging web videos from the TRECVID 2013 multimedia event detection task and the MED-summaries dataset. Our experiments establish that i) Event detection accuracy increases when mapping each video into concept prototype space. ii) Zero-example event detection increases by analyzing each frame of a video individually in concept prototype space, rather than considering the holistic videos. iii) Unsupervised video event summarization using concept prototypes is more accurate than using video-level concept detectors. }
    }
  21. Pascal Mettes, Jan C. van Gemert, Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek, "Bag-of-Fragments: Selecting and encoding video fragments for event detection and recounting," in Proceedings of the ACM International Conference on Multimedia Retrieval, Shanghai, China, 2015.
    @INPROCEEDINGS{MettesICMR15,
      author = {Pascal Mettes and Jan C. van Gemert and Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},
      title = {Bag-of-Fragments: Selecting and encoding video fragments for event detection and recounting},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2015},
      pages = {},
      address = {Shanghai, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-bag-of-fragments-icmr2015.pdf},
      abstract = { The goal of this paper is event detection and recounting using a representation of concept detector scores. Different from existing work, which encodes videos by averaging concept scores over all frames, we propose to encode videos using fragments that are discriminatively learned per event. Our bag-of-fragments split a video into semantically coherent fragment proposals. From training video proposals we show how to select the most discriminative fragment for an event. An encoding of a video is in turn generated by matching and pooling these discriminative fragments to the fragment proposals of the video. The bag-of-fragments forms an effective encoding for event detection and is able to provide a precise temporally localized event recounting. Furthermore, we show how bag-of-fragments can be extended to deal with irrelevant concepts in the event recounting. Experiments on challenging web videos show that i) our modest number of fragment proposals give a high sub-event recall, ii) bag-of-fragments is complementary to global averaging and provides better event detection, iii) bag-of-fragments with concept filtering yields a desirable event recounting. We conclude that fragments matter for video event detection and recounting. }
    }
  22. Mihir Jain, Jan C. van Gemert, and Cees G. M. Snoek, "What do 15,000 object categories tell us about classifying and localizing actions?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015.
    @INPROCEEDINGS{JainCVPR15,
      author = {Mihir Jain and Jan C. van Gemert and Cees G. M. Snoek},
      title = {What do 15,000 object categories tell us about classifying and localizing actions?},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      month = {June},
      year = {2015},
      pages = {},
      address = {Boston, MA, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-objects-actions-cvpr2015.pdf},
      data = {https://staff.fnwi.uva.nl/m.jain/projects/15kObjectsForAction.html},
      abstract = { This paper contributes to automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are i) the first in-depth study of encoding objects for actions, ii) we show that objects matter for actions, and are often semantically relevant as well. iii) We establish that actions have object preferences. Rather than using all objects, selection is advantageous for action recognition. iv) We reveal that object-action relations are generic, which allows transferring these relationships from the one domain to the other. And, v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization. }
    }
  23. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events," in Proceedings of the ACM International Conference on Multimedia, Orlando, Florida, USA, 2014, pp. 17-26.
    Best paper award
    @INPROCEEDINGS{HabibianMM14,
      author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},
      title = {VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {17--26},
      month = {November},
      year = {2014},
      address = {Orlando, Florida, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-videostory-mm2014.pdf},
      note = {Best paper award},
      abstract = { This paper proposes a new video representation for few-example event recognition and translation. Different from existing representations, which rely on either low-level features, or pre-specified attributes, we propose to learn an embedding from videos and their descriptions. In our embedding, which we call VideoStory, correlated term labels are combined if their combination improves the video classifier prediction. Our proposed algorithm prevents the combination of correlated terms which are visually dissimilar by optimizing a joint-objective balancing descriptiveness and predictability. The algorithm learns from textual descriptions of video content, which we obtain for free from the web by a simple spidering procedure. We use our VideoStory representation for few-example recognition of events on more than 65K challenging web videos from the NIST TRECVID event detection task and the Columbia Consumer Video collection. Our experiments establish that i) VideoStory outperforms an embedding without joint-objective and alternatives without any embedding, ii) The varying quality of input video descriptions from the web is compensated by harvesting more data, iii) VideoStory sets a new state-of-the-art for few-example event recognition, outperforming very recent attribute and low-level motion encodings. What is more, VideoStory translates a previously unseen video to its most likely description from visual content only. }
    }
  24. Zhenyang Li, Efstratios Gavves, Thomas Mensink, and Cees G. M. Snoek, "Attributes Make Sense on Segmented Objects," in European Conference on Computer Vision, Zürich, Switzerland, 2014.
    @INPROCEEDINGS{LiECCV14,
      author = {Zhenyang Li and Efstratios Gavves and Thomas Mensink and Cees G. M. Snoek},
      title = {Attributes Make Sense on Segmented Objects},
      booktitle = {European Conference on Computer Vision},
      pages = {},
      month = {September},
      year = {2014},
      address = {Z\"urich, Switzerland},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-object-level-attributes-eccv2014.pdf},
      abstract = { In this paper we aim for object classification and segmentation by attributes. Where existing work considers attributes either for the global image or for the parts of the object, we propose, as our first novelty, to learn and extract attributes on segments containing the entire object. Object-level attributes suffer less from accidental content around the object and accidental image conditions such as partial occlusions, scale changes and viewpoint changes. As our second novelty, we propose joint learning for simultaneous object classification and segment proposal ranking, solely on the basis of attributes. This naturally brings us to our third novelty: object-level attributes for zero-shot, where we use attribute descriptions of unseen classes for localizing their instances in new images and classifying them accordingly. Results on the Caltech UCSD Birds, Leeds Butterflies, and an a-Pascal subset demonstrate that i) extracting attributes on oracle object-level brings substantial benefits ii) our joint learning model leads to accurate attribute-based classification and segmentation, approaching the oracle results and iii) object-level attributes also allow for zero-shot classification and segmentation. We conclude that attributes make sense on segmented objects. }
    }
  25. Mihir Jain, Jan C. van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees G. M. Snoek, "Action Localization by Tubelets from Motion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
    @INPROCEEDINGS{JainCVPR14,
      author = {Mihir Jain and Jan C. van Gemert and Herv\'e J\'egou and Patrick Bouthemy and Cees G. M. Snoek},
      title = {Action Localization by Tubelets from Motion},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      month = {June},
      year = {2014},
      pages = {},
      address = {Columbus, Ohio, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-tubelets-cvpr2014.pdf},
      abstract = { This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in the context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-the-art on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences. }
    }
  26. Thomas Mensink, Efstratios Gavves, and Cees G. M. Snoek, "COSTA: Co-Occurrence Statistics for Zero-Shot Classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
    @INPROCEEDINGS{MensinkCVPR14,
      author = {Thomas Mensink and Efstratios Gavves and Cees G. M. Snoek},
      title = {COSTA: Co-Occurrence Statistics for Zero-Shot Classification},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      month = {June},
      year = {2014},
      pages = {},
      address = {Columbus, Ohio, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-co-occurrence-cvpr2014.pdf},
      abstract = { In this paper we aim for zero-shot classification, that is visual recognition of an unseen class by using knowledge transfer from known classes. Our main contribution is COSTA, which exploits co-occurrences of visual concepts in images for knowledge transfer. These inter-dependencies arise naturally between concepts, and are easy to obtain from existing annotations or web-search hit counts. We estimate a classifier for a new label, as a weighted combination of related classes, using the co-occurrences to define the weight. We propose various metrics to leverage these co-occurrences, and a regression model for learning a weight for each related class. We also show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three multi-labeled datasets reveal that our proposed zero-shot methods, are approaching and occasionally outperforming fully supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification. }
    }
  27. Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Fisher and VLAD with FLAIR," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
    @INPROCEEDINGS{SandeCVPR14,
      author = {Koen E. A. van de Sande and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Fisher and VLAD with FLAIR},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      month = {June},
      year = {2014},
      pages = {},
      address = {Columbus, Ohio, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-flair-cvpr2014.pdf},
      abstract = { A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLADs difference coding, even with l2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The results are an 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge. }
    }
  28. Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Locality in Generic Instance Search from One Example," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
    @INPROCEEDINGS{TaoCVPR14,
      author = {Ran Tao and Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Locality in Generic Instance Search from One Example},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      month = {June},
      year = {2014},
      pages = {},
      address = {Columbus, Ohio, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/tao-locality-cvpr2014.pdf},
      abstract = { This paper aims for generic instance search from a single example. Where the state-of-the-art relies on global image representation for the search, we proceed by including locality at all steps of the method. As the first novelty, we consider many boxes per database image as candidate targets to search locally in the picture using an efficient point-indexed representation. The same representation allows, as the second novelty, the application of very large vocabularies in the powerful Fisher vector and VLAD to search locally in the feature space. As the third novelty we propose an exponential similarity function to further emphasize locality in the feature space. Locality is advantageous in instance search as it will rest on the matching unique details. We demonstrate a substantial increase in generic instance search performance from one example on three standard datasets with buildings, logos, and scenes from 0.443 to 0.620 in mAP. }
    }
  29. Julien van Hout, Eric Yeh, Dennis Koelma, Cees G. M. Snoek, Chen Sun, Ramakant Nevatia, Julie Wong, and Gregory Myers, "Late Fusion and Calibration for Multimedia Event Detection Using Few Examples," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, 2014.
    @INPROCEEDINGS{vanHoutICASSP14,
      author = {Julien van Hout and Eric Yeh and Dennis Koelma and Cees G. M. Snoek and Chen Sun and Ramakant Nevatia and Julie Wong and Gregory Myers},
      title = {Late Fusion and Calibration for Multimedia Event Detection Using Few Examples},
      booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing},
      month = {May},
      year = {2014},
      pages = {},
      address = {Florence, Italy},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/hout-fusion-calibration-icassp2014.pdf},
      abstract = { The state-of-the-art in example-based multimedia event detection (MED) rests on heterogeneous classifiers whose scores are typically combined in a late-fusion scheme. Recent studies on this topic have failed to reach a clear consensus as to whether machine learning techniques can outperform rule-based fusion schemes with varying amount of training data. In this paper, we present two parametric approaches to late fusion: a normalization scheme for arithmetic mean fusion (logistic averaging) and a fusion scheme based on logistic regression, and compare them to widely used rule-based fusion schemes. We also describe how logistic regression can be used to calibrate the fused detection scores to predict an optimal threshold given a detection prior and costs on errors. We discuss the advantages and shortcomings of each approach when the amount of positives available for training varies from 10 positives (10Ex) to 100 positives (100Ex). Experiments were run using video data from the NIST TRECVID MED 2013 evaluation and results were reported in terms of a ranking metric: the mean average precision (mAP) and R0, a cost-based metric introduced in TRECVID MED 2013. }
    }
  30. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "Composite Concept Discovery for Zero-Shot Video Event Detection," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
    @INPROCEEDINGS{HabibianICMR14long,
      author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},
      title = {Composite Concept Discovery for Zero-Shot Video Event Detection},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2014},
      pages = {},
      address = {Glasgow, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-composite-icmr14.pdf},
      abstract = { We consider automated detection of events in video without the use of any visual training examples. A common approach is to represent videos as classification scores obtained from a vocabulary of pre-trained concept classifiers. Where others construct the vocabulary by training individual concept classifiers, we propose to train classifiers for combination of concepts composed by Boolean logic operators. We call these concept combinations composite concepts and contribute an algorithm that automatically discovers them from existing video-level concept annotations. We discover composite concepts by jointly optimizing the accuracy of concept classifiers and their effectiveness for detecting events. We demonstrate that by combining concepts into composite concepts, we can train more accurate classifiers for the concept vocabulary, which leads to improved zero-shot event detection. Moreover, we demonstrate that by using different logic operators, namely "AND", "OR", we discover different types of composite concepts, which are complementary for zero-shot event detection. We perform a search for 20 events in 41K web videos from two test sets of the challenging TRECVID Multimedia Event Detection 2013 corpus. The experiments demonstrate the superior performance of the discovered composite concepts, compared to present-day alternatives, for zero-shot event detection. }
    }
  31. Amirhossein Habibian and Cees G. M. Snoek, "Stop-Frame Removal Improves Web Video Classification," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
    @INPROCEEDINGS{HabibianICMR14short,
      author = {Amirhossein Habibian and Cees G. M. Snoek},
      title = {Stop-Frame Removal Improves Web Video Classification},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2014},
      pages = {},
      address = {Glasgow, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-stopframe-icmr14.pdf},
      abstract = { Web videos available in sharing sites like YouTube are becoming an alternative to manually annotated training data, which are necessary for creating video classifiers. However, when looking into web videos, we observe they contain several irrelevant frames that may randomly appear in any video, i.e., blank and over exposed frames. We call these irrelevant frames stop-frames and propose a simple algorithm to identify and exclude them during classifier training. Stop-frames might appear in any video, so it is hard to recognize their category. Therefore we identify stop-frames as those frames, which are commonly misclassified by any concept classifier. Our experiments demonstrate that using our algorithm improves classification accuracy by 60% and 24% in terms of mean average precision for an event and concept detection benchmark. }
    }
  32. Masoud Mazloom, Xirong Li, and Cees G. M. Snoek, "Few-Example Video Event Retrieval Using Tag Propagation," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
    @INPROCEEDINGS{MazloomICMR14,
      author = {Masoud Mazloom and Xirong Li and Cees G. M. Snoek},
      title = {Few-Example Video Event Retrieval Using Tag Propagation},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2014},
      pages = {},
      address = {Glasgow, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-tagpropagation-icmr14.pdf},
      abstract = { An emerging topic in multimedia retrieval is to detect a complex event in video using only a handful of video examples. Different from existing work which learns a ranker from positive video examples and hundreds of negative examples, we aim to query web video for events using zero or only a few visual examples. To that end, we propose in this paper a tag-based video retrieval system which propagates tags from a tagged video source to an unlabeled video collection without the need of any training examples. Our algorithm is based on weighted frequency neighbor voting using concept vector similarity. Once tags are propagated to unlabeled video we can rely on off-the-shelf language models to rank these videos by the tag similarity. We study the behavior of our tag-based video event retrieval system by performing three experiments on web videos from the TRECVID multimedia event detection corpus, with zero, one and multiple query examples that beats a recent alternative. }
    }
  33. Chen Sun, Brian Burns, Ram Nevatia, Cees G. M. Snoek, Bob Bolles, Greg Myers, Wen Wang, and Eric Yeh, "ISOMER: Informative Segment Observations for Multimedia Event Recounting," in Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, UK, 2014.
    @INPROCEEDINGS{SunICMR14,
      author = {Chen Sun and Brian Burns and Ram Nevatia and Cees G. M. Snoek and Bob Bolles and Greg Myers and Wen Wang and Eric Yeh},
      title = {ISOMER: Informative Segment Observations for Multimedia Event Recounting},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2014},
      pages = {},
      address = {Glasgow, UK},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sun-informative-segment-icmr14.pdf},
      abstract = { This paper describes a system for multimedia event detection and recounting. The goal is to detect a high level event class in unconstrained web videos and generate event oriented summarization for display to users. For this purpose, we detect informative segments and collect observations for them, leading to our ISOMER system. We combine a large collection of both low level and semantic level visual and audio features for event detection. For event recounting, we propose a novel approach to identify event oriented discriminative video segments and their descriptions with a linear SVM event classifier. User friendly concepts including objects, actions, scenes, speech and optical character recognition are used in generating descriptions. We also develop several mapping and filtering strategies to cope with noisy concept detectors. Our system performed competitively in the TRECVID 2013 Multimedia Event Detection task with near 100,000 videos and was the highest performer in TRECVID 2013 Multimedia Event Recounting task. }
    }
  34. Efstratios Gavves, Basura Fernando, Cees G. M. Snoek, Arnold W. M. Smeulders, and Tinne Tuytelaars, "Fine-Grained Categorization by Alignments," in Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
    @INPROCEEDINGS{GavvesICCV13,
      author = {Efstratios Gavves and Basura Fernando and Cees G. M. Snoek and Arnold W. M. Smeulders and Tinne Tuytelaars},
      title = {Fine-Grained Categorization by Alignments},
      booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},
      pages = {},
      month = {December},
      year = {2013},
      address = {Sydney, Australia},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-fine-grained-alignment-iccv13.pdf},
      abstract = { The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art. }
    }
  35. Zhenyang Li, Efstratios Gavves, Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Codemaps Segment, Classify and Search Objects Locally," in Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
    @INPROCEEDINGS{LiICCV13,
      author = {Zhenyang Li and Efstratios Gavves and Koen E. A. van de Sande and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Codemaps Segment, Classify and Search Objects Locally},
      booktitle = {Proceedings of the {IEEE} International Conference on Computer Vision},
      pages = {},
      month = {December},
      year = {2013},
      address = {Sydney, Australia},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-codemaps-iccv13.pdf},
      abstract = { In this paper we aim for segmentation and classification of objects. We propose codemaps that are a joint formulation of the classification score and the local neighborhood it belongs to in the image. We obtain the codemap by reordering the encoding, pooling and classification steps over lattice elements. Unlike existing linear decompositions, which emphasize only the efficiency benefits for localized search, we make three novel contributions. As a preliminary, we provide a theoretical generalization of the sufficient mathematical conditions under which image encodings and classification become locally decomposable. As a first novelty we introduce l2 normalization for arbitrarily shaped image regions, which is fast enough for semantic segmentation using our Fisher codemaps. Second, using the same lattice across images, we propose kernel pooling which embeds nonlinearities into codemaps for object classification by explicit or approximate feature mappings. Results demonstrate that l2 normalized Fisher codemaps improve the state-of-the-art in semantic segmentation for PASCAL VOC. For object classification the addition of nonlinearities brings us on par with the state-of-the-art, but is 3x faster. Because of the codemaps' inherent efficiency, we can reach significant speed-ups for localized search as well. We exploit the efficiency gain for our third novelty: object segment retrieval using a single query image only. }
    }
  36. Xirong Li and Cees G. M. Snoek, "Classifying Tag Relevance with Relevant Positive and Negative Examples," in Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 2013.
    @INPROCEEDINGS{LiACM13,
      author = {Xirong Li and Cees G. M. Snoek},
      title = {Classifying Tag Relevance with Relevant Positive and Negative Examples},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {October},
      year = {2013},
      pages = {},
      address = {Barcelona, Spain},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-classifying-tag-relevance-mm2013.pdf},
      abstract = { Image tag relevance estimation aims to automatically determine whether what people label about images is factually present in the pictorial content. Different from previous works, which either use only positive examples of a given tag or use positive and random negative examples, we argue the importance of relevant positive and relevant negative examples for tag relevance estimation. We propose a system that selects positive and negative examples, deemed most relevant with respect to the given tag from crowd-annotated images. While applying models for many tags could be cumbersome, our system trains efficient ensembles of Support Vector Machines per tag, enabling fast classification. Experiments on two benchmark sets show that the proposed system compares favorably against five present-day methods. Given extracted visual features, for each image our system can process up to 3,787 tags per second. The new system is both effective and efficient for tag relevance estimation. }
    }
  37. Masoud Mazloom, Amirhossein Habibian, and Cees G. M. Snoek, "Querying for Video Events by Semantic Signatures from Few Examples," in Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 2013.
    @INPROCEEDINGS{MazloomACM13,
      author = {Masoud Mazloom and Amirhossein Habibian and Cees G. M. Snoek},
      title = {Querying for Video Events by Semantic Signatures from Few Examples},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {October},
      year = {2013},
      pages = {},
      address = {Barcelona, Spain},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-query-by-semantic-mm13.pdf},
      abstract = { We aim to query web video for complex events using only a handful of video query examples, where the standard approach learns a ranker from hundreds of examples. We consider a semantic signature representation, consisting of off-the-shelf concept detectors, to capture the variance in semantic appearance of events. Since it is unknown what similarity metric and query fusion to use in such an event retrieval setting, we perform three experiments on unconstrained web videos from the TRECVID event detection task. It reveals that: retrieval with semantic signatures using normalized correlation as similarity metric outperforms a low-level bag-of-words alternative, multiple queries are best combined using late fusion with an average operator, and event retrieval is preferred over event classification when less than eight positive video examples are available. }
    }
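    A minimal sketch of the retrieval setting found best in the entry above: normalized correlation between semantic signatures as the similarity metric, and late fusion of the few query examples with an average operator. Shapes and function names are assumptions made for illustration only.
      import numpy as np

      def normalized_correlation(a, b):
          """Normalized (zero-mean) correlation between two semantic signatures."""
          a = a - a.mean()
          b = b - b.mean()
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

      def rank_collection(query_signatures, video_signatures):
          """query_signatures: (q, d) signatures of the few query examples;
          video_signatures: (m, d) signatures of the collection.
          Returns collection indices sorted from most to least event-like."""
          scores = []
          for v in video_signatures:
              per_query = [normalized_correlation(q, v) for q in query_signatures]
              scores.append(np.mean(per_query))   # late fusion with the average operator
          return np.argsort(-np.array(scores))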
  38. Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek, "Evaluating Sources and Strategies for Learning Video Concepts from Social Media," in International Workshop on Content-Based Multimedia Indexing, Veszprém, Hungary, 2013.
    @INPROCEEDINGS{KordumovaCBMI13,
      author = {Svetlana Kordumova and Xirong Li and Cees G. M. Snoek},
      title = {Evaluating Sources and Strategies for Learning Video Concepts from Social Media},
      booktitle = {International Workshop on Content-Based Multimedia Indexing},
      month = {June},
      year = {2013},
      pages = {},
      address = {Veszpr\'em, Hungary},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-sources-strategies-cbmi2013.pdf},
      abstract = { Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that two key questions, i.e., What visual tagging source is most suited for selecting positive training examples to learn video concepts? and What strategy should be used for selecting positive examples from tagged sources?, remain open. As an initial attempt to answer the two questions, we conduct an experimental study using a video search engine which is capable of learning concept detectors from social media, be it socially tagged videos or socially tagged images. Within the video search engine we investigate six strategies of positive examples selection. The performance is evaluated on the challenging TRECVID benchmark 2011 with 400 hours of Internet videos. The new experiments lead to novel and nontrivial findings: (1) tagged images are a better source for learning video concepts from the web, (2) selecting tag relevant examples as positives for learning video concepts is always beneficial and it can be done automatically and (3) the best source and strategy compare favorably against several present-day methods. }
    }
  39. Amirhossein Habibian, Koen E. A. van de Sande, and Cees G. M. Snoek, "Recommendations for Video Event Recognition Using Concept Vocabularies," in Proceedings of the ACM International Conference on Multimedia Retrieval, Dallas, Texas, USA, 2013, pp. 89-96.
    @INPROCEEDINGS{HabibianICMR13,
      author = {Amirhossein Habibian and Koen E. A. van de Sande and Cees G. M. Snoek},
      title = {Recommendations for Video Event Recognition Using Concept Vocabularies},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2013},
      pages = {89--96},
      address = {Dallas, Texas, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-vocabulary-recommendations-events-icmr2013.pdf},
      data = {http://www.science.uva.nl/research/mediamill/datasets/index.php},
      abstract = { Representing videos using vocabularies composed of concept detectors appears promising for event recognition. While many have recently shown the benefits of concept vocabularies for recognition, the important question of what concepts to include in the vocabulary is ignored. In this paper, we study how to create an effective vocabulary for arbitrary event recognition in web video. We consider four research questions related to the number, the type, the specificity and the quality of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1,346 concept detectors trained on publicly available annotations, a dataset containing 13,274 web videos from the Multimedia Event Detection benchmark, 25 event groundtruth definitions, and a state-of-the-art event recognition pipeline allows us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should i) contain more than 200 concepts, ii) be diverse by covering object, action, scene, people, animal and attribute concepts, iii) include both general and specific concepts, and iv) increase the number of concepts rather than improve the quality of the individual detectors. We consider the recommendations for video event recognition using concept vocabularies the most important contribution of the paper, as they provide guidelines for future work. }
    }
  40. Masoud Mazloom, Efstratios Gavves, Koen E. A. van de Sande, and Cees G. M. Snoek, "Searching Informative Concept Banks for Video Event Detection," in Proceedings of the ACM International Conference on Multimedia Retrieval, Dallas, Texas, USA, 2013, pp. 255-262.
    @INPROCEEDINGS{MazloomICMR13,
      author = {Masoud Mazloom and Efstratios Gavves and Koen E. A. van de Sande and Cees G. M. Snoek},
      title = {Searching Informative Concept Banks for Video Event Detection},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2013},
      pages = {255--262},
      address = {Dallas, Texas, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-concept-banks-icmr2013.pdf},
      abstract = { An emerging trend in video event detection is to learn an event from a bank of concept detector scores. Different from existing work, which simply relies on a bank containing all available detectors, we propose in this paper an algorithm that learns from examples what concepts in a bank are most informative per event. We model finding this bank of informative concepts out of a large set of concept detectors as a rare event search. Our proposed approximate solution finds the optimal concept bank using a cross-entropy optimization. We study the behavior of video event detection based on a bank of informative concepts by performing three experiments on more than 1,000 hours of arbitrary internet video from the TRECVID multimedia event detection task. Starting from a concept bank of 1,346 detectors we show that 1.) some concept banks are more informative than others for specific events, 2.) event detection using an automatically obtained informative concept bank is more robust than using all available concepts, 3.) even for small amounts of training examples an informative concept bank outperforms a full bank and a bag-of-word event representation, and 4.) we show qualitatively that the informative concept banks make sense for the events of interest, without being programmed to do so. We conclude that for concept banks it pays to be informative. }
    }
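    The cross-entropy optimization referred to above can be sketched generically: keep a Bernoulli inclusion probability per concept detector, sample candidate banks, score them with some event detection fitness, and move the probabilities toward the best samples. The fitness function, sample sizes and smoothing factor below are illustrative assumptions, not the paper's exact protocol.
      import numpy as np

      def cross_entropy_select(n_concepts, fitness, iters=30, n_samples=200,
                               elite_frac=0.1, smooth=0.7, seed=0):
          """fitness(mask) -> float scores a boolean concept subset; higher is better."""
          rng = np.random.default_rng(seed)
          p = np.full(n_concepts, 0.5)                # Bernoulli inclusion probabilities
          n_elite = max(1, int(elite_frac * n_samples))
          for _ in range(iters):
              masks = rng.random((n_samples, n_concepts)) < p     # sample candidate banks
              scores = np.array([fitness(m) for m in masks])
              elite = masks[np.argsort(-scores)[:n_elite]]        # keep the best banks
              p = smooth * p + (1 - smooth) * elite.mean(axis=0)  # shift toward the elites
          return p > 0.5                              # final informative-concept mask

      # Toy check: pretend 10 of 100 detectors are informative and reward small banks.
      informative = np.zeros(100, dtype=bool)
      informative[:10] = True
      bank = cross_entropy_select(100, lambda m: (m & informative).sum() - 0.1 * m.sum())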
  41. Davide Modolo and Cees G. M. Snoek, "Can Object Detectors Aid Internet Video Event Retrieval?," in Proceedings of the IS&T/SPIE Symposium on Electronic Imaging, San Francisco, CA, USA, 2013.
    @INPROCEEDINGS{ModoloSPIE13,
      author = {Davide Modolo and Cees G. M. Snoek},
      title = {Can Object Detectors Aid Internet Video Event Retrieval?},
      booktitle = {Proceedings of the IS\&T/SPIE Symposium on Electronic Imaging},
      pages = {},
      month = {February},
      year = {2013},
      address = {San Francisco, CA, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/modolo-object-event-spie2013.pdf},
      abstract = { The problem of event representation for automatic event detection in Internet videos is acquiring increasing importance, due to its applicability to a large number of applications. Existing methods focus on representing events in terms of either low-level descriptors or domain-specific models suited for a limited class of video only, ignoring the high-level meaning of the events. Ultimately aiming for a more robust and meaningful representation, in this paper we question whether object detectors can aid video event retrieval. We propose an experimental study that investigates the utility of present-day local and global object detectors for video event search. By evaluating object detectors optimized for high-quality photographs on low-quality Internet video, we establish that present-day detectors can successfully be used for recognizing objects in web videos. We use an object-based representation to re-rank the results of an appearance-based event detector. Results on the challenging TRECVID multimedia event detection corpus demonstrate that objects can indeed aid event retrieval. While much remains to be studied, we believe that our experimental study is a first step towards revealing the potential of object-based event representations. }
    }
  42. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Convex Reduction of High-Dimensional Kernels for Visual Classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, Rhode Island, USA, 2012.
    @INPROCEEDINGS{GavvesCVPR12,
      author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Convex Reduction of High-Dimensional Kernels for Visual Classification},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
      pages = {},
      month = {June},
      year = {2012},
      address = {Providence, Rhode Island, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-convex-kernel-cvpr2012.pdf},
      abstract = { Limiting factors of fast and effective classifiers for large sets of images are their dependence on the number of images analyzed and the dimensionality of the image representation. Considering the growing number of images as a given, we aim to reduce the image feature dimensionality in this paper. We propose reduced linear kernels that use only a portion of the dimensions to reconstruct a linear kernel. We formulate the search for these dimensions as a convex optimization problem, which can be solved efficiently. Different from existing kernel reduction methods, our reduced kernels are faster and maintain the accuracy benefits from non-linear embedding methods that mimic non-linear SVMs. We show these properties on both the Scenes and PASCAL VOC 2007 datasets. In addition, we demonstrate how our reduced kernels allow us to compress Fisher vectors for use with non-linear embeddings, leading to high accuracy. What is more, without using any labeled examples the selected and weighted kernel dimensions appear to correspond to visually meaningful patches in the images. }
    }
  43. Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Fusing Concept Detection and Geo Context for Visual Search," in Proceedings of the ACM International Conference on Multimedia Retrieval, Hong Kong, China, 2012.
    Best paper runner-up
    @INPROCEEDINGS{LiICMR12,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},
      title = {Fusing Concept Detection and Geo Context for Visual Search},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2012},
      pages = {},
      address = {Hong Kong, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-geo-context-icmr2012.pdf},
      note = {Best paper runner-up},
      abstract = { Given the proliferation of geo-tagged images, the question of how to exploit geo tags and the underlying geo context for visual search is emerging. Based on the observation that the importance of geo context varies over concepts, we propose a concept-based image search engine which fuses visual concept detection and geo context in a concept-dependent manner. Compared to individual content-based and geo-based concept detectors and their uniform combination, concept-dependent fusion shows improvements. Moreover, since the proposed search engine is trained on social-tagged images alone without the need of human interaction, it is flexible to cope with many concepts. Search experiments on 101 popular visual concepts justify the viability of the proposed solution. In particular, for 79 out of the 101 concepts, the learned weights yield improvements over the uniform weights, with a relative gain of at least 5\% in terms of average precision. }
    }
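    The concept-dependent fusion idea above reduces, in its simplest reading, to a per-concept weighted combination of a content-based score and a geo-based score. The linear form and the example weights below are assumptions made for illustration; the paper learns such weights from social-tagged images.
      def fuse_scores(visual_score, geo_score, weight):
          """Combine visual and geo evidence with a concept-specific weight in [0, 1]."""
          return weight * visual_score + (1.0 - weight) * geo_score

      # Hypothetical concept-dependent weights: geo context matters a lot for "beach",
      # hardly at all for "dog".
      concept_weights = {"beach": 0.4, "dog": 0.95}
      score = fuse_scores(visual_score=0.7, geo_score=0.9, weight=concept_weights["beach"])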
  44. Daan T. J. Vreeswijk, Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders, "All Vehicles are Cars: Subclass Preferences in Container Concepts," in Proceedings of the ACM International Conference on Multimedia Retrieval, Hong Kong, China, 2012.
    @INPROCEEDINGS{VreeswijkICMR12,
      author = {Daan T. J. Vreeswijk and Koen E. A. van de Sande and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {All Vehicles are Cars: Subclass Preferences in Container Concepts},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {June},
      year = {2012},
      pages = {},
      address = {Hong Kong, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/vreeswijk-vehicles-are-cars-icmr2012.pdf},
      abstract = { This paper investigates the natural bias humans display when labeling images with a container label like vehicle or carnivore. Using three container concepts as subtree root nodes, and all available concepts between these roots and the images from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, we analyze the differences between the images labeled at these varying levels of abstraction and the union of their constituting leaf nodes. We find that for many container concepts, a strong preference for one or a few different constituting leaf nodes occurs. These results indicate that care is needed when using hierarchical knowledge in image classification: if the aim is to classify vehicles the way humans do, then cars and buses may be the only correct results. }
    }
  45. Bauke Freiburg, Jaap Kamps, and Cees G. M. Snoek, "Crowdsourcing Visual Detectors for Video Search," in Proceedings of the ACM International Conference on Multimedia, Scottsdale, AZ, USA, 2011.
    @INPROCEEDINGS{FreiburgACM11,
      author = {Bauke Freiburg and Jaap Kamps and Cees G. M. Snoek},
      title = {Crowdsourcing Visual Detectors for Video Search},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {December},
      year = {2011},
      pages = {},
      address = {Scottsdale, AZ, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/freiburg-crowdsourcing-acm2011.pdf},
      abstract = { In this paper, we study social tagging at the video fragment-level using a combination of automated content understanding and the wisdom of the crowds. We are interested in the question whether crowdsourcing can be beneficial to a video search engine that automatically recognizes video fragments on a semantic level. To answer this question, we perform a 3-month online field study with a concert video search engine targeted at a dedicated user-community of pop concert enthusiasts. We harvest the feedback of more than 500 active users and perform two experiments. In experiment 1 we measure user incentive to provide feedback, in experiment 2 we determine the tradeoff between feedback quality and quantity when aggregated over multiple users. Results show that users provide sufficient feedback, which becomes highly reliable when a crowd agreement of 67\% is enforced. }
    }
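    One way to read the 67% crowd agreement threshold reported above is as a simple aggregation rule over per-fragment user feedback. The sketch below is such a rule; the names and the handling of undecided cases are assumptions made for illustration.
      def aggregate_feedback(user_votes, agreement=0.67):
          """user_votes: list of booleans, one per user, whether a fragment shows the concept.
          Returns True/False when the crowd agrees strongly enough, else None."""
          if not user_votes:
              return None                       # no feedback collected yet
          positive = sum(user_votes) / len(user_votes)
          if positive >= agreement:
              return True
          if (1.0 - positive) >= agreement:
              return False
          return None                           # agreement below the enforced threshold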
  46. Xirong Li, Efstratios Gavves, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Personalizing Automated Image Annotation using Cross-Entropy," in Proceedings of the ACM International Conference on Multimedia, Scottsdale, AZ, USA, 2011.
    @INPROCEEDINGS{LiACM11,
      author = {Xirong Li and Efstratios Gavves and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},
      title = {Personalizing Automated Image Annotation using Cross-Entropy},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {December},
      year = {2011},
      pages = {},
      address = {Scottsdale, AZ, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-personalized-acm2011.pdf},
      abstract = { Annotating the increasing amounts of user-contributed images in a personalized manner is in great demand. However, this demand is largely ignored by the mainstream of automated image annotation research. In this paper we aim for personalizing automated image annotation by jointly exploiting personalized tag statistics and content-based image annotation. We propose a cross-entropy based learning algorithm which personalizes a generic annotation model by learning from a user's multimedia tagging history. Using cross-entropy-minimization based Monte Carlo sampling, the proposed algorithm optimizes the personalization process in terms of a performance measurement which can be flexibly chosen. Automatic image annotation experiments with 5,315 realistic users in the social web show that the proposed method compares favorably to a generic image annotation method and a method using personalized tag statistics only. For 4,442 users the performance improves, while for 1,088 users the absolute performance gain is at least 0.05 in terms of average precision. The results show the value of the proposed method. }
    }
  47. Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Social Negative Bootstrapping for Visual Categorization," in Proceedings of the ACM International Conference on Multimedia Retrieval, Trento, Italy, 2011.
    @INPROCEEDINGS{LiICMR11,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},
      title = {Social Negative Bootstrapping for Visual Categorization},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Retrieval},
      month = {April},
      year = {2011},
      pages = {},
      address = {Trento, Italy},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-social-negative-icmr2011.pdf},
      abstract = { To learn classifiers for many visual categories, obtaining labeled training examples in an efficient way is crucial. Since a classifier tends to misclassify negative examples which are visually similar to positive examples, inclusion of such informative negatives should be stressed in the learning process. However, they are unlikely to be hit by random sampling, the de facto standard in literature. In this paper, we go beyond random sampling by introducing a novel social negative bootstrapping approach. Given a visual category and a few positive examples, the proposed approach adaptively and iteratively harvests informative negatives from a large amount of social-tagged images. To label negative examples without human interaction, we design an effective virtual labeling procedure based on simple tag reasoning. Virtual labeling, in combination with adaptive sampling, enables us to select the most misclassified negatives as the informative samples. Learning from the positive set and the informative negative sets results in visual classifiers with higher accuracy. Experiments on two present-day image benchmarks employing 650K virtually labeled negative examples show the viability of the proposed approach. On a popular visual categorization benchmark our precision at 20 increases by 34\%, compared to baselines trained on randomly sampled negatives. We achieve more accurate visual categorization without the need of manually labeling any negatives. }
    }
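    The adaptive harvesting of informative negatives described above can be outlined as an iterative loop: train a classifier, score a large pool of virtually labeled social negatives, and add the most misclassified ones to the training set. The classifier choice, pool handling and batch size below are illustrative assumptions, not the paper's exact setup.
      import numpy as np
      from sklearn.svm import LinearSVC

      def negative_bootstrap(pos_feats, neg_pool, rounds=5, batch=100, seed=0):
          """pos_feats: (p, d) positive features; neg_pool: (n, d) virtually labeled negatives."""
          rng = np.random.default_rng(seed)
          negatives = neg_pool[rng.choice(len(neg_pool), batch, replace=False)]  # random start
          clf = None
          for _ in range(rounds):
              X = np.vstack([pos_feats, negatives])
              y = np.r_[np.ones(len(pos_feats)), np.zeros(len(negatives))]
              clf = LinearSVC().fit(X, y)
              scores = clf.decision_function(neg_pool)          # higher = more "positive-looking"
              hardest = neg_pool[np.argsort(-scores)[:batch]]   # most misclassified negatives
              negatives = np.vstack([negatives, hardest])       # grow the informative negative set
          return clf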
  48. Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin, "Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval," in International Conference on MultiMedia Modeling, Taipei, Taiwan, 2011.
    @INPROCEEDINGS{HurstMMM11,
      author = {Wolfgang H\"urst and Cees G. M. Snoek and Willem-Jan Spoel and Mate Tomin},
      title = {Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval},
      booktitle = {International Conference on MultiMedia Modeling},
      month = {January},
      year = {2011},
      pages = {},
      address = {Taipei, Taiwan},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huerst-size-matters-mmm2011.pdf},
      demo = {http://vimeo.com/19595895},
      abstract = { Various interfaces for video browsing and retrieval have been proposed that provide improved usability, better retrieval performance, and richer user experience compared to simple result lists that are just sorted by relevance. These browsing interfaces take advantage of the rather large screen real estate on desktop and laptop PCs to visualize advanced configurations of thumbnails summarizing the video content. Naturally, the usefulness of such screen-intensive visual browsers can be called into question when applied on small mobile handheld devices, such as smart phones. In this paper, we address the usefulness of thumbnail images for mobile video retrieval interfaces. In particular, we investigate how thumbnail number, size, and motion influence the performance of humans in common recognition tasks. Contrary to the widespread belief that screens of handheld devices are unsuited for visualizing multiple (small) thumbnails simultaneously, our study shows that users are quite able to handle and assess multiple small thumbnails at the same time, especially when they show moving images. Our results give suggestions for appropriate video retrieval interface designs on handheld devices. }
    }
  49. Efstratios Gavves and Cees G. M. Snoek, "Landmark Image Retrieval Using Visual Synonyms," in Proceedings of the ACM International Conference on Multimedia, Firenze, Italy, 2010.
    @INPROCEEDINGS{GavvesACM10,
      author = {Efstratios Gavves and Cees G. M. Snoek},
      title = {Landmark Image Retrieval Using Visual Synonyms},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {October},
      year = {2010},
      pages = {},
      address = {Firenze, Italy},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-synonyms-acm10.pdf},
      abstract = { In this paper, we consider the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which performs assignment of words based solely on closeness in descriptor space, we focus on identifying pairs of independent, distant words -- the visual synonyms -- that are still likely to host image patches with similar appearance. To study this problem, we focus on landmark images, where we can examine whether image geometry is an appropriate vehicle for detecting visual synonyms. We propose an algorithm for the extraction of visual synonyms in landmark images. To show the merit of visual synonyms, we perform two experiments. We examine closeness of synonyms in descriptor space and we show a first application of visual synonyms in a landmark image retrieval setting. Using visual synonyms, we perform on par with the state-of-the-art, but with six times less visual words. }
    }
  50. Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin, "Keep Moving! Revisiting Thumbnails for Mobile Video Retrieval," in Proceedings of the ACM International Conference on Multimedia, Firenze, Italy, 2010.
    @INPROCEEDINGS{HurstACM10,
      author = {Wolfgang H\"urst and Cees G. M. Snoek and Willem-Jan Spoel and Mate Tomin},
      title = {Keep Moving! Revisiting Thumbnails for Mobile Video Retrieval},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      month = {October},
      year = {2010},
      pages = {},
      address = {Firenze, Italy},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huerst-keep-moving-acm2010.pdf},
      demo = {http://vimeo.com/19595895},
      abstract = { Motivated by the increasing popularity of video on handheld devices and the resulting importance for effective video retrieval, this paper revisits the relevance of thumbnails in a mobile video retrieval setting. Our study indicates that users are quite able to handle and assess small thumbnails on a mobile's screen -- especially with moving images -- suggesting promising avenues for future research in design of mobile video retrieval interfaces. }
    }
  51. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Accelerating Visual Categorization with the GPU," in ECCV Workshop on Computer Vision on GPU, Crete, Greece, 2010.
    @INPROCEEDINGS{SandeCVGPU10,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {Accelerating Visual Categorization with the {GPU}},
      booktitle = {{ECCV} Workshop on Computer Vision on {GPU}},
      pages = {},
      month = {September},
      year = {2010},
      address = {Crete, Greece},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-accelerating-categorization-CVGPU2010.pdf},
      abstract = { Visual categorization is important to manage large collections of digital images and video, where textual meta-data is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to keep categorization accuracy intact and give the same numerical results. In the experiments on large scale datasets it is shown that, by using a parallel implementation on the GPU, quantization is 28 times faster and classification is 35 times faster than a single-threaded CPU version, while giving the exact same numerical results. The GPU accelerations are applicable to both the learning phase and the testing phase of visual categorization systems. For software visit http://www.colordescriptors.com/. }
    }
  52. Bouke Huurnink, Cees G. M. Snoek, Maarten de Rijke, and Arnold W. M. Smeulders, "Today’s and Tomorrow’s Retrieval Practice in the Audiovisual Archive," in Proceedings of the ACM International Conference on Image and Video Retrieval, Xi’an, China, 2010, pp. 18-25.
    Best paper runner-up
    @INPROCEEDINGS{HuurninkCIVR10,
      author = {Bouke Huurnink and Cees G. M. Snoek and Maarten {de Rijke} and Arnold W. M. Smeulders},
      title = {Today's and Tomorrow's Retrieval Practice in the Audiovisual Archive},
      booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},
      pages = {18--25},
      month = {July},
      year = {2010},
      address = {Xi'an, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huurnink-archive-civr2010.pdf},
      data = {http://ilps.science.uva.nl/resources/avarchive},
      note = {Best paper runner-up},
      abstract = { Content-based video retrieval is maturing to the point where it can be used in real-world retrieval practices. One such practice is the audiovisual archive, whose users increasingly require fine-grained access to broadcast television content. We investigate to what extent content-based video retrieval methods can improve search in the audiovisual archive. In particular, we propose an evaluation methodology tailored to the specific needs and circumstances of the audiovisual archive, which are typically missed by existing evaluation initiatives. We utilize logged searches and content purchases from an existing audiovisual archive to create realistic query sets and relevance judgments. To reflect the retrieval practice of both the archive and the video retrieval community as closely as possible, our experiments with three video search engines incorporate archive-created catalog entries as well as state-of-the-art multimedia content analysis results. We find that incorporating content-based video retrieval into the archive's practice results in significant performance increases for shot retrieval and for retrieving entire television programs. Our experiments also indicate that individual content-based retrieval methods yield approximately equal performance gains. We conclude that the time has come for audiovisual archives to start accommodating content-based video retrieval methods into their daily practice. }
    }
  53. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Unsupervised Multi-Feature Tag Relevance Learning for Social Image Retrieval," in Proceedings of the ACM International Conference on Image and Video Retrieval, Xi’an, China, 2010, pp. 10-17.
    Best paper award
    @INPROCEEDINGS{LiCIVR10,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},
      title = {Unsupervised Multi-Feature Tag Relevance Learning for Social Image Retrieval},
      booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},
      pages = {10--17},
      month = {July},
      year = {2010},
      address = {Xi'an, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-multifeature-civr10.pdf},
      note = {Best paper award},
      abstract = { Interpreting the relevance of a user-contributed tag with respect to the visual content of an image is an emerging problem in social image retrieval. In the literature this problem is tackled by analyzing the correlation between tags and images represented by specific visual features. Unfortunately, no single feature represents the visual content completely, e.g., global features are suitable for capturing the gist of scenes, while local features are better for depicting objects. To solve the problem of learning tag relevance given multiple features, we introduce in this paper two simple and effective methods: one is based on the classical Borda Count and the other is a method we name UniformTagger. Both methods combine the output of many tag relevance learners driven by diverse features in an unsupervised, rather than supervised, manner. Experiments on 3.5 million social-tagged images and two test sets verify our proposal. Using learned tag relevance as updated tag frequency for social image retrieval, both Borda Count and UniformTagger outperform retrieval without tag relevance learning and retrieval with single-feature tag relevance learning. Moreover, the two unsupervised methods are comparable to a state-of-the-art supervised alternative, but without the need of any training data. }
    }
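    The Borda Count combination referred to above can be sketched in a few lines: each single-feature tag relevance learner ranks the candidate images, rank positions are converted into points, and points are summed across learners. Function and variable names are illustrative.
      def borda_fuse(rankings):
          """rankings: list of image-id orderings, one per feature-specific learner,
          each from most to least relevant. Returns ids sorted by total Borda points."""
          points = {}
          for ranking in rankings:
              n = len(ranking)
              for position, image_id in enumerate(ranking):
                  points[image_id] = points.get(image_id, 0) + (n - position)
          return sorted(points, key=points.get, reverse=True)

      # Three hypothetical learners ranking the same four images:
      fused = borda_fuse([["a", "b", "c", "d"],
                          ["b", "a", "d", "c"],
                          ["a", "c", "b", "d"]])   # -> ['a', 'b', 'c', 'd']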
  54. Xirong Li and Cees G. M. Snoek, "Visual Categorization with Negative Examples for Free," in Proceedings of the ACM International Conference on Multimedia, Beijing, China, 2009.
    @INPROCEEDINGS{LiACM09,
      author = {Xirong Li and Cees G. M. Snoek},
      title = {Visual Categorization with Negative Examples for Free},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {},
      month = {October},
      year = {2009},
      address = {Beijing, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-negative-for-free-acm2009.pdf},
      data = {http://staff.science.uva.nl/~xirong/neg4free/},
      abstract = { Automatic visual categorization is critically dependent on labeled examples for supervised learning. As an alternative to traditional expert labeling, social-tagged multimedia is becoming a novel yet subjective and inaccurate source of learning examples. Different from existing work focusing on collecting positive examples, we study in this paper the potential of substituting social tagging for expert labeling for creating negative examples. We present an empirical study using 6.5 million Flickr photos as a source of social tagging. Our experiments on the PASCAL VOC challenge 2008 show that with a relative loss of only 4.3\% in terms of mean average precision, expert-labeled negative examples can be completely replaced by social-tagged negative examples for consumer photo categorization. }
    }
  55. Arjan T. Setz and Cees G. M. Snoek, "Can Social Tagged Images Aid Concept-Based Video Search?," in Proceedings of the IEEE International Conference on Multimedia & Expo, New York, NY, USA, 2009, pp. 1460-1463.
    @INPROCEEDINGS{SetzICME09,
      author = {Arjan T. Setz and Cees G. M. Snoek},
      title = {Can Social Tagged Images Aid Concept-Based Video Search?},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {1460--1463},
      month = {June--July},
      year = {2009},
      address = {New York, NY, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/setz-social-tags-icme2009.pdf},
      abstract = { This paper seeks to unravel whether commonly available social tagged images can be exploited as a training resource for concept-based video search. Since social tags are known to be ambiguous, overly personalized, and often error prone, we place special emphasis on the role of disambiguation. We present a systematic experimental study that evaluates concept detectors based on social tagged images, and their disambiguated versions, in three application scenarios: within-domain, cross-domain, and together with an interacting user. The results indicate that social tagged images can aid concept-based video search indeed, especially after disambiguation and when used in an interactive video retrieval setting. These results open-up interesting avenues for future research. }
    }
  56. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Annotating Images by Harnessing Worldwide User-Tagged Photos," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, 2009.
    @INPROCEEDINGS{LiICASSP09,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},
      title = {Annotating Images by Harnessing Worldwide User-Tagged Photos},
      booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing},
      pages = {},
      month = {April},
      year = {2009},
      address = {Taipei, Taiwan},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-annotating-images-icassp2009.pdf},
      abstract = { Automatic image tagging is important yet challenging due to the semantic gap and the lack of learning examples to model a tag's visual diversity. Meanwhile, social user tagging is creating rich multimedia content on the web. In this paper, we propose to combine the two tagging approaches in a search-based framework. For an unlabeled image, we first retrieve its visual neighbors from a large user-tagged image database. We then select relevant tags from the result images to annotate the unlabeled image. To tackle the unreliability and sparsity of user tagging, we introduce a joint-modality tag relevance estimation method which efficiently addresses both textual and visual clues. Experiments on 1.5 million Flickr photos and 10 000 Corel images verify the proposed method. }
    }
  57. Daragh Byrne, Aiden R. Doherty, Cees G. M. Snoek, Gareth J. F. Jones, and Alan F. Smeaton, "Validating the Detection of Everyday Concepts in Visual Lifelogs," in Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT 2008, Koblenz, Germany, December 3-5, 2008, pp. 15-30.
    @INPROCEEDINGS{ByrneSAMT08,
      author = {Daragh Byrne and Aiden R. Doherty and Cees G. M. Snoek and Gareth J. F. Jones and Alan F. Smeaton},
      title = {Validating the Detection of Everyday Concepts in Visual Lifelogs},
      booktitle = {Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT 2008, Koblenz, Germany, December 3-5, 2008},
      editor = {David Duke and Lynda Hardman and Alex Hauptmann and Dietrich Paulus and Steffen Staab},
      series = {LNCS},
      volume = {5392},
      pages = {15--30},
      publisher = {Springer-Verlag},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/byrne-everyday-concepts-samt2008.pdf},
      address = {},
      abstract = { The Microsoft SenseCam is a small lightweight wearable camera used to passively capture photos and other sensor readings from a user's day-to-day activities. It can capture up to 3,000 images per day, equating to almost 1 million images per year. It is used to aid memory by creating a personal multimedia lifelog, or visual recording of the wearer's life. However the sheer volume of image data captured within a visual lifelog creates a number of challenges, particularly for locating relevant content. Within this work, we explore the applicability of semantic concept detection, a method often used within video retrieval, on the novel domain of visual lifelogs. A concept detector models the correspondence between low-level visual features and high-level semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised machine learning. By doing so it determines the probability of a concept's presence. We apply detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users. The results were then evaluated on a subset of 95,907 images, to determine the precision for detection of each semantic concept and to draw some interesting inferences on the lifestyles of those 5 users. We additionally present future applications of concept detection within the domain of lifelogging. }
    }
  58. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Learning Tag Relevance by Neighbor Voting for Social Image Retrieval," in Proceedings of the ACM International Conference on Multimedia Information Retrieval, Vancouver, Canada, 2008, pp. 180-187.
    @INPROCEEDINGS{LiMIR08,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},
      title = {Learning Tag Relevance by Neighbor Voting for Social Image Retrieval},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia Information Retrieval},
      pages = {180--187},
      month = {October},
      year = {2008},
      address = {Vancouver, Canada},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-tag-relevance-mir2008.pdf},
      data = {http://staff.science.uva.nl/~xirong/tagrel/},
      abstract = { Social image retrieval is important for exploiting the increasing amounts of amateur-tagged multimedia such as Flickr images. Since amateur tagging is known to be uncontrolled, ambiguous, and personalized, a fundamental problem is how to reliably interpret the relevance of a tag with respect to the visual content it is describing. Intuitively, if different persons label similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Starting from this intuition, we propose a novel algorithm that scalably and reliably learns tag relevance by accumulating votes from visually similar neighbors. Further, treated as tag frequency, learned tag relevance is seamlessly embedded into current tag-based social image retrieval paradigms. Preliminary experiments on one million Flickr images demonstrate the potential of the proposed algorithm. Overall comparisons for both single-word queries and multiple-word queries show substantial improvement over the baseline by learning and using tag relevance. Specifically, compared with the baseline using the original tags, on average, retrieval using improved tags increases mean average precision by 24\%, from 0.54 to 0.67. Moreover, simulated experiments indicate that performance can be improved further by scaling up the amount of images used in the proposed neighbor voting algorithm. }
    }
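    The neighbor voting idea above admits a compact sketch: count how often a tag occurs among an image's k visual neighbors and discount the tag's prior frequency in the collection, so that generic tags do not dominate. The exact discounting of the paper is not reproduced here; this is one plausible, assumed form.
      def tag_relevance(tag, neighbor_tag_sets, collection_tag_sets):
          """neighbor_tag_sets: tag sets of the k visually nearest images;
          collection_tag_sets: tag sets of the whole collection (for the prior)."""
          k = len(neighbor_tag_sets)
          votes = sum(tag in tags for tags in neighbor_tag_sets)          # neighbor votes
          prior = sum(tag in tags for tags in collection_tag_sets) / len(collection_tag_sets)
          return votes - k * prior          # more votes than expected by chance -> relevant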
  59. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Balancing Thread Based Navigation for Targeted Video Search," in Proceedings of the ACM International Conference on Image and Video Retrieval, Niagara Falls, Canada, 2008, pp. 485-494.
    @INPROCEEDINGS{RooijCIVR08,
      author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},
      title = {Balancing Thread Based Navigation for Targeted Video Search},
      booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},
      pages = {485--494},
      month = {July},
      year = {2008},
      address = {Niagara Falls, Canada},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/rooij-thread-based-navigation-civr2008.pdf},
      abstract = { Various query methods for video search exist. Because of the semantic gap each method has its limitations. We argue that for effective retrieval query methods need to be combined at retrieval time. However, switching query methods often involves a change in query and browsing interface, which puts a heavy burden on the user. In this paper, we propose a novel method for fast and effective search through large video collections by embedding multiple query methods into a single browsing environment. To that end we introduce the notion of query threads, which contain a shot-based ranking of the video collection according to some feature-based similarity measure. On top of these threads we define several thread-based visualizations, ranging from fast targeted search to very broad exploratory search, with the ForkBrowser as the balance between fast search and video space exploration. We compare the effectiveness and efficiency of the ForkBrowser with the CrossBrowser on the TRECVID 2007 interactive search task. Results show that different query methods are needed for different types of search topics, and that the ForkBrowser requires significantly fewer user interactions to achieve the same result as the CrossBrowser. In addition, both browsers rank among the best interactive retrieval systems currently available. }
    }
  60. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "A Comparison of Color Features for Visual Concept Classification," in Proceedings of the ACM International Conference on Image and Video Retrieval, Niagara Falls, Canada, 2008, pp. 141-149.
    @INPROCEEDINGS{SandeCIVR08,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {A Comparison of Color Features for Visual Concept Classification},
      booktitle = {Proceedings of the {ACM} International Conference on Image and Video Retrieval},
      pages = {141--149},
      month = {July},
      year = {2008},
      address = {Niagara Falls, Canada},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-colorfeatures-civr2008.pdf},
      software = {http://staff.science.uva.nl/~ksande/research/colordescriptors/},
      abstract = { Concept classification is important to access visual information on the level of objects and scene types. So far, intensity-based features have been widely used. To increase discriminative power, color features have been proposed only recently. As many features exist, a structured overview is required of color features in the context of concept classification. Therefore, this paper studies 1. the invariance properties and 2. the distinctiveness of color features in a structured way. The invariance properties of color features with respect to photometric changes are summarized. The distinctiveness of color features is assessed experimentally using an image and a video benchmark: the PASCAL VOC Challenge 2007 and the Mediamill Challenge. Because color features cannot be studied independently from the points at which they are extracted, different point sampling strategies based on Harris-Laplace salient points, dense sampling and the spatial pyramid are also studied. From the experimental results, it can be derived that invariance to light intensity changes and light color changes affects concept classification. The results reveal further that the usefulness of invariance is concept-specific. }
    }
  61. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Evaluation of Color Descriptors for Object and Scene Recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, 2008.
    @INPROCEEDINGS{SandeCVPR08,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {Evaluation of Color Descriptors for Object and Scene Recognition},
      booktitle = {Proceedings of the {IEEE} Computer Society Conference on Computer Vision and Pattern Recognition},
      pages = {},
      month = {June},
      year = {2008},
      address = {Anchorage, Alaska, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-colordescriptors-cvpr2008.pdf},
      software = {http://staff.science.uva.nl/~ksande/research/colordescriptors/},
      abstract = { Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used. To increase illumination invariance and discriminative power, color descriptors have been proposed only recently. As many descriptors exist, a structured overview of color invariant descriptors in the context of image category recognition is required. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors in a structured way. The invariance properties of color descriptors are shown analytically using a taxonomy based on invariance properties with respect to photometric transformations. The distinctiveness of color descriptors is assessed experimentally using two benchmarks from the image domain and the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results reveal further that, for light intensity changes, the usefulness of invariance is category-specific. }
    }
  62. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Color Descriptors for Object Category Recognition," in Proceedings of the IS&T European Conference on Colour in Graphics, Imaging, and Vision, Terrassa-Barcelona, Spain, 2008.
    @INPROCEEDINGS{SandeCGIV08,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {Color Descriptors for Object Category Recognition},
      booktitle = {Proceedings of the {IS\&T} European Conference on Colour in Graphics, Imaging, and Vision},
      pages = {},
      month = {June},
      year = {2008},
      address = {Terrassa-Barcelona, Spain},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-color-descriptors-cgiv2008.pdf},
      abstract = { Category recognition is important to access visual information on the level of objects. A common approach is to compute image descriptors first and then to apply machine learning to achieve category recognition from annotated examples. As a consequence, the choice of image descriptors is of great influence on the recognition accuracy. So far, intensity-based (e.g. SIFT) descriptors computed at salient points have been used. However, color has been largely ignored. The question is, can color information improve accuracy of category recognition? Therefore, in this paper, we will extend both salient point detection and region description with color information. The extension of color descriptors is integrated into the framework of category recognition enabling to select both intensity and color variants. Our experiments on an image benchmark show that category recognition benefits from the use of color. Moreover, the combination of intensity and color descriptors yields a 30\% improvement over intensity features alone. }
    }
  63. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Query on Demand Video Browsing," in Proceedings of the ACM International Conference on Multimedia, Augsburg, Germany, 2007, pp. 811-814.
    @INPROCEEDINGS{RooijACM07,
      author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},
      title = {Query on Demand Video Browsing},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {811--814},
      month = {September},
      year = {2007},
      address = {Augsburg, Germany},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/rooij-rotor-acm2007.pdf},
      abstract = { This paper describes a novel method for browsing a large collection of news video by linking various forms of related video fragments together as threads. Each thread contains a sequence of shots with high feature-based similarity. Two interfaces are designed which use threads as the basis for browsing. One interface shows a minimal set of threads, and the other as many as possible. Both interfaces are evaluated in the TRECVID interactive retrieval task, where they ranked among the best interactive retrieval systems currently available. The results indicate that the use of threads in interactive video search is very beneficial. We have found that in general the query result and the timeline are the most important threads. However, having several additional threads allows a user to find unique results which cannot easily be found by using query results and time alone. }
    }
  64. Arnold W. M. Smeulders, Jan C. van Gemert, Bouke Huurnink, Dennis C. Koelma, Ork de Rooij, Koen E. A. van de Sande, Cees G. M. Snoek, Cor J. Veenman, and Marcel Worring, "Semantic Video Search," in International Conference on Image Analysis and Processing, Modena, Italy, 2007.
    @INPROCEEDINGS{SmeuldersICIAP07,
      author = {Arnold W. M. Smeulders and Jan C. van Gemert and Bouke Huurnink and Dennis C. Koelma and Ork de Rooij and Koen E. A. van de Sande and Cees G. M. Snoek and Cor J. Veenman and Marcel Worring},
      title = {Semantic Video Search},
      booktitle = {International Conference on Image Analysis and Processing},
      pages = {},
      month = {September},
      year = {2007},
      address = {Modena, Italy},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/smeulders-search-iciap2007.pdf},
      abstract = { In this paper we describe the current performance of our MediaMill system as presented in the TRECVID 2006 benchmark for video search engines. The MediaMill team participated in two tasks: concept detection and search. For concept detection we use the MediaMill Challenge as experimental platform. The MediaMill Challenge divides the generic video indexing problem into a visual-only, textual-only, early fusion, late fusion, and combined analysis experiment. We provide a baseline implementation for each experiment together with baseline results. We extract image features, on global, regional, and keypoint level, which we combine with various supervised learners. A late fusion approach of visual-only analysis methods using geometric mean was our most successful run. With this run we conquer the Challenge baseline by more than 50\%. Our concept detection experiments have resulted in the best score for three concepts: i.e. \emph{desert}, \emph{flag us}, and \emph{charts}. What is more, using LSCOM annotations, our visual-only approach generalizes well to a set of 491 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which allows users to select relevant concept detectors based on interactive browsing using advanced visualizations. Similar to previous years our best interactive search runs yield top performance, ranking 2nd and 6th overall. }
    }
  65. Cees G. M. Snoek, Marcel Worring, Arnold W. M. Smeulders, and Bauke Freiburg, "The Role of Visual Content and Style for Concert Video Indexing," in Proceedings of the IEEE International Conference on Multimedia & Expo, Beijing, China, 2007, pp. 252-255.
    @INPROCEEDINGS{SnoekICME07b,
      author = {Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders and Bauke Freiburg},
      title = {The Role of Visual Content and Style for Concert Video Indexing},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {252--255},
      month = {July},
      year = {2007},
      address = {Beijing, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-fabchannel-icme2007.pdf},
      abstract = { This paper contributes to the automatic indexing of concert video. In contrast to traditional methods, which rely primarily on audio information for summarization applications, we explore how a visual-only concept detection approach could be employed. We investigate how our recent method for news video indexing -- which takes into account the role of content and style -- generalizes to the concert domain. We analyze concert video on three levels of visual abstraction, namely: content, style, and their fusion. Experiments with 12 concept detectors, on 45 hours of visually challenging concert video, show that the automatically learned best approach is concept-dependent. Moreover, these results suggest that the visual modality provides ample opportunity for more effective indexing and retrieval of concert video when used in addition to the auditory modality. }
    }
  66. Cees G. M. Snoek and Marcel Worring, "Are Concept Detector Lexicons Effective for Video Search?," in Proceedings of the IEEE International Conference on Multimedia & Expo, Beijing, China, 2007, pp. 1966-1969.
    @INPROCEEDINGS{SnoekICME07a,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Are Concept Detector Lexicons Effective for Video Search?},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {1966--1969},
      month = {July},
      year = {2007},
      address = {Beijing, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-concept-icme2007.pdf},
      abstract = { Until now, systematic studies on the effectiveness of concept detectors for video search have been carried out using less than 20 detectors, or in combination with other retrieval techniques. We investigate whether video search using just large concept detector lexicons is a viable alternative for present day approaches. We demonstrate that increasing the number of concept detectors in a lexicon yields improved video retrieval performance indeed. In addition, we show that combining concept detectors at query time has the potential to boost performance further. We obtain the experimental evidence on the automatic video search task of TRECVID 2005 using 363 machine learned concept detectors. }
    }
  67. Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang P. Nguyen, and Arnold W. M. Smeulders, "The MediaMill Semantic Video Search Engine," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Honolulu, Hawaii, USA, 2007, pp. 1213-1216.
    @INPROCEEDINGS{WorringICASSP07,
      author = {Marcel Worring and Cees G. M. Snoek and Ork de Rooij and Giang P. Nguyen and Arnold W. M. Smeulders},
      title = {The {MediaMill} Semantic Video Search Engine},
      booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing},
      volume = {4},
      pages = {1213--1216},
      month = {April},
      year = {2007},
      address = {Honolulu, Hawaii, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/worring-mediamill-icassp2007.pdf},
      abstract = { In this paper we present the methods underlying the MediaMill semantic video search engine. The basis for the engine is a semantic indexing process which is currently based on a lexicon of 491 concept detectors. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. We compare the different browsers and their utility within the TRECVID benchmark. In 2005, we obtained a top-3 result for 19 out of 24 search topics; in 2006, for 14 out of 24. }
    }
  68. Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W. M. Smeulders, "The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia," in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, 2006, pp. 421-430.
    @INPROCEEDINGS{SnoekACM06,
      author = {Cees G. M. Snoek and Marcel Worring and Jan C. van Gemert and Jan-Mark Geusebroek and Arnold W. M. Smeulders},
      title = {The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {421--430},
      month = {October},
      year = {2006},
      address = {Santa Barbara, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-challenge-acm2006.pdf},
      data = {http://www.mediamill.nl/challenge/},
      abstract = { We introduce the challenge problem for generic video indexing to gain insight in intermediate steps that affect performance of multimedia analysis methods, while at the same time fostering repeatability of experiments. To arrive at a challenge problem, we provide a general scheme for the systematic examination of automated concept detection methods, by decomposing the generic video indexing problem into 2 unimodal analysis experiments, 2 multimodal analysis experiments, and 1 combined analysis experiment. For each experiment, we evaluate generic video indexing performance on 85 hours of international broadcast news data, from the TRECVID 2005/2006 benchmark, using a lexicon of 101 semantic concepts. By establishing a minimum performance on each experiment, the challenge problem allows for component-based optimization of the generic indexing issue, while simultaneously offering other researchers a reference for comparison during indexing methodology development. To stimulate further investigations in intermediate analysis steps that influence video indexing performance, the challenge offers to the research community a manually annotated concept lexicon, pre-computed low-level multimedia features, trained classifier models, and five experiments together with baseline performance, which are all available at http://www.mediamill.nl/challenge/. }
    }
  69. Jan C. van Gemert, Cees G. M. Snoek, Cor Veenman, and Arnold W. M. Smeulders, "The Influence of Cross-Validation on Video Classification Performance," in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, 2006, pp. 695-698.
    @INPROCEEDINGS{GemertACM06,
      author = {Jan C. van Gemert and Cees G. M. Snoek and Cor Veenman and Arnold W. M. Smeulders},
      title = {The Influence of Cross-Validation on Video Classification Performance},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {695--698},
      month = {October},
      year = {2006},
      address = {Santa Barbara, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-crossvalidation-acm2006.pdf},
      abstract = { Digital video is sequential in nature. When video data is used in a semantic concept classification task, the episodes are usually summarized with shots. The shots are annotated as containing, or not containing, a certain concept resulting in a labeled dataset. These labeled shots can subsequently be used by supervised learning methods (classifiers) where they are trained to predict the absence or presence of the concept in unseen shots and episodes. The performance of such automatic classification systems is usually estimated with cross-validation. By taking random samples from the dataset for training and testing as such, part of the shots from an episode are in the training set and another part from the same episode is in the test set. Accordingly, data dependence between training and test set is introduced, resulting in too optimistic performance estimates. In this paper, we experimentally show this bias, and propose how this bias can be prevented using "episode-constrained" cross-validation. Moreover, we show that a 15\% higher classifier performance can be achieved by using episode constrained cross-validation for classifier parameter tuning. }
    }
  70. Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang P. Nguyen, and Dennis C. Koelma, "Lexicon-based Browsers for Searching in News Video Archives," in Proceedings of the International Conference on Pattern Recognition, Hong Kong, China, 2006, pp. 1256-1259.
    @INPROCEEDINGS{WorringICPR06,
      author = {Marcel Worring and Cees G. M. Snoek and Ork de Rooij and Giang P. Nguyen and Dennis C. Koelma},
      title = {Lexicon-based Browsers for Searching in News Video Archives},
      booktitle = {Proceedings of the International Conference on Pattern Recognition},
      pages = {1256--1259},
      month = {August},
      year = {2006},
      address = {Hong Kong, China},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/worring-browsers-icpr2006.pdf},
      abstract = { In this paper we present the methods and visualizations used in the MediaMill video search engine. The basis for the engine is a semantic indexing process which derives a lexicon of 101 concepts. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. The search system is evaluated within the TRECVID benchmark. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. }
    }
  71. Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders, "Learned Lexicon-driven Interactive Video Retrieval," in Proceedings of the International Conference on Image and Video Retrieval, CIVR 2006, Tempe, Arizona, July 13-15, 2006, Heidelberg, Germany, 2006, pp. 11-20.
    @INPROCEEDINGS{SnoekCIVR06,
      author = {Cees G. M. Snoek and Marcel Worring and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {Learned Lexicon-driven Interactive Video Retrieval},
      booktitle = {Proceedings of the International Conference on Image and Video Retrieval, CIVR 2006, Tempe, Arizona, July 13-15, 2006},
      editor = {H. Sundaram and others},
      series = {LNCS},
      volume = {4071},
      pages = {11--20},
      publisher = {Springer-Verlag},
      address = {Heidelberg, Germany},
      month = {July},
      year = {2006},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-lexicon-civr2006.pdf},
      demo = {http://isis-data.science.uva.nl/cgmsnoek/index.php/demonstrations/mediamill/},
      abstract = { We combine in this paper automatic learning of a large lexicon of semantic concepts with traditional video retrieval methods into a novel approach to narrow the semantic gap. The core of the proposed solution is formed by the automatic detection of an unprecedented lexicon of 101 concepts. From there, we explore the combination of query-by-concept, query-by-example, query-by-keyword, and user interaction into the \emph{MediaMill} semantic video search engine. We evaluate the search engine against the 2005 NIST TRECVID video retrieval benchmark, using an international broadcast news archive of 85 hours. Top ranking results show that the lexicon-driven search engine is highly effective for interactive video retrieval. }
    }
  72. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, and Arnold W. M. Smeulders, "The Semantic Pathfinder for Generic News Video Indexing," in Proceedings of the IEEE International Conference on Multimedia & Expo, Toronto, Canada, 2006, pp. 1469-1472.
    @INPROCEEDINGS{SnoekICME06,
      author = {Cees G. M. Snoek and Marcel Worring and Jan-Mark Geusebroek and Dennis C. Koelma and Frank J. Seinstra and Arnold W. M. Smeulders},
      title = {The Semantic Pathfinder for Generic News Video Indexing},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {1469--1472},
      month = {July},
      year = {2006},
      address = {Toronto, Canada},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-pathfinder-icme2006.pdf},
      abstract = { This paper presents the semantic pathfinder architecture for generic indexing of video archives. The pathfinder automatically extracts semantic concepts from video based on the exploration of different paths through three consecutive analysis steps, closely linked to the video production process, namely: content analysis, style analysis, and context analysis. The virtue of the semantic pathfinder is its learned ability to find a best path of analysis steps on a per-concept basis. To show the generality of this indexing approach we develop detectors for a lexicon of 32 concepts and we evaluate the semantic pathfinder against the 2004 NIST TRECVID video retrieval benchmark, using a news archive of 64 hours. Top ranking performance indicates the merit of the semantic pathfinder. }
    }
  73. Jan C. van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Robust Scene Categorization by Learning Image Statistics in Context," in Int’l Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR’06, New York, USA, 2006, pp. 105-112.
    @INPROCEEDINGS{GemertSLAM06,
      author = {Jan C. van Gemert and Jan-Mark Geusebroek and Cor J. Veenman and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Robust Scene Categorization by Learning Image Statistics in Context},
      booktitle = {Int'l Workshop on Semantic Learning Applications in Multimedia, in conjunction with {CVPR'06}},
      pages = {105--112},
      month = {June},
      year = {2006},
      address = {New York, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-scene-slam2006.pdf},
      abstract = { We present a generic and robust approach for scene categorization. A complex scene is described by proto-concepts like vegetation, water, fire, sky etc. These proto-concepts are represented by low level features, where we use natural images statistics to compactly represent color invariant texture information by a Weibull distribution. We introduce the notion of contextures which preserve the context of textures in a visual scene with an occurrence histogram (context) of similarities to proto-concept descriptors (texture). In contrast to a codebook approach, we use the similarity to all vocabulary elements to generalize beyond the code words. Visual descriptors are attained by combining different types of contexts with different texture parameters. The visual scene descriptors are generalized to visual categories by training a support vector machine. We evaluate our approach on 3 different datasets: 1) 50 categories for the TRECVID video dataset; 2) the Caltech 101-object images; 3) 89 categories being the intersection of the Corel photo stock with the Art Explosion photo stock. Results show that our approach is robust over different datasets, while maintaining competitive performance. }
    }
  74. Arnold W. M. Smeulders, Jan C. van Gemert, Jan-Mark Geusebroek, Cees G. M. Snoek, and Marcel Worring, "Browsing for the National Dutch Video Archive," in ISCCSP2006, Marrakech, Morocco, 2006.
    @INPROCEEDINGS{SmeuldersISCCSP06,
      author = {Arnold W. M. Smeulders and Jan C. van Gemert and Jan-Mark Geusebroek and Cees G. M. Snoek and Marcel Worring},
      title = {Browsing for the National {Dutch} Video Archive},
      booktitle = {ISCCSP2006},
      pages = {},
      month = {March},
      year = {2006},
      address = {Marrakech, Morocco},
      pdf = {http://www.science.uva.nl/~smeulder/pubs/ISCCSP2006SmeuldersTEMP.pdf},
      abstract = { Pictures have always been a prime carrier of Dutch culture. But pictures take a new form. We live in times of broad- and narrowcasting through Internet, of passive and active viewers, of direct or delayed broadcast, and of digital pictures being delivered in the museum or at home. At the same time, the picture and television archives turn digital. Archives are going to be swamped with information requests unless they swiftly adapt to partially automatic annotation and digital retrieval. Our aim is to provide faster and more complete access to picture archives by digital analysis. Our approach consists of a multi-media analysis of features of pictures in tandem with the language that describes those pictures, under the guidance of a visual ontology. The general scientific paradigm we address is the detection of directly observables fused into semantic features learned from large repositories of digital video. We use invariant, natural-image statistics-based contextual feature sets for capturing the concepts of images and integrate that as early as possible with text. The system consists of a large for science yet small for practice set of visual concepts permitting the retrieval of semantically formulated queries. We will demonstrate a PC-based, off-line trained state of the art system for browsing broadcast news-archives. }
    }
  75. Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Early versus Late Fusion in Semantic Video Analysis," in Proceedings of the ACM International Conference on Multimedia, Singapore, 2005, pp. 399-402.
    @INPROCEEDINGS{SnoekACM05a,
      author = {Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},
      title = {Early versus Late Fusion in Semantic Video Analysis},
      booktitle = {Proceedings of the {ACM} International Conference on Multimedia},
      pages = {399--402},
      month = {November},
      year = {2005},
      address = {Singapore},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-earlylate-acm2005.pdf},
      abstract = { Semantic analysis of multimodal video aims to index segments of interest at a conceptual level. In reaching this goal, it requires an analysis of several information streams. At some point in the analysis these streams need to be fused. In this paper, we consider two classes of fusion schemes, namely early fusion and late fusion. The former fuses modalities in feature space, the latter fuses modalities in semantic space. We show by experiment on 184 hours of broadcast video data and for 20 semantic concepts, that late fusion tends to give slightly better performance for most concepts. However, for those concepts where early fusion performs better the difference is more significant. }
    }
  76. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra, "On the Surplus Value of Semantic Video Analysis Beyond the Key Frame," in Proceedings of the IEEE International Conference on Multimedia & Expo, Amsterdam, The Netherlands, 2005.
    @INPROCEEDINGS{SnoekICME05a,
      author = {Cees G. M. Snoek and Marcel Worring and Jan-Mark Geusebroek and Dennis C. Koelma and Frank J. Seinstra},
      title = {On the Surplus Value of Semantic Video Analysis Beyond the Key Frame},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {},
      month = {July},
      year = {2005},
      address = {Amsterdam, The Netherlands},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-surplus-icme2005.pdf},
      abstract = { Typical semantic video analysis methods aim for classification of camera shots based on extracted features from a single key frame only. In this paper, we sketch a video analysis scenario and evaluate the benefit of analysis beyond the key frame for semantic concept detection performance. We developed detectors for a lexicon of 26 concepts, and evaluated their performance on 120 hours of video data. Results show that, on average, detection performance can increase with almost 40\% when the analysis method takes more visual content into account. }
    }
  77. Cees G. M. Snoek and Marcel Worring, "Multimedia Pattern Recognition in Soccer Video using Time Intervals," in Classification the Ubiquitous Challenge, Proceedings of the 28th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Dortmund, March 9-11, 2004, Berlin, Germany, 2005, pp. 97-108.
    @INPROCEEDINGS{SnoekGFKL05,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Multimedia Pattern Recognition in Soccer Video using Time Intervals},
      booktitle = {Classification the Ubiquitous Challenge, Proceedings of the 28th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Dortmund, March 9-11, 2004},
      publisher = {Springer-Verlag},
      series = {Studies in Classification, Data Analysis, and Knowledge Organization},
      editor = {C. Weihs and W. Gaul},
      pages = {97--108},
      year = {2005},
      address = {Berlin, Germany},
      pdf = {},
      demo = {http://www.goalgle.com/},
      abstract = { In this paper we propose the Time Interval Multimedia Event (TIME) framework as a robust approach for recognition of multimedia patterns, e.g. highlight events, in soccer video. The representation used in TIME extends the Allen temporal interval relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimedia pattern recognition. For automatic classification of highlights in soccer video, we compare three different machine learning techniques, i.c. C4.5 decision tree, Maximum Entropy, and Support Vector Machine. It was found that by using the TIME framework the amount of video a user has to watch in order to see almost all highlights can be reduced considerably, especially in combination with a Support Vector Machine. }
    }
  78. Frank J. Seinstra, Cees G. M. Snoek, Dennis C. Koelma, Jan-Mark Geusebroek, and Marcel Worring, "User Transparent Parallel Processing of the 2004 NIST TRECVID Data Set," in Proceedings of the 19th IEEE International Parallel & Distributed Processing Symposium, Denver, USA, 2005, pp. 90-97.
    @INPROCEEDINGS{SeinstraIPDPS05,
      author = {Frank J. Seinstra and Cees G. M. Snoek and Dennis C. Koelma and Jan-Mark Geusebroek and Marcel Worring},
      title = {User Transparent Parallel Processing of the 2004 {NIST} {TRECVID} Data Set},
      booktitle = {Proceedings of the 19th IEEE International Parallel \& Distributed Processing Symposium},
      pages = {90--97},
      month = {April},
      year = {2005},
      address = {Denver, USA},
      pdf = {http://staff.science.uva.nl/~fjseins/Papers/Conferences/ipdps2005.pdf},
      abstract = { The Parallel-Horus framework, developed at the University of Amsterdam, is a unique software architecture that allows non-expert parallel programmers to develop fully sequential multimedia applications for efficient execution on homogeneous Beowulf-type commodity clusters. Previously obtained results for realistic, but relatively small-sized applications have shown the feasibility of the Parallel-Horus approach, with parallel performance consistently being found to be optimal with respect to the abstraction level of message passing programs. In this paper we discuss the most serious challenge Parallel-Horus has had to deal with so far: the processing of over 184 hours of video included in the 2004 NIST TRECVID evaluation, i.e. the de facto international standard benchmark for content-based video retrieval. Our results and experiences confirm that Parallel-Horus is a very powerful support-tool for state-of-the-art research and applications in multimedia processing. }
    }
  79. Cees G. M. Snoek, Marcel Worring, and Alexander G. Hauptmann, "Detection of TV News Monologues by Style Analysis," in Proceedings of the IEEE International Conference on Multimedia & Expo, Taipei, Taiwan, 2004.
    @INPROCEEDINGS{SnoekICME04,
      author = {Cees G. M. Snoek and Marcel Worring and Alexander G. Hauptmann},
      title = {Detection of {TV} News Monologues by Style Analysis},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {},
      month = {June},
      year = {2004},
      address = {Taipei, Taiwan},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-style-icme2004.pdf},
      abstract = { We propose a method for detection of semantic concepts in produced video based on style analysis. Recognition of concepts is done by applying a classifier ensemble to the detected style elements. As a case study we present a method for detecting the concept of news subject monologues. Our approach had the best average precision performance amongst 26 submissions in the 2003 TRECVID benchmark. }
    }
  80. Cees G. M. Snoek and Marcel Worring, "Time Interval Maximum Entropy based Event Indexing in Soccer Video," in Proceedings of the IEEE International Conference on Multimedia & Expo, Baltimore, USA, 2003, pp. 481-484.
    @INPROCEEDINGS{SnoekICME03a,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Time Interval Maximum Entropy based Event Indexing in Soccer Video},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      pages = {481--484},
      month = {July},
      year = {2003},
      address = {Baltimore, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/icme2003.pdf},
      demo = {http://www.goalgle.com/},
      abstract = { Multimodal indexing of events in video documents poses problems with respect to representation, inclusion of contextual information, and synchronization of the heterogeneous information sources involved. In this paper we present the Time Interval Maximum Entropy (TIME) framework that tackles aforementioned problems. To demonstrate the viability of TIME for event classification in multimodal video, an evaluation was performed on the domain of soccer broadcasts. It was found that by applying TIME, the amount of video a user has to watch in order to see almost all highlights can be reduced considerably. }
    }
  81. Marcel Worring, Andrew Bagdanov, Jan C. van Gemert, Jan-Mark Geusebroek, Minh Hoang, Guus Schreiber, Cees G. M. Snoek, Jeroen Vendrig, Jan Wielemaker, and Arnold W. M. Smeulders, "Interactive Indexing and Retrieval of Multimedia Content," in Proceedings of the 29th Annual Conference on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic, 2002, pp. 135-148.
    @INPROCEEDINGS{WorringSOFSEM02,
      author = {Marcel Worring and Andrew Bagdanov and Jan C. van Gemert and Jan-Mark Geusebroek and Minh Hoang and Guus Schreiber and Cees G. M. Snoek and Jeroen Vendrig and Jan Wielemaker and Arnold W. M. Smeulders},
      title = {Interactive Indexing and Retrieval of Multimedia Content},
      booktitle = {Proceedings of the 29th Annual Conference on Current Trends in Theory and Practice of Informatics},
      series = {Lecture Notes in Computer Science},
      volume = {2540},
      pages = {135--148},
      publisher = {Springer-Verlag},
      year = {2002},
      address = {Milovy, Czech Republic},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sofsem2002.pdf},
      abstract = { The indexing and retrieval of multimedia items is difficult due to the semantic gap between the user's perception of the data and the descriptions we can derive automatically from the data using computer vision, speech recognition, and natural language processing. In this contribution we consider the nature of the semantic gap in more detail and show examples of methods that help in limiting the gap. These methods can be automatic, but in general the indexing and retrieval of multimedia items should be a collaborative process between the system and the user. We show how to employ the user's interaction for limiting the semantic gap. }
    }
  82. Cees G. M. Snoek and Marcel Worring, "A Review on Multimodal Video Indexing," in Proceedings of the IEEE International Conference on Multimedia & Expo, Lausanne, Switzerland, 2002, pp. 21-24.
    @INPROCEEDINGS{SnoekICME02,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {A Review on Multimodal Video Indexing},
      booktitle = {Proceedings of the {IEEE} International Conference on Multimedia \& Expo},
      volume = {2},
      pages = {21--24},
      month = {August},
      year = {2002},
      address = {Lausanne, Switzerland},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/icme2002.pdf},
      abstract = { Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is unfeasible for large video collections. Efficient, single modality based, video indexing methods have appeared in literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in a collaborative fashion. In this paper we present a framework for multimodal video indexing, which views a video document from the perspective of its author. The framework serves as a blueprint for a generic and flexible multimodal video indexing system, and generalizes different state-of-the-art video indexing methods. It furthermore forms the basis for categorizing these different methods. }
    }

National Meetings

  1. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Convex Reduced Kernels for Visual Categorization," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
    Best paper award
    @INPROCEEDINGS{GavvesASCI12,
      author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Convex Reduced Kernels for Visual Categorization},
      booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Rotterdam, The Netherlands},
      month = {October},
      year = {2012},
      note = {Best paper award},
      pdf = {}
    }
  2. Amirhossein Habibian and Cees G. M. Snoek, "Stop-Frame Removal Improves Web Video Classification," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
    Best poster award
    @INPROCEEDINGS{HabibianASCI12,
      author = {Amirhossein Habibian and Cees G. M. Snoek},
      title = {Stop-Frame Removal Improves Web Video Classification},
      booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Rotterdam, The Netherlands},
      month = {October},
      year = {2012},
      note = {Best poster award},
      pdf = {}
    }
  3. Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek, "Learning Concepts from the Web: Some Frames are More Important than Others," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
    @INPROCEEDINGS{KordumovaASCI12,
      author = {Svetlana Kordumova and Xirong Li and Cees G. M. Snoek},
      title = {Learning Concepts from the Web: Some Frames are More Important than Others},
      booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Rotterdam, The Netherlands},
      month = {October},
      year = {2012},
      pdf = {}
    }
  4. Masoud Mazloom, Efstratios Gavves, Koen E. A. van de Sande, and Cees G. M. Snoek, "Learning to Select Semantic Video Event Representations," in Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging, Rotterdam, The Netherlands, 2012.
    @INPROCEEDINGS{MazloomASCI12,
      author = {Masoud Mazloom and Efstratios Gavves and Koen E. A. van de Sande and Cees G. M. Snoek},
      title = {Learning to Select Semantic Video Event Representations},
      booktitle = {Proceedings of the 18th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Rotterdam, The Netherlands},
      month = {October},
      year = {2012},
      pdf = {}
    }
  5. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Landmark Image Retrieval with Visual Synonyms," in Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging, Veldhoven, The Netherlands, 2010.
    Best paper award
    @INPROCEEDINGS{GavvesASCI10,
      author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Landmark Image Retrieval with Visual Synonyms},
      booktitle = {Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Veldhoven, The Netherlands},
      month = {November},
      year = {2010},
      note = {Best paper award},
      pdf = {}
    }
  6. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Combining Multi-feature Tag Relevance Learning for Social Image Retrieval," in Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging, Veldhoven, The Netherlands, 2010.
    @INPROCEEDINGS{LiASCI10,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},
      title = {Combining Multi-feature Tag Relevance Learning for Social Image Retrieval},
      booktitle = {Proceedings of the 16th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Veldhoven, The Netherlands},
      month = {November},
      year = {2010},
      pdf = {}
    }
  7. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Tag Relevance Learning for Social Image Retrieval and Labeling," in Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging, Zeewolde, The Netherlands, 2009.
    @INPROCEEDINGS{LiASCI09,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},
      title = {Tag Relevance Learning for Social Image Retrieval and Labeling},
      booktitle = {Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Zeewolde, The Netherlands},
      month = {June},
      year = {2009},
      pdf = {}
    }
  8. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Empowering Visual Categorization with the GPU," in Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging, Zeewolde, The Netherlands, 2009.
    @INPROCEEDINGS{SandeASCI09,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {Empowering Visual Categorization with the {GPU}},
      booktitle = {Proceedings of the 15th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Zeewolde, The Netherlands},
      month = {June},
      year = {2009},
      pdf = {}
    }
  9. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Consuming Videos with the ForkBrowser," in Proceedings of the 14th Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 2008.
    @INPROCEEDINGS{RooijASCI08,
      author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},
      title = {Consuming Videos with the ForkBrowser},
      booktitle = {Proceedings of the 14th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Heijen, The Netherlands},
      month = {June},
      year = {2008},
      pdf = {}
    }
  10. Ork de Rooij, Cees G. M. Snoek, and Marcel Worring, "Multi Thread Video Browsing," in Proceedings of the 13th Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 2007.
    @INPROCEEDINGS{RooijASCI07,
      author = {Ork de Rooij and Cees G. M. Snoek and Marcel Worring},
      title = {Multi Thread Video Browsing},
      booktitle = {Proceedings of the 13th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Heijen, The Netherlands},
      month = {June},
      year = {2007},
      pdf = {}
    }
  11. Jan C. van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, and Cees G. M. Snoek, "Generic and Robust Scene Categorization by Learning Context," in Proceedings of the 12th Annual Conference of the Advanced School for Computing and Imaging, Lommel, Belgium, 2006.
    @INPROCEEDINGS{GemertASCI06,
      author = {Jan C. van Gemert and Jan-Mark Geusebroek and Cor J. Veenman and Cees G. M. Snoek},
      title = {Generic and Robust Scene Categorization by Learning Context},
      booktitle = {Proceedings of the 12th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Lommel, Belgium},
      month = {June},
      year = {2006},
      pdf = {}
    }
  12. Cees G. M. Snoek and Marcel Worring, "Time Interval based Modelling and Classification of Events in Soccer Video," in Proceedings of the 9th Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 2003.
    @INPROCEEDINGS{SnoekASCI03,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Time Interval based Modelling and Classification of Events in Soccer Video},
      booktitle = {Proceedings of the 9th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {},
      address = {Heijen, The Netherlands},
      month = {June},
      year = {2003},
      pdf = {}
    }
  13. Cees G. M. Snoek and Marcel Worring, "A State-of-the-art Review on Multimodal Video Indexing," in Proceedings of the 8th Annual Conference of the Advanced School for Computing and Imaging, Lochem, The Netherlands, 2002, pp. 194-202.
    @INPROCEEDINGS{SnoekASCI02,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {A State-of-the-art Review on Multimodal Video Indexing},
      booktitle = {Proceedings of the 8th Annual Conference of the Advanced School for Computing and Imaging},
      pages = {194--202},
      address = {Lochem, The Netherlands},
      month = {June},
      year = {2002},
      pdf = {}
    }
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
