Publications

Journal Papers

  1. Jianfeng Dong, Xirong Li, and Cees G. M. Snoek, "Predicting Visual Features from Text for Image and Video Caption Retrieval," IEEE Transactions on Multimedia, 2017.
    Submitted
    @ARTICLE{DongTEMP17,
      author = {Jianfeng Dong and Xirong Li and Cees G. M. Snoek},
      title = {Predicting Visual Features from Text for Image and Video Caption Retrieval},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {},
      month = {},
      year = {2017},
      volume = {},
      number = {},
      pdf = {http://arxiv.org/abs/1604.06838},
      note = {Submitted},
      abstract = { }
    }
  2. Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees G. M. Snoek, "VideoLSTM Convolves, Attends and Flows for Action Recognition," Computer Vision and Image Understanding, 2017.
    In press
    @ARTICLE{LiCVIU17,
      author = {Zhenyang Li and Kirill Gavrilyuk and Efstratios Gavves and Mihir Jain and Cees G. M. Snoek},
      title = {{VideoLSTM} Convolves, Attends and Flows for Action Recognition},
      journal = {Computer Vision and Image Understanding},
      pages = {},
      month = {},
      year = {2017},
      volume = {},
      number = {},
      pdf = {http://arxiv.org/abs/1607.01794},
      note = {In press},
      abstract = { }
    }
  3. Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek, "Video2vec Embeddings Recognize Events when Examples are Scarce," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, iss. 10, pp. 2089-2103, 2017.
    @ARTICLE{HabibianPAMI17,
      author = {Amirhossein Habibian and Thomas Mensink and Cees G. M. Snoek},
      title = {{Video2vec} Embeddings Recognize Events when Examples are Scarce},
      journal = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
      pages = {2089--2103},
      month = {October},
      year = {2017},
      volume = {39},
      number = {10},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-video2vec-pami.pdf},
      abstract = { This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call Video2vec, the correlations between the words are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the Video2vec embedding using a multimodal predictability loss, including appearance, motion and audio features, results in a better predictable representation. We also propose an event specific variant of Video2vec to learn a more accurate representation for the words, which are indicative of the event, by introducing a term sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of Video2vec over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, iii) the complementarity of term sensitive descriptiveness and multimodal predictability for event recognition. By its ability to improve predictability of present day audio-visual video features, while at the same time maximizing their semantic descriptiveness, Video2vec leads to state-of-the-art accuracy for both few- and zero-example recognition of events in video. }
    }
  4. Jingkuan Song, Hervé Jégou, Cees Snoek, Qi Tian, and Nicu Sebe, "Guest Editorial: Large-Scale Multimedia Data Retrieval, Classification, and Understanding," IEEE Transactions on Multimedia, vol. 19, iss. 9, pp. 1965-1967, 2017.
    @ARTICLE{SongTMM17,
      author = {Jingkuan Song and Herv\'e J\'egou and Cees Snoek and Qi Tian and Nicu Sebe},
      title = {Guest Editorial: Large-Scale Multimedia Data Retrieval, Classification, and Understanding},
      journal = {{IEEE} Transactions on Multimedia},
      month = {September},
      year = {2017},
      volume = {19},
      number = {9},
      pages = {1965--1967},
      pdf = {},
      abstract = { The papers in this special section focus on multimedia data retrieval and classification via large-scale systems. Today, large collections of multimedia data are explosively created in different fields and have attracted increasing interest in the multimedia research area. Large-scale multimedia data provide unprecedented opportunities to address many challenging research problems, e.g., enabling generic visual classification to bridge the well-known semantic gap by exploring large-scale data, offering a promising possibility for in-depth multimedia understanding, as well as discerning patterns and making better decisions by analyzing the large pool of data. Therefore, the techniques for large-scale multimedia retrieval, classification, and understanding are highly desired. Simultaneously, the explosion of multimedia data creates urgent needs for more sophisticated and robust models and algorithms to retrieve, classify, and understand these data. Another interesting challenge is how traditional machine learning algorithms can be scaled up to millions and even billions of items with thousands of dimensions. This motivated the community to design parallel and distributed machine learning platforms, exploiting GPUs as well as developing practical algorithms. Besides, it is also important to exploit the commonalities and differences between different tasks, e.g., image retrieval and classification have much in common while different indexing methods evolve in a mutually supporting way. }
    }
  5. Mihir Jain, Jan C. van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees G. M. Snoek, "Tubelets: Unsupervised Action Proposals from Spatiotemporal Super-voxels," International Journal of Computer Vision, vol. 124, iss. 3, pp. 287-311, 2017.
    @ARTICLE{JainIJCV17,
      author = {Mihir Jain and Jan C. van Gemert and Herv\'e J\'egou and Patrick Bouthemy and Cees G. M. Snoek},
      title = {Tubelets: Unsupervised Action Proposals from Spatiotemporal Super-voxels},
      journal = {International Journal of Computer Vision},
      pages = {287--311},
      month = {September},
      year = {2017},
      volume = {124},
      number = {3},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-tubelets-ijcv.pdf},
      abstract = { This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an approach to generate action proposals from spatiotemporal super-voxels in an unsupervised manner, we call them Tubelets. Second, along with the static features from individual frames our approach advantageously exploits motion. We introduce independent motion evidence as a feature to characterize how the action deviates from the background and explicitly incorporate such motion information in various stages of the proposal generation. Finally, we introduce spatiotemporal refinement of Tubelets, for more precise localization of actions, and pruning to keep the number of Tubelets limited. We demonstrate the suitability of our approach by extensive experiments for action proposal quality and action localization on three public datasets: UCF Sports, MSR-II and UCF101. For action proposal quality, our unsupervised proposals beat all other existing approaches on the three datasets. For action localization, we show top performance on both the trimmed videos of UCF Sports and UCF101 as well as the untrimmed videos of MSR-II. }
    }
  6. Pascal Mettes, Jan C. van Gemert, and Cees G. M. Snoek, "No Spare Parts: Sharing Part Detectors for Image Categorization," Computer Vision and Image Understanding, vol. 152, pp. 131-141, 2016.
    @ARTICLE{MettesCVIU16,
      author = {Pascal Mettes and Jan C. van Gemert and Cees G. M. Snoek},
      title = {No Spare Parts: Sharing Part Detectors for Image Categorization},
      journal = {Computer Vision and Image Understanding},
      pages = {131--141},
      month = {November},
      year = {2016},
      volume = {152},
      number = {},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-spare-parts-cviu.pdf},
      abstract = { This work aims for image categorization by learning a representation of discriminative parts. Different from most existing part-based methods, we argue that parts are naturally shared between image categories and should be modeled as such. We motivate our approach with a quantitative and qualitative analysis by backtracking where selected parts come from. Our analysis shows that in addition to the category parts defining the category, the parts coming from the background context and parts from other image categories improve categorization performance. Part selection should not be done separately for each category, but instead be shared and optimized over all categories. To incorporate part sharing between categories, we present an algorithm based on AdaBoost to optimize part sharing and selection, as well as fusion with the global image representation. With a single algorithm and without the need for task-specific optimization, we achieve results competitive to the state-of-the-art on object, scene, and action categories, further improving over deep convolutional neural networks and alternative part representations. }
    }
  7. George Awad, Cees G. M. Snoek, Alan F. Smeaton, and Georges Quénot, "TRECVid Semantic Indexing of Video: A 6-year Retrospective," ITE Transactions on Media Technology and Applications, vol. 4, iss. 3, pp. 187-208, 2016.
    ITE Niwa-Takayanagi Award
    @ARTICLE{AwadTMTA16,
      author = {George Awad and Cees G. M. Snoek and Alan F. Smeaton and Georges Qu\'enot},
      title = {TRECVid Semantic Indexing of Video: A 6-year Retrospective},
      journal = {ITE Transactions on Media Technology and Applications},
      pages = {187--208},
      month = {},
      year = {2016},
      volume = {4},
      number = {3},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/awad-trecvid-retrospective-ite.pdf},
      note = {ITE Niwa-Takayanagi Award},
      abstract = { Semantic indexing, or assigning semantic tags to video samples, is a key component for content-based access to video documents and collections. The Semantic Indexing task has been run at TRECVid from 2010 to 2015 with the support of NIST and the Quaero project. As with the previous High-Level Feature detection task which ran from 2002 to 2009, the semantic indexing task aims at evaluating methods and systems for detecting visual, auditory or multi-modal concepts in video shots. In addition to the main semantic indexing task, four secondary tasks were proposed, namely the ``localization'' task, the ``concept pair'' task, the ``no annotation'' task, and the ``progress'' task. It attracted over 40 research teams during its running period. The task was conducted using a total of 1,400 hours of video data drawn from Internet Archive videos with Creative Commons licenses gathered by NIST. 200 hours of new test data was made available each year plus 200 more as development data in 2010. The number of target concepts to be detected started from 130 in 2010 and was extended to 346 in 2011. Both the increase in the volume of video data and in the number of target concepts favored the development of generic and scalable methods. Over 8 million shots$\times$concepts direct annotations plus over 20 million indirect ones were produced by the participants and the Quaero project on a total of 800 hours of development data. Significant progress was accomplished during the period as this was accurately measured in the context of the progress task but also from some of the participants' contrast experiments. This paper describes the data, protocol and metrics used for the main and the secondary tasks, the results obtained and the main approaches used by participants. }
    }
  8. Masoud Mazloom, Xirong Li, and Cees G. M. Snoek, "TagBook: A Semantic Video Representation without Supervision for Event Detection," IEEE Transactions on Multimedia, vol. 18, iss. 7, pp. 1378-1388, 2016.
    @ARTICLE{MazloomTMM16,
      author = {Masoud Mazloom and Xirong Li and Cees G. M. Snoek},
      title = {{TagBook}: A Semantic Video Representation without Supervision for Event Detection},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {1378--1388},
      month = {July},
      year = {2016},
      volume = {18},
      number = {7},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-tagbook-tmm.pdf},
      abstract = { We consider the problem of event detection in video for scenarios where only a few, or even zero, examples are available for training. For this challenging setting, the prevailing solutions in the literature rely on a semantic video representation obtained from thousands of pre-trained concept detectors. Different from existing work, we propose a new semantic video representation that is based on freely available social tagged videos only, without the need for training any intermediate concept detectors. We introduce a simple algorithm that propagates tags from a video's nearest neighbors, similar in spirit to the ones used for image retrieval, but redesign it for video event detection by including video source set refinement and varying the video tag assignment. We call our approach TagBook and study its construction, descriptiveness and detection performance on the TRECVID 2013 and 2014 multimedia event detection datasets and the Columbia Consumer Video dataset. Despite its simple nature, the proposed TagBook video representation is remarkably effective for few-example and zero-example event detection, even outperforming very recent state-of-the-art alternatives building on supervised representations. }
    }
  9. Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G. M. Snoek, and Alberto Del Bimbo, "Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval," ACM Computing Surveys, vol. 49, iss. 1, pp. 14:1-39, 2016.
    @ARTICLE{LiCSUR16,
      author = {Xirong Li and Tiberio Uricchio and Lamberto Ballan and Marco Bertini and Cees G. M. Snoek and Alberto Del Bimbo},
      title = {Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval},
      journal = {ACM Computing Surveys},
      pages = {14:1--39},
      month = {June},
      year = {2016},
      volume = {49},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-survey-csur.pdf},
      software = {https://github.com/li-xirong/jingwei},
      abstract = { Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems (i.e., image tag assignment, refinement, and tag-based image retrieval) is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, that is, estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this article introduces a two-dimensional taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and difference, and recognize their merits and limitations. For a head-to-head comparison with the state of the art, a new experimental protocol is presented, with training sets containing 10,000, 100,000, and 1 million images, and an evaluation on three test sets, contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future. }
    }
  10. Henri Bal, Dick Epema, Cees de Laat, Rob van Nieuwpoort, John Romein, Frank Seinstra, Cees Snoek, and Harry Wijshoff, "A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term," IEEE Computer, vol. 49, iss. 5, pp. 54-63, 2016.
    @ARTICLE{BalCOM16,
      author = {Henri Bal and Dick Epema and Cees de Laat and Rob van Nieuwpoort and John Romein and Frank Seinstra and Cees Snoek and Harry Wijshoff},
      title = {A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term},
      journal = {{IEEE} Computer},
      pages = {54--63},
      month = {May},
      year = {2016},
      volume = {49},
      number = {5},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/bal-das-computer.pdf},
      abstract = { The Dutch Advanced School for Computing and Imaging has built five generations of a 200-node distributed system over nearly two decades while remaining aligned with the shifting computer science research agenda. The system has supported years of award-winning research, underlining the benefits of investing in a smaller-scale, tailored design. }
    }
  11. Luming Zhang, Rongrong Ji, Zhen Yi, Weisi Lin, and Cees G. M. Snoek, "Special issue on weakly supervised learning," Journal of Visual Communication and Image Representation, vol. 37, pp. 1-2, 2016.
    @ARTICLE{ZhangJVCIR16,
      author = {Luming Zhang and Rongrong Ji and Zhen Yi and Weisi Lin and Cees G. M. Snoek},
      title = {Special issue on weakly supervised learning},
      journal = {Journal of Visual Communication and Image Representation},
      pages = {1--2},
      month = {May},
      year = {2016},
      volume = {37},
      number = {},
      pdf = {},
      abstract = { }
    }
  12. Jitao Sang, Yue Gao, Bing-kun Bao, Cees G. M. Snoek, and Qionghai Dai, "Recent advances in social multimedia big data mining and applications," Multimedia Systems, vol. 22, iss. 1, pp. 1-3, 2016.
    @ARTICLE{SangMS16,
      author = {Jitao Sang and Yue Gao and Bing-kun Bao and Cees G. M. Snoek and Qionghai Dai},
      title = {Recent advances in social multimedia big data mining and applications},
      journal = {Multimedia Systems},
      pages = {1--3},
      month = {February},
      year = {2016},
      volume = {22},
      number = {1},
      pdf = {},
      abstract = { In the past decade, social media has contributed significantly to the arrival of the Big Data era. Big Data has not only provided new solutions for social media mining and applications, but also brought about a paradigm shift in many fields of data analytics. This special issue solicits recent related attempts in the multimedia community. We believe that the enclosed papers in this special issue provide a unique opportunity for multidisciplinary works connecting both the social media and big data contexts to multimedia computing. }
    }
  13. Meng Wang, Ke Lu, Gang Hua, and Cees G. M. Snoek, "Guest editorial: selected papers from ICIMCS 2013," Multimedia Systems, vol. 21, iss. 2, pp. 131-132, 2015.
    @ARTICLE{WangMS15,
      author = {Meng Wang and Ke Lu and Gang Hua and Cees G. M. Snoek},
      title = {Guest editorial: selected papers from ICIMCS 2013},
      journal = {Multimedia Systems},
      pages = {131--132},
      month = {March},
      year = {2015},
      volume = {21},
      number = {2},
      pdf = {},
      abstract = { }
    }
  14. Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek, "Best Practices for Learning Video Concept Detectors from Social Media Examples," Multimedia Tools and Applications, vol. 74, iss. 4, pp. 1291-1315, 2015.
    @ARTICLE{KordumovaMMTA15,
      author = {Svetlana Kordumova and Xirong Li and Cees G. M. Snoek},
      title = {Best Practices for Learning Video Concept Detectors from Social Media Examples},
      journal = {Multimedia Tools and Applications},
      pages = {1291--1315},
      month = {February},
      year = {2015},
      volume = {74},
      number = {4},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-practices-mmta.pdf},
      abstract = { Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that key questions on how to learn effective video concept detectors from social media examples remain open. As an initial attempt to answer these questions, we conduct an experimental study using a video search engine which is capable of learning concept detectors from social media examples, be it socially tagged videos or socially tagged images. Within the video search engine we investigate three strategies for positive example selection, three negative example selection strategies and three learning strategies. The performance is evaluated on the challenging TRECVID 2012 benchmark consisting of 600 h of Internet video. From the experiments we derive four best practices: (1) tagged images are a better source for learning video concepts than tagged videos, (2) selecting tag relevant positive training examples is always beneficial, (3) selecting relevant negative examples is advantageous and should be treated differently for video and image sources, and (4) learning concept detectors with selected relevant training data before learning is better than incorporating the relevance during the learning process. The best practices within our video search engine lead to state-of-the-art performance in the TRECVID 2013 benchmark for concept detection without manually provided annotations. }
    }
  15. Efstratios Gavves, Basura Fernando, Cees G. M. Snoek, Arnold W. M. Smeulders, and Tinne Tuytelaars, "Local Alignments for Fine-Grained Categorization," International Journal of Computer Vision, vol. 111, iss. 2, pp. 191-212, 2015.
    @ARTICLE{GavvesIJCV15,
      author = {Efstratios Gavves and Basura Fernando and Cees G. M. Snoek and Arnold W. M. Smeulders and Tinne Tuytelaars},
      title = {Local Alignments for Fine-Grained Categorization},
      journal = {International Journal of Computer Vision},
      pages = {191--212},
      month = {January},
      year = {2015},
      volume = {111},
      number = {2},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-finegrained-ijcv.pdf},
      abstract = { The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape. Then, one may proceed to the differential classification by examining the corresponding regions of the alignments. More specifically, the alignments are used to transfer part annotations from training images to unseen images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We further argue that for the distinction of sub-classes, distribution-based features like color Fisher vectors are better suited for describing localized appearance of fine-grained categories than popular matching oriented intensity features, like HOG. They allow capturing the subtle local differences between subclasses, while at the same time being robust to misalignments between distinctive details. We evaluate the local alignments on the CUB-2011 and on the Stanford Dogs datasets, composed of 200 and 120, visually very hard to distinguish bird and dog species. In our experiments we study and show the benefit of the color Fisher vector parameterization, the influence of the alignment partitioning, and the significance of object segmentation on fine-grained categorization. We, furthermore, show that by using object detectors as voters to generate object confidence saliency maps, we arrive at fully unsupervised, yet highly accurate fine-grained categorization. The proposed local alignments set a new state-of-the-art on both the fine-grained birds and dogs datasets, even without any human intervention. What is more, the local alignments reveal what appearance details are most decisive per fine-grained object category. }
    }
  16. Masoud Mazloom, Efstratios Gavves, and Cees G. M. Snoek, "Conceptlets: Selective Semantics for Classifying Video Events," IEEE Transactions on Multimedia, vol. 16, iss. 8, pp. 2214-2228, 2014.
    @ARTICLE{MazloomTMM14,
      author = {Masoud Mazloom and Efstratios Gavves and Cees G. M. Snoek},
      title = {Conceptlets: Selective Semantics for Classifying Video Events},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {2214--2228},
      month = {December},
      year = {2014},
      volume = {16},
      number = {8},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-conceptlets-tmm.pdf},
      abstract = { An emerging trend in video event classification is to learn an event from a bank of concept detector scores. Different from existing work, which simply relies on a bank containing all available detectors, we propose in this paper an algorithm that learns from examples what concepts in a bank are most informative per event, which we call the conceptlet. We model finding the conceptlet out of a large set of concept detectors as an importance sampling problem. Our proposed approximate algorithm finds the optimal conceptlet using a cross-entropy optimization. We study the behavior of video event classification based on conceptlets by performing four experiments on challenging internet video from the 2010 and 2012 TRECVID multimedia event detection tasks and Columbia's consumer video dataset. Starting from a concept bank of more than a thousand precomputed detectors, our experiments establish there are (sets of) individual concept detectors that are more discriminative and appear to be more descriptive for a particular event than others, event classification using an automatically obtained conceptlet is more robust than using all available concepts, and conceptlets obtained with our cross-entropy algorithm are better than conceptlets from state-of-the-art feature selection algorithms. What is more, the conceptlets make sense for the events of interest, without being programmed to do so. }
    }
  17. Amirhossein Habibian and Cees G. M. Snoek, "Recommendations for Recognizing Video Events by Concept Vocabularies," Computer Vision and Image Understanding, vol. 124, pp. 110-122, 2014.
    @ARTICLE{HabibianCVIU14,
      author = {Amirhossein Habibian and Cees G. M. Snoek},
      title = {Recommendations for Recognizing Video Events by Concept Vocabularies},
      journal = {Computer Vision and Image Understanding},
      pages = {110--122},
      month = {July},
      year = {2014},
      volume = {124},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-recommendations-cviu.pdf},
      abstract = { Representing videos using vocabularies composed of concept detectors appears promising for generic event recognition. While many have recently shown the benefits of concept vocabularies for recognition, studying the characteristics of a universal concept vocabulary suited for representing events is ignored. In this paper, we study how to create an effective vocabulary for arbitrary-event recognition in web video. We consider five research questions related to the number, the type, the specificity, the quality and the normalization of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1346 concept detectors trained on publicly available annotations, two large arbitrary web video datasets and a common event recognition pipeline allow us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should (i) contain more than 200 concepts, (ii) be diverse by covering object, action, scene, people, animal and attribute concepts, (iii) include both general and specific concepts, (iv) increase the number of concepts rather than improve the quality of the individual detectors, and (v) contain detectors that are appropriately normalized. We consider the recommendations for recognizing video events by concept vocabularies the most important contribution of the paper, as they provide guidelines for future work. }
    }
  18. Alberto Del Bimbo, K. Selcuk Candan, Yu-Gang Jiang, Jiebo Luo, Tao Mei, Nicu Sebe, Han Tao Shen, Cees G. M. Snoek, and Rong Yan, "Special Section on Socio-Mobile Media Analysis and Retrieval," IEEE Transactions on Multimedia, vol. 16, iss. 3, pp. 586-587, 2014.
    @ARTICLE{DelBimboTMM2014,
      author = {Alberto Del Bimbo and K. Selcuk Candan and Yu-Gang Jiang and Jiebo Luo and Tao Mei and Nicu Sebe and Han Tao Shen and Cees G. M. Snoek and Rong Yan},
      title = {Special Section on Socio-Mobile Media Analysis and Retrieval},
      journal = {{IEEE} Transactions on Multimedia},
      month = {April},
      year = {2014},
      volume = {16},
      number = {3},
      pages = {586--587},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/delbimbo-socio-mobile-tmm.pdf},
      abstract = { }
    }
  19. Lexing Xie, Ayman Shamma, and Cees G. M. Snoek, "Content is Dead…Long Live Content: The New Age of Multimedia–Hard Problems," IEEE Multimedia, vol. 21, iss. 1, pp. 4-8, 2014.
    @ARTICLE{XieMM14,
      author = {Lexing Xie and Ayman Shamma and Cees G. M. Snoek},
      title = {Content is Dead...Long Live Content: The New Age of Multimedia--Hard Problems},
      journal = {{IEEE} Multimedia},
      pages = {4--8},
      month = {January--March},
      year = {2014},
      volume = {21},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/xie-content-is-dead-mm.pdf},
      abstract = { Using the ACM Multimedia 2012 panel on metadata as a jumping-off point, the authors investigate whether content can continue to play a dominant role in multimedia research in the age of social, local, and mobile media. In this article, they propose that the community now must face the challenge of characterizing the level of difficulty of multimedia problems to establish a better understanding of where content analysis needs further improvement. They also suggest a classification method that defines problem complexity in the context of human-assisted computation. }
    }
  20. Gregory K. Myers, Ramesh Nallapati, Julien van Hout, Stephanie Pancoast, Ram Nevatia, Chen Sun, Amirhossein Habibian, Dennis C. Koelma, Koen E. A. van de Sande, Arnold W. M. Smeulders, and Cees G. M. Snoek, "Evaluating Multimedia Features and Fusion for Example-based Event Detection," Machine Vision and Applications, vol. 25, iss. 1, pp. 17-32, 2014.
    @ARTICLE{MyersMVA14,
      author = {Gregory K. Myers and Ramesh Nallapati and Julien {van Hout} and Stephanie Pancoast and Ram Nevatia and Chen Sun and Amirhossein Habibian and Dennis C. Koelma and Koen E. A. van de Sande and Arnold W. M. Smeulders and Cees G. M. Snoek},
      title = {Evaluating Multimedia Features and Fusion for Example-based Event Detection},
      journal = {Machine Vision and Applications},
      pages = {17--32},
      month = {January},
      year = {2014},
      volume = {25},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/myers-features-fusion-events-mva.pdf},
      abstract = { Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME's performance in the 2012 TRECVID MED evaluation was one of the best reported. }
    }
  21. Yi Yang, Nicu Sebe, Cees G. M. Snoek, Xian-Sheng Hua, and Yueting Zhuang, "Special Section on Learning from Multiple Evidences for Large Scale Multimedia Analysis," Computer Vision and Image Understanding, vol. 118, p. 1, 2014.
    @ARTICLE{YangCVIU14,
      author = {Yi Yang and Nicu Sebe and Cees G. M. Snoek and Xian-Sheng Hua and Yueting Zhuang},
      title = {Special Section on Learning from Multiple Evidences for Large Scale Multimedia Analysis},
      journal = {Computer Vision and Image Understanding},
      pages = {1},
      month = {January},
      year = {2014},
      volume = {118},
      number = {},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/yang-learning-multimedia-evidence-cviu-si.pdf}
    }
  22. Xirong Li, Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders, "Bootstrapping Visual Categorization with Relevant Negatives," IEEE Transactions on Multimedia, vol. 15, iss. 4, pp. 933-945, 2013.
    @ARTICLE{LiTMM13,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {Bootstrapping Visual Categorization with Relevant Negatives},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {933--945},
      month = {June},
      year = {2013},
      volume = {15},
      number = {4},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-negative-tmm.pdf},
      abstract = { Learning classifiers for many visual concepts is important for image categorization and retrieval. As a classifier tends to misclassify negative examples which are visually similar to positive ones, inclusion of such misclassified and thus relevant negatives should be stressed during learning. User-tagged images are abundant online, but which images are the relevant negatives remains unclear. Sampling negatives at random is the de facto standard in the literature. In this paper, we go beyond random sampling by proposing Negative Bootstrap. Given a visual concept and a few positive examples, the new algorithm iteratively finds relevant negatives. Per iteration, we learn from a small proportion of many user-tagged images, yielding an ensemble of meta classifiers. For efficient classification, we introduce Model Compression such that the classification time is independent of the ensemble size. Compared with the state of the art, we obtain relative gains of 14\% and 18\% on two present-day benchmarks in terms of mean average precision. For concept search in one million images, model compression reduces the search time from over 20 h to approximately 6 min. The effectiveness and efficiency, without the need of manually labeling any negatives, make negative bootstrap appealing for learning better visual concept classifiers. }
    }
  23. Bouke Huurnink, Cees G. M. Snoek, Maarten de Rijke, and Arnold W. M. Smeulders, "Content-Based Analysis Improves Audiovisual Archive Retrieval," IEEE Transactions on Multimedia, vol. 14, iss. 4, pp. 1166-1178, 2012.
    @ARTICLE{HuurninkTMM12,
      author = {Bouke Huurnink and Cees G. M. Snoek and Maarten {de Rijke} and Arnold W. M. Smeulders},
      title = {Content-Based Analysis Improves Audiovisual Archive Retrieval},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {1166--1178},
      month = {August},
      year = {2012},
      volume = {14},
      number = {4},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/huurnink-archive-tmm.pdf},
      data = {http://ilps.science.uva.nl/sites/default/files/audiovisual_archive_resources_v2.0_0.zip},
      abstract = { Content-based video retrieval is maturing to the point where it can be used in real-world retrieval practices. One such practice is the audiovisual archive, whose users increasingly require fine-grained access to broadcast television content. In this paper, we take into account the information needs and retrieval data already present in the audiovisual archive, and demonstrate that retrieval performance can be significantly improved when content-based methods are applied to search. To the best of our knowledge, this is the first time that the practice of an audiovisual archive has been taken into account for quantitative retrieval evaluation. To arrive at our main result, we propose an evaluation methodology tailored to the specific needs and circumstances of the audiovisual archive, which are typically missed by existing evaluation initiatives. We utilize logged searches, content purchases, session information, and simulators to create realistic query sets and relevance judgments. To reflect the retrieval practice of both the archive and the video retrieval community as closely as possible, our experiments with three video search engines incorporate archive-created catalog entries as well as state-of-the-art multimedia content analysis results. A detailed query-level analysis indicates that individual content-based retrieval methods such as transcript-based retrieval and concept-based retrieval yield approximately equal performance gains. When combined, we find that content-based video retrieval incorporated into the archive's practice results in significant performance increases for shot retrieval and for retrieving entire television programs. The time has come for audiovisual archives to start accommodating content-based video retrieval methods into their daily practice. }
    }
  24. Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Harvesting Social Images for Bi-Concept Search," IEEE Transactions on Multimedia, vol. 14, iss. 4, pp. 1091-1104, 2012.
    @ARTICLE{LiTMM12,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},
      title = {Harvesting Social Images for Bi-Concept Search},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {1091--1104},
      month = {August},
      year = {2012},
      volume = {14},
      number = {4},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-biconcept-tmm.pdf},
      abstract = { Searching for the co-occurrence of two visual concepts in unlabeled images is an important step towards answering complex user queries. Traditional visual search methods use combinations of the confidence scores of individual concept detectors to tackle such queries. In this paper we introduce the notion of bi-concepts, a new concept-based retrieval method that is directly learned from social-tagged images. As the number of potential bi-concepts is gigantic, manually collecting training examples is infeasible. Instead, we propose a multimedia framework to collect de-noised positive as well as informative negative training examples from the social web, to learn bi-concept detectors from these examples, and to apply them in a search engine for retrieving bi-concepts in unlabeled images. We study the behavior of our bi-concept search engine using 1.2M social-tagged images as a data source. Our experiments indicate that harvesting examples for bi-concepts differs from traditional single-concept methods, yet the examples can be collected with high accuracy using a multi-modal approach. We find that directly learning bi-concepts is better than oracle linear fusion of single-concept detectors, with a relative improvement of 100\%. This study reveals the potential of learning high-order semantics from social images, for free, suggesting promising new lines of research. }
    }
  25. Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders, "Visual Synonyms for Landmark Image Retrieval," Computer Vision and Image Understanding, vol. 116, iss. 2, pp. 238-249, 2012.
    @ARTICLE{GavvesCVIU12,
      author = {Efstratios Gavves and Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Visual Synonyms for Landmark Image Retrieval},
      journal = {Computer Vision and Image Understanding},
      pages = {238--249},
      month = {February},
      year = {2012},
      volume = {116},
      number = {2},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-synonyms-cviu.pdf},
      abstract = { In this paper, we address the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which assigns words based on closeness in descriptor space, we focus on identifying pairs of independent, distant words -- the visual synonyms -- that are likely to host image patches of similar visual reality. We focus on landmark images, where the image geometry guides the detection of synonym pairs. Image geometry is used to find those image features that lie in the nearly identical physical location, yet are assigned to different words of the visual vocabulary. Defined in this way, we evaluate the validity of visual synonyms. We also examine the closeness of synonyms in the L2-normalized feature space. We show that visual synonyms may successfully be used for vocabulary reduction. Furthermore, we show that combining the reduced visual vocabularies with synonym augmentation, we perform on par with the state-of-the-art bag-of-words approach, while having a 98\% smaller vocabulary. }
    }
  26. Jeroen Steggink and Cees G. M. Snoek, "Adding Semantics to Image-Region Annotations with the Name-It-Game," Multimedia Systems, vol. 17, iss. 5, pp. 367-378, 2011.
    @ARTICLE{StegginkMS11,
      author = {Jeroen Steggink and Cees G. M. Snoek},
      title = {Adding Semantics to Image-Region Annotations with the Name-It-Game},
      journal = {Multimedia Systems},
      pages = {367--378},
      month = {October},
      year = {2011},
      volume = {17},
      number = {5},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/steggink-name-it-game-mmsys.pdf},
      abstract = { In this paper we present the Name-It-Game, an interactive multimedia game fostering the swift creation of a large data set of region-based image annotations. Compared to existing annotation games, we consider an added semantic structure, by means of the WordNet ontology, the main innovation of the Name-It-Game. Using an ontology-powered game, instead of the more traditional annotation tools, potentially makes region-based image labeling more fun and accessible for every type of user. However, the current games often present the players with hard-to-guess objects. To prevent this from happening in the Name-It-Game, we successfully identify WordNet categories which filter out hard-to-guess objects. To verify the speed of the annotation process, we compare the online Name-It-Game with a desktop tool with similar features. Results show that the Name-It-Game outperforms this tool for semantic region-based image labeling. Lastly, we measure the accuracy of the produced segmentations and compare them with carefully created LabelMe segmentations. Judging from the quantitative and qualitative results, we believe the segmentations are competitive to those of LabelMe, especially when averaged over multiple games. By adding semantics to region-based image annotations, using the Name-It-Game, we have opened up an efficient means to provide precious labels in a playful manner. }
    }
  27. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Empowering Visual Categorization with the GPU," IEEE Transactions on Multimedia, vol. 13, iss. 1, pp. 60-70, 2011.
    @ARTICLE{SandeTMM11,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {Empowering Visual Categorization with the {GPU}},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {60--70},
      month = {February},
      year = {2011},
      volume = {13},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-categorization-gpu-tmm.pdf},
      software = {http://www.colordescriptors.com},
      abstract = { Visual categorization is important to manage large collections of digital images and video, where textual meta-data is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. When optimizing a system based on the bag-of-words approach, the goal is to minimize the time it takes to process batches of images. Additionally, we also consider power usage as an evaluation metric. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to (1) keep categorization accuracy intact, (2) decompose the problem and (3) give the same numerical results. In the experiments on large scale datasets it is shown that, by using a parallel implementation on the Geforce GTX260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29\%. }
    }
  28. Cees G. M. Snoek and Malcolm Slaney, "Academia Meets Industry at the Multimedia Grand Challenge," IEEE Multimedia, vol. 18, iss. 1, pp. 4-7, 2011.
    @ARTICLE{SnoekMM11,
      author = {Cees G. M. Snoek and Malcolm Slaney},
      title = {Academia Meets Industry at the Multimedia Grand Challenge},
      journal = {{IEEE} Multimedia},
      pages = {4--7},
      month = {January--March},
      year = {2011},
      volume = {18},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-grandchallenge-mm.pdf},
      abstract = { This column is about last year's ACM Multimedia Grand Challenge in Florence, Italy, an event that endeavors to connect (academic) researchers more effectively with the realities of the business world. The authors describe the 10 challenges and present the three winning applications. }
    }
  29. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek, "Evaluating Color Descriptors for Object and Scene Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, iss. 9, pp. 1582-1596, 2010.
    @ARTICLE{SandePAMI10,
      author = {Koen E. A. van de Sande and Theo Gevers and Cees G. M. Snoek},
      title = {Evaluating Color Descriptors for Object and Scene Recognition},
      journal = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
      pages = {1582--1596},
      month = {September},
      year = {2010},
      volume = {32},
      number = {9},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-colordescriptors-pami.pdf},
      software = {http://www.colordescriptors.com},
      abstract = { Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a dataset with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results reveal further that, for light intensity changes, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the dataset and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8\% on the PASCAL VOC 2007 and by 7\% on the MediaMill Challenge. }
    }
  30. Daragh Byrne, Aiden R. Doherty, Cees G. M. Snoek, Gareth J. F. Jones, and Alan F. Smeaton, "Everyday Concept Detection in Visual Lifelogs: Validation, Relationships and Trends," Multimedia Tools and Applications, vol. 49, iss. 1, pp. 119-144, 2010.
    @ARTICLE{ByrneMMTA10,
      author = {Daragh Byrne and Aiden R. Doherty and Cees G. M. Snoek and Gareth J. F. Jones and Alan F. Smeaton},
      title = {Everyday Concept Detection in Visual Lifelogs: Validation, Relationships and Trends},
      journal = {Multimedia Tools and Applications},
      pages = {119--144},
      month = {August},
      year = {2010},
      volume = {49},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/byrne-everyday-concept-detection-mmta.pdf},
      abstract = { The Microsoft SenseCam is a small lightweight wearable camera used to passively capture photos and other sensor readings from a user's day-to-day activities. It captures on average 3,000 images in a typical day, equating to almost 1 million images per year. It can be used to aid memory by creating a personal multimedia lifelog, or visual recording of the wearer's life. However, the sheer volume of image data captured within a visual lifelog creates a number of challenges, particularly for locating relevant content. Within this work, we explore the applicability of semantic concept detection, a method often used within video retrieval, on the domain of visual lifelogs. Our concept detector models the correspondence between low-level visual features and high-level semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised machine learning. By doing so it determines the probability of a concept's presence. We apply detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users. The results were evaluated on a subset of 95,907 images, to determine the accuracy for detection of each semantic concept. We conducted further analysis on the temporal consistency, co-occurrence and relationships within the detected concepts to more extensively investigate the robustness of the detectors within this domain. }
    }
  31. Cees G. M. Snoek and Arnold W. M. Smeulders, "Visual-Concept Search Solved?," IEEE Computer, vol. 43, iss. 6, pp. 76-78, 2010.
    @ARTICLE{SnoekCOM10,
      author = {Cees G. M. Snoek and Arnold W. M. Smeulders},
      title = {Visual-Concept Search Solved?},
      journal = {{IEEE} Computer},
      pages = {76--78},
      month = {June},
      year = {2010},
      volume = {43},
      number = {6},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-smeulders-solved-computer.pdf},
      data = {http://www.mediamill.nl/progress/},
      abstract = { Progress in visual-concept search suggests that machine understanding of images is within reach. }
    }
  32. Jan C. van Gemert, Cees G. M. Snoek, Cor J. Veenman, Arnold W. M. Smeulders, and Jan-Mark Geusebroek, "Comparing Compact Codebooks for Visual Categorization," Computer Vision and Image Understanding, vol. 114, iss. 4, pp. 450-462, 2010.
    @ARTICLE{GemertCVIU10,
      author = {Jan C. van Gemert and Cees G. M. Snoek and Cor J. Veenman and Arnold W. M. Smeulders and Jan-Mark Geusebroek},
      title = {Comparing Compact Codebooks for Visual Categorization},
      journal = {Computer Vision and Image Understanding},
      pages = {450--462},
      month = {April},
      year = {2010},
      volume = {114},
      number = {4},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-compact-codebooks-cviu.pdf},
      abstract = { In the face of current large-scale video libraries, the practical applicability of content-based indexing algorithms is constrained by their efficiency. This paper strives for efficient large-scale video indexing by comparing various visual-based concept categorization techniques. In visual categorization, the popular codebook model has shown excellent categorization performance. The codebook model represents continuous visual features by discrete prototypes predefined in a vocabulary. The vocabulary size has a major impact on categorization efficiency, where a more compact vocabulary is more efficient. However, smaller vocabularies typically score lower on classification performance than larger vocabularies. This paper compares four approaches to achieve a compact codebook vocabulary while retaining categorization performance. For these four methods, we investigate the trade-off between codebook compactness and categorization performance. We evaluate the methods on more than 200 h of challenging video data with as many as 101 semantic concepts. The results allow us to create a taxonomy of the four methods based on their efficiency and categorization performance. }
    }
  33. Xirong Li, Cees G. M. Snoek, and Marcel Worring, "Learning Social Tag Relevance by Neighbor Voting," IEEE Transactions on Multimedia, vol. 11, iss. 7, pp. 1310-1322, 2009.
    IEEE Transactions on Multimedia Prize Paper Award 2012
    @ARTICLE{LiTMM09,
      author = {Xirong Li and Cees G. M. Snoek and Marcel Worring},
      title = {Learning Social Tag Relevance by Neighbor Voting},
      journal = {{IEEE} Transactions on Multimedia},
      pages = {1310--1322},
      month = {November},
      year = {2009},
      volume = {11},
      number = {7},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-socialtagrelevance-tmm.pdf},
      data = {http://staff.science.uva.nl/~xirong/tagrel/},
      note = {IEEE Transactions on Multimedia Prize Paper Award 2012},
      abstract = { Social image analysis and retrieval is important for helping people organize and access the increasing amount of user-tagged multimedia. Since user tagging is known to be uncontrolled, ambiguous, and overly personalized, a fundamental problem is how to interpret the relevance of a user-contributed tag with respect to the visual content the tag is describing. Intuitively, if different persons label visually similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Starting from this intuition, we propose in this paper a neighbor voting algorithm which accurately and efficiently learns tag relevance by accumulating votes from visual neighbors. Under a set of well defined and realistic assumptions, we prove that our algorithm is a good tag relevance measurement for both image ranking and tag ranking. Three experiments on 3.5 million Flickr photos demonstrate the general applicability of our algorithm in both social image retrieval and image tag suggestion. Our tag relevance learning algorithm substantially improves upon baselines for all the experiments. The results suggest that the proposed algorithm is promising for real-world applications. }
    }
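    The neighbor-voting idea summarized above is concrete enough to sketch: a tag attached to an image becomes more relevant the more often it also appears on the image's visual neighbors, corrected by the tag's overall frequency in the collection. This is a simplified, assumed reading of the algorithm; the distance metric, data layout, and prior correction below are illustrative, not the paper's implementation.

      import numpy as np

      def tag_relevance_by_neighbor_voting(query_feature, query_tags, features, tag_sets, k=100):
          """features: (n, d) visual features of the collection; tag_sets: list of n tag sets."""
          # k nearest visual neighbors of the query image (plain Euclidean distance here).
          neighbors = np.argsort(np.linalg.norm(features - query_feature, axis=1))[:k]
          n = len(tag_sets)
          relevance = {}
          for tag in query_tags:
              votes = sum(1 for i in neighbors if tag in tag_sets[i])
              prior = sum(1 for tags in tag_sets if tag in tags) / n   # collection-wide tag frequency
              relevance[tag] = votes - k * prior                       # votes beyond chance level
          return relevance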
  34. Cees G. M. Snoek and Marcel Worring, "Concept-Based Video Retrieval," Foundations and Trends in Information Retrieval, vol. 4, iss. 2, pp. 215-322, 2009.
    @ARTICLE{SnoekFNTIR09,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Concept-Based Video Retrieval},
      journal = {Foundations and Trends in Information Retrieval},
      pages = {215--322},
      month = {},
      year = {2009},
      volume = {4},
      number = {2},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-concept-based-video-retrieval-fntir.pdf},
      abstract = { In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives which are in majority concept-based. Therefore, central to our discussion is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human-computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities. }
    }
  35. Cees G. M. Snoek, Marcel Worring, Ork de Rooij, Koen E. A. van de Sande, Rong Yan, and Alexander G. Hauptmann, "VideOlympics: Real-Time Evaluation of Multimedia Retrieval Systems," IEEE Multimedia, vol. 15, iss. 1, pp. 86-91, 2008.
    @ARTICLE{SnoekMM08,
      author = {Cees G. M. Snoek and Marcel Worring and Ork de Rooij and Koen E. A. {van de Sande} and Rong Yan and Alexander G. Hauptmann},
      title = {{VideOlympics}: Real-Time Evaluation of Multimedia Retrieval Systems},
      journal = {{IEEE} Multimedia},
      pages = {86--91},
      month = {January--March},
      year = {2008},
      volume = {15},
      number = {1},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-videolympics-mm.pdf},
      demo = {http://www.ceessnoek.info/index.php/demonstrations/videolympics/},
      abstract = { Video search is an experience for the senses. As a result, traditional information retrieval metrics can't fully measure the quality of a video search system. To provide a more interactive assessment of today's video search engines, the authors have organized the VideOlympics as a real-time evaluation showcase where systems compete to answer specific video searches in front of a live audience. At VideOlympics, seeing and hearing is believing. }
    }
  36. Frank J. Seinstra, Jan-Mark Geusebroek, Dennis Koelma, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "High-Performance Distributed Image and Video Content Analysis with Parallel-Horus," IEEE Multimedia, vol. 14, iss. 4, pp. 64-75, 2007.
    @ARTICLE{SeinstraMM07,
      author = {Frank J. Seinstra and Jan-Mark Geusebroek and Dennis Koelma and Cees G. M. Snoek and Marcel Worring and Arnold W. M. Smeulders},
      title = {High-Performance Distributed Image and Video Content Analysis with Parallel-Horus},
      journal = {{IEEE} Multimedia},
      pages = {64--75},
      month = {October--December},
      year = {2007},
      volume = {14},
      number = {4},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/seinstra-parallel-horus-mm.pdf},
      demo = {http://staff.science.uva.nl/~fjseins/isis/AiboDemo/AiboVideo.html},
      abstract = { As the world uses more digital video that requires greater storage space, Grid computing is becoming indispensable for urgent problems in multimedia content analysis. Parallel-Horus, a support tool for applications in multimedia Grid computing, lets users implement multimedia applications as sequential programs for efficient execution on clusters and Grids, based on wide-area multimedia services. }
    }
  37. Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, and Marcel Worring, "Adding Semantics to Detectors for Video Retrieval," IEEE Transactions on Multimedia, vol. 9, iss. 5, pp. 975-986, 2007.
    @ARTICLE{SnoekTMM07b,
      author = {Cees G. M. Snoek and Bouke Huurnink and Laura Hollink and Maarten de Rijke and Guus Schreiber and Marcel Worring},
      title = {Adding Semantics to Detectors for Video Retrieval},
      journal = {{IEEE} Transactions on Multimedia},
      month = {August},
      year = {2007},
      volume = {9},
      number = {5},
      pages = {975--986},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-semantics2detectors-tmm.pdf},
      abstract = { In this paper, we propose an automatic video retrieval method based on high-level concept detectors. Research in video analysis has reached the point where over 100 concept detectors can be learned in a generic fashion, albeit with mixed performance. Such a set of detectors is very small still compared to ontologies aiming to capture the full vocabulary a user has. We aim to throw a bridge between the two fields by building a multimedia thesaurus, i.e., a set of machine learned concept detectors that is enriched with semantic descriptions and semantic structure obtained from WordNet. Given a multimodal user query, we identify three strategies to select a relevant detector from this thesaurus, namely: text matching, ontology querying, and semantic visual querying. We evaluate the methods against the automatic search task of the TRECVID 2005 video retrieval benchmark, using a news video archive of 85 h in combination with a thesaurus of 363 machine learned concept detectors. We assess the influence of thesaurus size on video search performance, evaluate and compare the multimodal selection strategies for concept detectors, and finally discuss their combined potential using oracle fusion. The set of queries in the TRECVID 2005 corpus is too small for us to be definite in our conclusions, but the results suggest promising new lines of research. }
    }
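    Of the three detector-selection strategies named in the abstract above, text matching is the simplest to illustrate: score every detector in the thesaurus by word overlap between the query and the detector's semantic description. The sketch below is a hypothetical stand-in, not the paper's method; ontology querying and semantic visual querying are not shown.

      def select_detectors_by_text(query, thesaurus):
          """thesaurus: dict mapping detector name -> textual description (assumed format)."""
          query_words = set(query.lower().split())
          scores = {}
          for name, description in thesaurus.items():
              overlap = query_words & set(description.lower().split())
              if overlap:
                  scores[name] = len(overlap) / len(query_words)
          return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

      # Toy usage with a hypothetical two-detector thesaurus.
      print(select_detectors_by_text(
          "people walking on the street",
          {"pedestrian": "a person walking outdoors on a street or sidewalk",
           "studio anchor": "a news anchor person seated in a television studio"}))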
  38. Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders, "A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval," IEEE Transactions on Multimedia, vol. 9, iss. 2, pp. 280-292, 2007.
    @ARTICLE{SnoekTMM07,
      author = {Cees G. M. Snoek and Marcel Worring and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval},
      journal = {{IEEE} Transactions on Multimedia},
      month = {February},
      year = {2007},
      volume = {9},
      number = {2},
      pages = {280--292},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-lexicon-tmm.pdf},
      demo = {http://www.ceessnoek.info/index.php/demonstrations/mediamill/},
      abstract = { Effective video retrieval is the result of an interplay between interactive query selection, advanced visualization of results, and a goal-oriented human user. Traditional interactive video retrieval approaches emphasize paradigms, such as query-by-keyword and query-by-example, to aid the user in the search for relevant footage. However, recent results in automatic indexing indicate that query-by-concept is becoming a viable resource for interactive retrieval also. We propose in this paper a new video retrieval paradigm. The core of the paradigm is formed by first detecting a large lexicon of semantic concepts. From there, we combine query-by-concept, query-by-example, query-by-keyword, and user interaction into the \emph{MediaMill} semantic video search engine. To measure the impact of increasing lexicon size on interactive video retrieval performance, we performed two experiments against the 2004 and 2005 NIST TRECVID benchmarks, using lexicons containing 32 and 101 concepts respectively. The results suggest that from all factors that play a role in interactive retrieval, a large lexicon of semantic concepts matters most. Indeed, by exploiting large lexicons, many video search questions are solvable without using query-by-keyword and query-by-example. What is more, we show that the lexicon-driven search engine outperforms all state-of-the-art video retrieval systems in both TRECVID 2004 and 2005. }
    }
  39. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, and Arnold W. M. Smeulders, "The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, iss. 10, pp. 1678-1689, 2006.
    @ARTICLE{SnoekPAMI06,
      author = {Cees G. M. Snoek and Marcel Worring and Jan-Mark Geusebroek and Dennis C. Koelma and Frank J. Seinstra and Arnold W. M. Smeulders},
      title = {The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing},
      journal = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
      month = {October},
      year = {2006},
      volume = {28},
      number = {10},
      pages = {1678--1689},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-pathfinder-pami.pdf},
      data = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-pathfinder-pami-groundtruth.zip},
      demo = {http://www.ceessnoek.info/index.php/demonstrations/semantic-pathfinder/},
      abstract = { This paper presents the semantic pathfinder architecture for generic indexing of multimedia archives. The semantic pathfinder extracts semantic concepts from video by exploring different paths through three consecutive analysis steps, which we derive from the observation that produced video is the result of an authoring-driven process. We exploit this \emph{authoring metaphor} for machine-driven understanding. The pathfinder starts with the content analysis step. In this analysis step, we follow a data-driven approach of indexing semantics. The style analysis step is the second analysis step. Here we tackle the indexing problem by viewing a video from the perspective of production. Finally, in the context analysis step, we view semantics in context. The virtue of the semantic pathfinder is its ability to learn the best path of analysis steps on a per-concept basis. To show the generality of this novel indexing approach we develop detectors for a lexicon of 32 concepts and we evaluate the semantic pathfinder against the 2004 NIST TRECVID video retrieval benchmark, using a news archive of 64 hours. Top ranking performance in the semantic concept detection task indicates the merit of the semantic pathfinder for generic indexing of multimedia archives. }
    }
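    The pathfinder's central mechanism, choosing the best analysis path on a per-concept basis, reduces to a per-concept argmax over held-out validation scores. The sketch below assumes hypothetical concepts, path names, and numbers purely for illustration.

      def best_path_per_concept(validation_scores):
          """validation_scores: concept -> {path name: validation score, e.g. average precision}."""
          return {concept: max(paths, key=paths.get) for concept, paths in validation_scores.items()}

      print(best_path_per_concept({
          "anchor":   {"content": 0.61, "style": 0.74, "context": 0.70},
          "aircraft": {"content": 0.43, "style": 0.39, "context": 0.41},
      }))  # -> {'anchor': 'style', 'aircraft': 'content'}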
  40. Cees G. M. Snoek, Marcel Worring, and Alexander G. Hauptmann, "Learning Rich Semantics from News Video Archives by Style Analysis," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 2, iss. 2, pp. 91-108, 2006.
    @ARTICLE{SnoekTOMCCAP06,
      author = {Cees G. M. Snoek and Marcel Worring and Alexander G. Hauptmann},
      title = {Learning Rich Semantics from News Video Archives by Style Analysis},
      journal = {{ACM} Transactions on Multimedia Computing, Communications and Applications},
      month = {May},
      year = {2006},
      volume = {2},
      number = {2},
      pages = {91--108},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-style-tomccap.pdf},
      abstract = { We propose a generic and robust framework for news video indexing, which we found on a broadcast news production model. We identify within this model four production phases, each providing useful metadata for annotation. In contrast to semi-automatic indexing approaches, which exploit this information at production time, we adhere to an automatic data-driven approach. To that end, we analyze a digital news video using a separate set of multimodal detectors for each production phase. By combining the resulting production-derived features into a statistical classifier ensemble, the framework facilitates robust classification of several rich semantic concepts in news video; rich meaning that concepts share many similarities in their production process. Experiments on an archive of 120 hours of news video, from the 2003 TRECVID benchmark, show that a combined analysis of production phases yields the best results. In addition, we demonstrate that the accuracy of the proposed style analysis framework for classification of several rich semantic concepts is state-of-the-art. }
    }
  41. Cees G. M. Snoek and Marcel Worring, "Multimedia Event-Based Video Indexing using Time Intervals," IEEE Transactions on Multimedia, vol. 7, iss. 4, pp. 638-647, 2005.
    @ARTICLE{SnoekTMM05,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Multimedia Event-Based Video Indexing using Time Intervals},
      journal = {{IEEE} Transactions on Multimedia},
      month = {August},
      year = {2005},
      volume = {7},
      number = {4},
      pages = {638--647},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-time-mm.pdf},
      demo = {http://isis-data.science.uva.nl/cgmsnoek/goalgle/},
      abstract = { We propose the Time Interval Multimedia Event (TIME) framework as a robust approach for classification of semantic events in multimodal video documents. The representation used in TIME extends the Allen time relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimodal video analysis. To demonstrate the viability of our approach, it was evaluated on the domains of soccer and news broadcasts. For automatic classification of semantic events, we compare three different machine learning techniques, i.c. C4.5 decision tree, Maximum Entropy, and Support Vector Machine. The results show that semantic video indexing results significantly benefit from using the TIME framework. }
    }
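    The TIME representation above builds on Allen's relations between time intervals of different modalities. As a point of reference only, the sketch below classifies two closed intervals into the seven basic Allen relations (inverses omitted); it is background illustration, not part of the framework itself.

      def allen_relation(a, b):
          """Return the basic Allen relation of interval a=(a1, a2) with respect to b=(b1, b2)."""
          (a1, a2), (b1, b2) = a, b
          if a2 < b1:                 return "before"
          if a2 == b1:                return "meets"
          if a1 == b1 and a2 == b2:   return "equals"
          if a1 == b1 and a2 < b2:    return "starts"
          if a1 > b1 and a2 == b2:    return "finishes"
          if b1 < a1 and a2 < b2:     return "during"
          if a1 < b1 < a2 < b2:       return "overlaps"
          return "inverse of a basic relation"

      print(allen_relation((0, 5), (5, 9)))    # meets
      print(allen_relation((2, 4), (1, 8)))    # during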
  42. Cees G. M. Snoek and Marcel Worring, "Multimodal Video Indexing: A Review of the State-of-the-art," Multimedia Tools and Applications, vol. 25, iss. 1, pp. 5-35, 2005.
    @ARTICLE{SnoekMMTA05,
      author = {Cees G. M. Snoek and Marcel Worring},
      title = {Multimodal Video Indexing: A Review of the State-of-the-art},
      journal = {Multimedia Tools and Applications},
      month = {January},
      year = {2005},
      volume = {25},
      number = {1},
      pages = {5--35},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/snoek-review-mmta.pdf},
      abstract = { Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is unfeasible for large video collections. In this paper we survey several methods aiming at automating this time and resource consuming process. Good reviews on single modality based video indexing have appeared in literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in collaborative fashion. Therefore, instead of separately treating the different information sources involved, and their specific algorithms, we focus on the similarities and differences between the modalities. To that end we put forward a unifying and multimodal framework, which views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types, for which automatic methods are found in literature. It furthermore forms the basis for categorizing these different methods. }
    }
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
