Publications

Benchmarks

  1. Cees G. M. Snoek, Jianfeng Dong, Xirong Li, Xiaoxu Wang, Qijie Wei, Weiyu Lan, Efstratios Gavves, Noureldien Hussein, Dennis C. Koelma, and Arnold W. M. Smeulders, "University of Amsterdam and Renmin University at TRECVID 2016: Searching Video, Detecting Events and Describing Video," in Proceedings of the 14th TRECVID Workshop, Gaithersburg, USA, 2016.
    @INPROCEEDINGS{SnoekTRECVID16,
      author = {Cees G. M. Snoek and Jianfeng Dong and Xirong Li and Xiaoxu Wang and Qijie Wei and Weiyu Lan and Efstratios Gavves and Noureldien Hussein and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {University of Amsterdam and Renmin University at {TRECVID} 2016: Searching Video, Detecting Events and Describing Video},
      booktitle = {Proceedings of the 14th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2016},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2016-final.pdf},
      abstract = { In this paper we summarize our TRECVID 2016 video recognition experiments. We participated in three tasks: video search, event detection and video description. Here we describe the tasks on event detection and video description. For event detection we explore semantic representations based on VideoStory and an ImageNet Shuffle for both zero-shot and few-example regimes. For the showcase task on video description we experiment with a deep network that predicts a visual representation from a natural language description, and use this space for the sentence matching. For generative description we enhance a neural image captioning model with Early Embedding and Late Reranking. The 2016 edition of the TRECVID benchmark has been a fruitful participation for our joint-team, resulting in the best overall result for zero- and few-example event detection as well as video description by matching and in generative mode. }
    }
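    A note on the sentence-matching run above: the learned network maps a natural-language description into the visual feature space, and videos are then ranked by similarity in that space. As a hedged illustration only, the Python sketch below uses a hypothetical predict_visual_features() stand-in for the learned text-to-visual network and random placeholder video features; it is not the authors' implementation.
      import numpy as np

      def predict_visual_features(sentence: str) -> np.ndarray:
          # Placeholder for the learned text-to-visual-feature network
          # (hypothetical; the paper trains this mapping, we just fake it).
          rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
          return rng.standard_normal(512)

      def rank_videos(sentence: str, video_feats: np.ndarray) -> np.ndarray:
          # Cosine similarity between the predicted vector and each video feature.
          q = predict_visual_features(sentence)
          q = q / np.linalg.norm(q)
          v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
          return np.argsort(-(v @ q))  # video indices, best match first

      video_feats = np.random.default_rng(0).standard_normal((100, 512))
      print(rank_videos("a dog catches a frisbee", video_feats)[:5])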
  2. Cees G. M. Snoek, Spencer Cappallo, Daniel Fontijne, David Julian, Dennis C. Koelma, Pascal Mettes, Koen E. A. van de Sande, Anthony Sarah, Harro Stokman, and R. Blythe Towal, "Qualcomm Research and University of Amsterdam at TRECVID 2015: Recognizing Concepts, Objects, and Events in Video," in Proceedings of the 13th TRECVID Workshop, Gaithersburg, USA, 2015.
    @INPROCEEDINGS{SnoekTRECVID15,
      author = {Cees G. M. Snoek and Spencer Cappallo and Daniel Fontijne and David Julian and Dennis C. Koelma and Pascal Mettes and Koen E. A. van de Sande and Anthony Sarah and Harro Stokman and R. {Blythe Towal}},
      title = {Qualcomm Research and University of Amsterdam at {TRECVID} 2015: Recognizing Concepts, Objects, and Events in Video},
      booktitle = {Proceedings of the 13th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2015},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2015-final.pdf},
      abstract = { In this paper we summarize our TRECVID 2015 video recognition experiments. We participated in three tasks: concept detection, object localization, and event recognition, where Qualcomm Research focused on concept detection and object localization and the University of Amsterdam focused on event detection. For concept detection we start from the very deep networks that excelled in the ImageNet 2014 competition and redesign them for the purpose of video recognition, emphasizing training data augmentation as well as video fine-tuning. Our entry in the localization task is based on classifying a limited number of boxes in each frame using deep learning features. The boxes are proposed by an improved version of selective search. At the core of our multimedia event detection system is an Inception-style deep convolutional neural network that is trained on the full ImageNet hierarchy with 22k categories. We propose several operations that combine and generalize the ImageNet categories to form a desirable set of (super-)categories, while still being able to train a reliable model. The 2015 edition of the TRECVID benchmark has been a fruitful participation for our team, resulting in the best overall result for concept detection, object localization and event detection. }
    }
  3. Mihir Jain, Jan van Gemert, Pascal Mettes, and Cees G. M. Snoek, "University of Amsterdam at THUMOS Challenge 2015," in CVPR THUMOS Challenge 2015, Boston, USA, 2015.
    @INPROCEEDINGS{JainTHUMOS15,
      author = {Mihir Jain and Jan van Gemert and Pascal Mettes and Cees G. M. Snoek},
      title = {University of Amsterdam at THUMOS Challenge 2015},
      booktitle = {CVPR THUMOS Challenge 2015},
      pages = {},
      month = {June},
      year = {2015},
      address = {Boston, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-THUMOS2015-final.pdf},
      abstract = { This notebook paper describes our approach for the action classification task of the THUMOS 2015 benchmark challenge. We use two types of representations to capture motion and appearance. For a local motion description we employ HOG, HOF and MBH features, computed along the improved dense trajectories. The motion features are encoded into a fixed-length representation using Fisher vectors. For the appearance features, we employ a pre-trained GoogLeNet convolutional network on video frames. VLAD is used to encode the appearance features into a fixed-length representation. All actions are classified with a one-vs-rest linear SVM. }
    }
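    The classification stage of the THUMOS 2015 entry above is a one-vs-rest linear SVM over fixed-length video encodings (Fisher vectors for motion, VLAD for appearance). A minimal sketch of that stage with scikit-learn, assuming random stand-in encodings; the dimensions and C value are illustrative, not the paper's settings.
      import numpy as np
      from sklearn.svm import LinearSVC

      rng = np.random.default_rng(0)
      n_videos, dim, n_actions = 200, 512, 101  # hypothetical sizes

      X_train = rng.standard_normal((n_videos, dim))  # stand-in FV/VLAD encodings
      y_train = np.arange(n_videos) % n_actions       # stand-in action labels

      # LinearSVC fits one linear SVM per action in one-vs-rest fashion.
      clf = LinearSVC(C=100.0).fit(X_train, y_train)

      X_test = rng.standard_normal((5, dim))
      scores = clf.decision_function(X_test)  # (5, n_actions) margin scores
      print(scores.argmax(axis=1))            # predicted action per test video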
  4. Cees G. M. Snoek, Koen E. A. van de Sande, Daniel Fontijne, Spencer Cappallo, Jan van Gemert, Amirhossein Habibian, Thomas Mensink, Pascal Mettes, Ran Tao, Dennis C. Koelma, and Arnold W. M. Smeulders, "MediaMill at TRECVID 2014: Searching Concepts, Objects, Instances and Events in Video," in Proceedings of the 12th TRECVID Workshop, Orlando, USA, 2014.
    @INPROCEEDINGS{SnoekTRECVID14,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Daniel Fontijne and Spencer Cappallo and Jan van Gemert and Amirhossein Habibian and Thomas Mensink and Pascal Mettes and Ran Tao and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {{MediaMill} at {TRECVID} 2014: Searching Concepts, Objects, Instances and Events in Video},
      booktitle = {Proceedings of the 12th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2014},
      address = {Orlando, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2014-final.pdf},
      abstract = { In this paper we summarize our TRECVID 2014 video retrieval experiments. The MediaMill team participated in five tasks: concept detection, object localization, instance search, event recognition and recounting. We experimented with concept detection using deep learning and color difference coding, object localization using FLAIR, instance search by one example, event recognition based on VideoStory, and event recounting using COSTA. Our experiments focus on establishing the video retrieval value of these innovations. The 2014 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the best result for concept detection and object localization. }
    }
  5. Robert C. Bolles, J. Brian Burns, James A. Herson, Gregory K. Myers, Julien van Hout, Wen Wang, Julie Wong, Eric Yeh, Amirhossein Habibian, Dennis C. Koelma, Thomas Mensink, Arnold W. M. Smeulders, Cees G. M. Snoek, Arnav Aggarwal, Song Cao, Kan Chen, Rama Kovvuri, Ram Nevatia, and Pramod Sharma, "The 2014 SESAME Multimedia Event Detection and Recounting System," in Proceedings of the 12th TRECVID Workshop, Orlando, USA, 2014.
    @INPROCEEDINGS{SesameTRECVID14,
      author = {Robert C. Bolles and J. Brian Burns and James A. Herson and Gregory K. Myers and Julien van Hout and Wen Wang and Julie Wong and Eric Yeh and Amirhossein Habibian and Dennis C. Koelma and Thomas Mensink and Arnold W. M. Smeulders and Cees G. M. Snoek and Arnav Aggarwal and Song Cao and Kan Chen and Rama Kovvuri and Ram Nevatia and Pramod Sharma},
      title = {The 2014 SESAME Multimedia Event Detection and Recounting System},
      booktitle = {Proceedings of the 12th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2014},
      address = {Orlando, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/Sesame-TRECVID2014-final.pdf},
      abstract = { The SESAME (video SEarch with Speed and Accuracy for Multimedia Events) team submitted six runs as a full participant in the Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) evaluations. The SESAME system combines low-level visual, audio, and motion features; high-level semantic concepts for visual objects, scenes, persons, sounds, and actions; automatic speech recognition (ASR); and video optical character recognition (OCR). These three types of features and five types of concepts were used in eight event classifiers. One of the event classifiers, VideoStory, is a new approach that exploits the relationship between semantic concepts and imagery in a large training corpus. The SESAME system uses a total of over 18,000 concepts. We combined the event-detection results for these classifiers using a log-likelihood ratio (LLR) late-fusion method, which uses logistic regression to learn combination weights for event-detection scores from multiple classifiers originating from different data types. The SESAME system generated event recountings based on visual and action concepts, and on concepts recognized by ASR and OCR. Training data included the MED Research dataset, ImageNet, a video dataset from YouTube, the UCF101 and HMDB51 action datasets, the NIST SIN dataset, and Wikipedia. The components that contributed most significantly to event-detection performance were the low- and high-level visual features, low-level motion features, and VideoStory. The LLR late-fusion method significantly improved performance over the best individual classifier for 100Ex and 010Ex. For the Semantic Query (SQ), equal fusion weights, instead of the LLR method, were used in fusion due to the absence of training data. }
    }
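    The log-likelihood ratio (LLR) late fusion mentioned above uses logistic regression to learn weights for combining event-detection scores from multiple classifiers. A hedged sketch of that idea with scikit-learn; the scores and labels are random placeholders, and the paper's score normalization and handling of missing modalities are omitted.
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      n_clips, n_classifiers = 500, 8  # eight event classifiers, as in the abstract

      scores = rng.standard_normal((n_clips, n_classifiers))  # per-classifier scores
      labels = rng.integers(0, 2, n_clips)                    # 1 = event present

      fusion = LogisticRegression().fit(scores, labels)

      # The fused detection score is the log-odds under the learned model,
      # i.e. a log-likelihood ratio with learned combination weights.
      fused = fusion.decision_function(scores)
      print(fused[:5])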
  6. Mihir Jain, Jan van Gemert, and Cees G. M. Snoek, "University of Amsterdam at THUMOS Challenge 2014," in ECCV THUMOS Challenge 2014, Zürich, Switzerland, 2014.
    @INPROCEEDINGS{JainTHUMOS14,
      author = {Mihir Jain and Jan van Gemert and Cees G. M. Snoek},
      title = {University of Amsterdam at THUMOS Challenge 2014},
      booktitle = {ECCV THUMOS Challenge 2014},
      pages = {},
      month = {September},
      year = {2014},
      address = {Z\"urich, Switzerland},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-THUMOS2014-final.pdf},
      abstract = { This notebook paper describes our approach for the action classification task of the THUMOS Challenge 2014. We investigate and exploit the action-object relationship by capturing both motion and related objects. As local descriptors we use HOG, HOF and MBH computed along the improved dense trajectories. For video encoding we rely on Fisher vector. In addition, we employ deep net features learned from object attributes to capture action context. All actions are classified with a one-versus-rest linear SVM. }
    }
  7. Cees G. M. Snoek, Koen E. A. van de Sande, Daniel Fontijne, Amirhossein Habibian, Mihir Jain, Svetlana Kordumova, Zhenyang Li, Masoud Mazloom, Silvia-Laura Pintea, Ran Tao, Dennis C. Koelma, and Arnold W. M. Smeulders, "MediaMill at TRECVID 2013: Searching Concepts, Objects, Instances and Events in Video," in Proceedings of the 11th TRECVID Workshop, Gaithersburg, USA, 2013.
    @INPROCEEDINGS{SnoekTRECVID13,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Daniel Fontijne and Amirhossein Habibian and Mihir Jain and Svetlana Kordumova and Zhenyang Li and Masoud Mazloom and Silvia-Laura Pintea and Ran Tao and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {{MediaMill} at {TRECVID} 2013: Searching Concepts, Objects, Instances and Events in Video},
      booktitle = {Proceedings of the 11th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2013},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2013-final.pdf},
      abstract = { In this paper we summarize our TRECVID 2013 video retrieval experiments. The MediaMill team participated in four tasks: concept detection, object localization, instance search, and event recognition. For all tasks the starting point is our top-performing bag-of-words system of TRECVID 2008-2012, which uses color SIFT descriptors, average and difference coded into codebooks with spatial pyramids and kernel-based machine learning. New this year are concept detection with deep learning, concept detection without annotations, object localization using selective search, instance search by reranking, and event recognition based on concept vocabularies. Our experiments focus on establishing the video retrieval value of the innovations. The 2013 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the best result for concept detection, concept detection without annotation, object localization, concept pair detection, and visual event recognition with few examples. }
    }
  8. Robert C. Bolles, J. Brian Burns, James A. Herson, Gregory K. Myers, Stephanie Pancoast, Julien van Hout, Wen Wang, Julie Wong, Eric Yeh, Amirhossein Habibian, Dennis C. Koelma, Zhenyang Li, Masoud Mazloom, Silvia-Laura Pintea, Arnold W. M. Smeulders, Cees G. M. Snoek, Sung Chun Lee, Ram Nevatia, Pramod Sharma, Chen Sun, and Remi Trichet, "The 2013 SESAME Multimedia Event Detection and Recounting System," in Proceedings of the 11th TRECVID Workshop, Gaithersburg, USA, 2013.
    @INPROCEEDINGS{SesameTRECVID13,
      author = {Robert C. Bolles and J. Brian Burns and James A. Herson and Gregory K. Myers and Stephanie Pancoast and Julien van Hout and Wen Wang and Julie Wong and Eric Yeh and Amirhossein Habibian and Dennis C. Koelma and Zhenyang Li and Masoud Mazloom and Silvia-Laura Pintea and Arnold W. M. Smeulders and Cees G. M. Snoek and Sung Chun Lee and Ram Nevatia and Pramod Sharma and Chen Sun and Remi Trichet},
      title = {The 2013 SESAME Multimedia Event Detection and Recounting System},
      booktitle = {Proceedings of the 11th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2013},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/Sesame-TRECVID2013-final.pdf},
      abstract = { The SESAME team submitted runs as a full participant in the MED13 evaluation, and submitted video, motion, and audio features; high-level semantic concepts for visual objects, scenes, persons, and actions; automatic speech recognition (ASR); and video optical character recognition (OCR). The individual types of features and concepts produced a total of eight event classifiers. We combined the event detection results for these classifiers using arithmetic mean and log-likelihood ratio fusion methods, and developed and applied a method for selecting the detection threshold. The SESAME system generated event recountings by selecting intervals based on the semantic concepts, and on concepts recognized by ASR and OCR. Our major findings are: Our strategy of first selecting the most informative interval for a video, and then determining the most appropriate event-related semantic concepts within that interval to display for multimedia event recounting (MER), produced the best ObsTextScore in the evaluation. (The ObsTextScore measures the judges’ responses to the question “How well does the text of this observation describe the snippet(s)?”.) The multimedia event detection (MED) performance for 100Ex and 10Ex was dominated by the classifiers that exploited visual content. The ASR and OCR classifiers for 0Ex performed better than those trained with 10Ex. The log-likelihood ratio late-fusion method demonstrated improved performance over simple averaging of event detection scores for 100Ex, but not for 10Ex. }
    }
  9. Cees G. M. Snoek, Koen E. A. van de Sande, Amirhossein Habibian, Svetlana Kordumova, Zhenyang Li, Masoud Mazloom, Silvia-Laura Pintea, Ran Tao, Dennis C. Koelma, and Arnold W. M. Smeulders, "The MediaMill TRECVID 2012 Semantic Video Search Engine," in Proceedings of the 10th TRECVID Workshop, Gaithersburg, USA, 2012.
    @INPROCEEDINGS{SnoekTRECVID12,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Amirhossein Habibian and Svetlana Kordumova and Zhenyang Li and Masoud Mazloom and Silvia-Laura Pintea and Ran Tao and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {The {MediaMill} {TRECVID} 2012 Semantic Video Search Engine},
      booktitle = {Proceedings of the 10th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2012},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2012-final.pdf},
      abstract = { In this paper we describe our TRECVID 2012 video retrieval experiments. The MediaMill team participated in four tasks: semantic indexing, multimedia event detection, multimedia event recounting and instance search. The starting point for the MediaMill detection approach is our top-performing bag-of-words system of TRECVID 2008-2011, which uses multiple color SIFT descriptors, averaged and difference coded into codebooks with spatial pyramids, and kernel-based machine learning. This year our concept detection experiments focus on establishing the influence of difference coding, the use of audio features, concept-pair detection using regular concepts, pair detection by spatiotemporal objects, and concept(-pair) detection without annotations. Our event detection and recounting experiments focus on representations using concept detectors. For instance search we study the influence of spatial verification and color invariance. The 2012 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the runner-up ranking for concept detection in the semantic indexing task. }
    }
  10. Murat Akbacak, Robert C. Bolles, J. Brian Burns, Mark Eliot, Aaron Heller, James A. Herson, Gregory K. Myers, Ramesh Nallapati, Stephanie Pancoast, Julien van Hout, Eric Yeh, Amirhossein Habibian, Dennis C. Koelma, Zhenyang Li, Masoud Mazloom, Silvia-Laura Pintea, Koen E. A. van de Sande, Arnold W. M. Smeulders, Cees G. M. Snoek, Sung Chun Lee, Ram Nevatia, Pramod Sharma, Chen Sun, and Remi Trichet, "The 2012 SESAME Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) Systems," in Proceedings of the 10th TRECVID Workshop, Gaithersburg, USA, 2012.
    @INPROCEEDINGS{SesameTRECVID12,
      author = {Murat Akbacak and Robert C. Bolles and J. Brian Burns and Mark Eliot and Aaron Heller and James A. Herson and Gregory K. Myers and Ramesh Nallapati and Stephanie Pancoast and Julien van Hout and Eric Yeh and Amirhossein Habibian and Dennis C. Koelma and Zhenyang Li and Masoud Mazloom and Silvia-Laura Pintea and Koen E. A. van de Sande and Arnold W. M. Smeulders and Cees G. M. Snoek and Sung Chun Lee and Ram Nevatia and Pramod Sharma and Chen Sun and Remi Trichet},
      title = {The 2012 SESAME Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) Systems},
      booktitle = {Proceedings of the 10th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2012},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/Sesame-TRECVID2012-final.pdf},
      abstract = { The SESAME team submitted four runs for the MED12 pre-specified events, two runs for the ad hoc events, and a run for multimedia event recounting. The detection runs included combinations of low-level visual, motion, and audio features; high-level semantic visual concepts; and text- based modalities (automatic speech recognition [ASR] and video optical character recognition [OCR]). The individual types of features and concepts produced a total of 14 event classifiers. We combined the event detection results for these classifiers using three fusion methods, two of which relied on the particular set of detection scores that were available for each video clip. In addition, we applied three methods for selecting the detection threshold. Performance on the ad hoc events was comparable to that for the pre-specified events. Low-level visual features were the strongest performers across all training conditions and events. However, detectors based on visual concepts and low-level, motion-based features were very competitive in performance. }
    }
  11. Cees G. M. Snoek, Koen E. A. van de Sande, Xirong Li, Masoud Mazloom, Yu-Gang Jiang, Dennis C. Koelma, and Arnold W. M. Smeulders, "The MediaMill TRECVID 2011 Semantic Video Search Engine," in Proceedings of the 9th TRECVID Workshop, Gaithersburg, USA, 2011.
    @INPROCEEDINGS{SnoekTRECVID11,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Xirong Li and Masoud Mazloom and Yu-Gang Jiang and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {The {MediaMill} {TRECVID} 2011 Semantic Video Search Engine},
      booktitle = {Proceedings of the 9th TRECVID Workshop},
      pages = {},
      month = {December},
      year = {2011},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2011-final.pdf},
      abstract = { In this paper we describe our TRECVID 2011 video retrieval experiments. The MediaMill team participated in two tasks: semantic indexing and multimedia event detection. The starting point for the MediaMill detection approach is our top-performing bag-of-words system of TRECVID 2010, which uses multiple color SIFT descriptors, sparse codebooks with spatial pyramids, and kernel-based machine learning. All supported by GPU-optimized algorithms, approximated histogram intersection kernels, and multi-frame video processing. This year our experiments focus on 1) the soft assignment of descriptors with the use of difference coding, 2) the exploration of bag-of-words for event detection, and 3) the selection of informative concepts out of 1,346 concept detectors as a representation for event detection. The 2011 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the runner-up ranking for concept detection in the semantic indexing task. }
    }
  12. Murat Akbacak, Robert C. Bolles, J. Brian Burns, Mark Eliot, Aaron Heller, James A. Herson, Gregory K. Myers, Ramesh Nallapati, Eric Yeh, Dennis C. Koelma, Xirong Li, Masoud Mazloom, Koen E. A. van de Sande, Arnold W. M. Smeulders, Cees G. M. Snoek, Sung Chun Lee, Ram Nevatia, Pramod Sharma, Chen Sun, and Remi Trichet, "The 2011 SESAME Multimedia Event Detection (MED) System," in Proceedings of the 9th TRECVID Workshop, Gaithersburg, USA, 2011.
    @INPROCEEDINGS{SesameTRECVID11,
      author = {Murat Akbacak and Robert C. Bolles and J. Brian Burns and Mark Eliot and Aaron Heller and James A. Herson and Gregory K. Myers and Ramesh Nallapati and Eric Yeh and Dennis C. Koelma and Xirong Li and Masoud Mazloom and Koen E. A. van de Sande and Arnold W. M. Smeulders and Cees G. M. Snoek and Sung Chun Lee and Ram Nevatia and Pramod Sharma and Chen Sun and Remi Trichet},
      title = {The 2011 SESAME Multimedia Event Detection (MED) System},
      booktitle = {Proceedings of the 9th TRECVID Workshop},
      pages = {},
      month = {December},
      year = {2011},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/Sesame-TRECVID2011-final.pdf},
      abstract = { The SESAME team submitted four MED-11 runs which combined video content extraction results consisting of visual features, video OCR results, and motion features. The primary run and one of the secondary runs used two different methods of fusing visual features and OCR results; a third run combined visual features and motion features; and a fourth run combined visual features, OCR results, and motion features. Results were combined using rank-based fusion and weighted averages. We found that rank-based fusion of visual feature results and video OCR results (the primary run) had the best performance of the four runs. The initial performance of the runs with motion features, which were computed around keyframes, was poor, but a subsequent experiment showed that motion features can indeed contribute to improved performance. }
    }
  13. Koen E. A. van de Sande and Cees G. M. Snoek, "The University of Amsterdam’s Concept Detection System at ImageCLEF 2011," in Proceedings of the ImageCLEF Workshop, Amsterdam, The Netherlands, 2011.
    @INPROCEEDINGS{SandeCLEF11,
      author = {Koen E. A. van de Sande and Cees G. M. Snoek},
      title = {The University of Amsterdam's Concept Detection System at {ImageCLEF} 2011},
      booktitle = {Proceedings of the ImageCLEF Workshop},
      pages = {},
      month = {September},
      year = {2011},
      address = {Amsterdam, The Netherlands},
      pdf = {},
      abstract = {}
    }
  14. Cees G. M. Snoek, Koen E. A. van de Sande, Ork de Rooij, Bouke Huurnink, Efstratios Gavves, Daan Odijk, Maarten de Rijke, Theo Gevers, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders, "The MediaMill TRECVID 2010 Semantic Video Search Engine," in Proceedings of the 8th TRECVID Workshop, Gaithersburg, USA, 2010.
    @INPROCEEDINGS{SnoekTRECVID10,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Ork de Rooij and Bouke Huurnink and Efstratios Gavves and Daan Odijk and Maarten de Rijke and Theo Gevers and Marcel Worring and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {The {MediaMill} {TRECVID} 2010 Semantic Video Search Engine},
      booktitle = {Proceedings of the 8th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2010},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2010-final.pdf},
      abstract = { In this paper we describe our TRECVID 2010 video retrieval experiments. The MediaMill team participated in three tasks: semantic indexing, known-item search, and instance search. The starting point for the MediaMill concept detection approach is our top-performing bag-of-words system of TRECVID 2009, which uses multiple color SIFT descriptors, sparse codebooks with spatial pyramids, kernel-based machine learning, and multi-frame video processing. We improve upon this baseline system by further speeding up its execution times for both training and classification using GPU-optimized algorithms, approximated histogram intersection kernels, and several multi-frame combination methods. Being more efficient allowed us to supplement the Internet video training collection with positively labeled examples from international news broadcasts and Dutch documentary video from the TRECVID 2005-2009 benchmarks. Our experimental setup covered a huge training set of 170 thousand keyframes and a test set of 600 thousand keyframes in total. Ultimately leading to 130 robust concept detectors for video retrieval. For retrieval, a robust but limited set of concept detectors justifies the need to rely on as many auxiliary information channels as possible. For automatic known item search we therefore explore how we can learn to rank various information channels simultaneously to maximize video search results for a given topic. To further improve the video retrieval results, our interactive known item search experiments investigate how to combine metadata search and visualization into a single interface. The 2010 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the top ranking for concept detection in the semantic indexing task. }
    }
  15. Cees G. M. Snoek, Koen E. A. van de Sande, Ork de Rooij, Bouke Huurnink, Jasper R. R. Uijlings, Michiel van Liempt, Miguel Bugalho, Isabel Trancoso, Fei Yan, Muhammad A. Tahir, Krystian Mikolajczyk, Josef Kittler, Maarten de Rijke, Jan-Mark Geusebroek, Theo Gevers, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders, "The MediaMill TRECVID 2009 Semantic Video Search Engine," in Proceedings of the 7th TRECVID Workshop, Gaithersburg, USA, 2009.
    @INPROCEEDINGS{SnoekTRECVID09,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Ork de Rooij and Bouke Huurnink and Jasper R. R. Uijlings and Michiel van Liempt and Miguel Bugalho and Isabel Trancoso and Fei Yan and Muhammad A. Tahir and Krystian Mikolajczyk and Josef Kittler and Maarten de Rijke and Jan-Mark Geusebroek and Theo Gevers and Marcel Worring and Dennis C. Koelma and Arnold W. M. Smeulders},
      title = {The {MediaMill} {TRECVID} 2009 Semantic Video Search Engine},
      booktitle = {Proceedings of the 7th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2009},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2009-final.pdf},
      abstract = { In this paper we describe our TRECVID 2009 video retrieval experiments. The MediaMill team participated in three tasks: concept detection, automatic search, and interactive search. The starting point for the MediaMill concept detection approach is our top-performing bag-of-words system of last year, which uses multiple color descriptors, codebooks with soft-assignment, and kernel-based supervised learning. We improve upon this baseline system by exploring two novel research directions. Firstly, we study a multi-modal extension by including 20 audio concepts and fusion using two novel multi-kernel supervised learning methods. Secondly, with the help of recently proposed algorithmic refinements of bag-of-word representations, a GPU implementation, and compute clusters, we scale-up the amount of visual information analyzed by an order of magnitude, to a total of 1,000,000 i-frames. Our experiments evaluate the merit of these new components, ultimately leading to 64 robust concept detectors for video retrieval. For retrieval, a robust but limited set of concept detectors justifies the need to rely on as many auxiliary information channels as possible. For automatic search we therefore explore how we can learn to rank various information channels simultaneously to maximize video search results for a given topic. To further improve the video retrieval results, our interactive search experiments investigate the roles of visualizing preview results for a certain browse-dimension and relevance feedback mechanisms that learn to solve complex search topics by analysis from user browsing behavior. The 2009 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the top ranking for both concept detection and interactive search. Again a lot has been learned during this year's TRECVID campaign; we highlight the most important lessons at the end of this paper. }
    }
  16. Cees G. M. Snoek, Koen E. A. van de Sande, Ork de Rooij, Bouke Huurnink, Jan C. van Gemert, Jasper R. R. Uijlings, Jiyin He, Xirong Li, Ivo Everts, Vladimir Nedović, Michiel van Liempt, Richard van Balen, Fei Yan, Muhammad A. Tahir, Krystian Mikolajczyk, Josef Kittler, Maarten de Rijke, Jan-Mark Geusebroek, Theo Gevers, Marcel Worring, Arnold W. M. Smeulders, and Dennis C. Koelma, "The MediaMill TRECVID 2008 Semantic Video Search Engine," in Proceedings of the 6th TRECVID Workshop, Gaithersburg, USA, 2008.
    @INPROCEEDINGS{SnoekTRECVID08,
      author = {Cees G. M. Snoek and Koen E. A. van de Sande and Ork de Rooij and Bouke Huurnink and Jan C. van Gemert and Jasper R. R. Uijlings and Jiyin He and Xirong Li and Ivo Everts and Vladimir Nedovi\'c and Michiel van Liempt and Richard van Balen and Fei Yan and Muhammad A. Tahir and Krystian Mikolajczyk and Josef Kittler and Maarten de Rijke and Jan-Mark Geusebroek and Theo Gevers and Marcel Worring and Arnold W. M. Smeulders and Dennis C. Koelma},
      title = {The {MediaMill} {TRECVID} 2008 Semantic Video Search Engine},
      booktitle = {Proceedings of the 6th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2008},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2008-final.pdf},
      abstract = { In this paper we describe our TRECVID 2008 video retrieval experiments. The MediaMill team participated in three tasks: concept detection, automatic search, and interactive search. Rather than continuing to increase the number of concept detectors available for retrieval, our TRECVID 2008 experiments focus on increasing the robustness of a small set of detectors using a bag-of-words approach. To that end, our concept detection experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. For retrieval, a robust but limited set of concept detectors necessitates the need to rely on as many auxiliary information channels as possible. Therefore, our automatic search experiments focus on predicting which information channel to trust given a certain topic, leading to a novel framework for predictive video retrieval. To improve the video retrieval results further, our interactive search experiments investigate the roles of visualizing preview results for a certain browse-dimension and active learning mechanisms that learn to solve complex search topics by analysis from user browsing behavior. The 2008 edition of the TRECVID benchmark has been the most successful MediaMill participation to date, resulting in the top ranking for both concept detection and interactive search, and a runner-up ranking for automatic retrieval. Again a lot has been learned during this year's TRECVID campaign; we highlight the most important lessons at the end of this paper. }
    }
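    The MediaMill entries of 2008-2012 above share one bag-of-words backbone: local (color) SIFT descriptors quantized against a codebook into a fixed-length histogram, optionally per spatial-pyramid cell. A minimal sketch of the hard-assignment quantization step with random stand-in data; the actual systems use soft assignment, difference coding, and far larger codebooks.
      import numpy as np
      from scipy.spatial.distance import cdist

      rng = np.random.default_rng(0)
      codebook = rng.standard_normal((400, 128))      # visual words x descriptor dim
      descriptors = rng.standard_normal((1500, 128))  # stand-in SIFT for one frame

      # Hard assignment: each descriptor votes for its nearest visual word.
      words = cdist(descriptors, codebook).argmin(axis=1)
      hist = np.bincount(words, minlength=len(codebook)).astype(float)
      hist /= hist.sum()  # L1-normalized bag-of-words vector for the frame
      print(hist.shape)   # (400,)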
  17. Cees G. M. Snoek, Ivo Everts, Jan C. van Gemert, Jan-Mark Geusebroek, Bouke Huurnink, Dennis C. Koelma, Michiel van Liempt, Ork de Rooij, Koen E. A. van de Sande, Arnold W. M. Smeulders, Jasper R. R. Uijlings, and Marcel Worring, "The MediaMill TRECVID 2007 Semantic Video Search Engine," in Proceedings of the 5th TRECVID Workshop, Gaithersburg, USA, 2007.
    @INPROCEEDINGS{SnoekTRECVID07,
      author = {Cees G. M. Snoek and Ivo Everts and Jan C. van Gemert and Jan-Mark Geusebroek and Bouke Huurnink and Dennis C. Koelma and Michiel van Liempt and Ork de Rooij and Koen E. A. van de Sande and Arnold W. M. Smeulders and Jasper R. R. Uijlings and Marcel Worring},
      title = {The {MediaMill} {TRECVID} 2007 Semantic Video Search Engine},
      booktitle = {Proceedings of the 5th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2007},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2007-final.pdf},
      abstract = { In this paper we describe our TRECVID 2007 experiments. The MediaMill team participated in two tasks: concept detection and search. For concept detection we extract region-based image features, on grid, keypoint, and segmentation level, which we combine with various supervised learners. In addition, we explore the utility of temporal image features. A late fusion approach of all region-based analysis methods using geometric mean was our most successful run. What is more, using MediaMill Challenge and LSCOM annotations, our visual-only approach generalizes to a set of 572 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concept detectors based on text matching, ontology querying, and visual concept likelihood. The suggestion engine is evaluated as part of the automatic search task and forms the entry point for our interactive search experiments. For this task we experiment with two browsers for interactive exploration: the well-known CrossBrowser and the novel ForkBrowser. It was found that, while retrieval performance varies substantially per topic, the ForkBrowser is able to produce the same overall results as the CrossBrowser. However, the ForkBrowser obtains top-performance for most topics with less user interaction. Indicating the potential of this browser for interactive search. Similar to previous years our best interactive search runs yield high overall performance, ranking 3rd and 4th. }
    }
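    The most successful 2007 run above fuses the region-based analysis methods by taking the geometric mean of their scores. A tiny numpy sketch, assuming each method outputs a probability-like score per shot (the numbers are made up).
      import numpy as np

      # Rows: analysis methods (e.g. grid, keypoint, segmentation level);
      # columns: shots. Hypothetical scores in (0, 1].
      method_scores = np.array([
          [0.90, 0.10, 0.40],
          [0.80, 0.20, 0.60],
          [0.70, 0.05, 0.50],
      ])

      # Geometric mean across methods; epsilon guards against log(0).
      eps = 1e-12
      fused = np.exp(np.log(method_scores + eps).mean(axis=0))
      print(fused)  # one fused score per shot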
  18. Cees G. M. Snoek, Jan C. van Gemert, Theo Gevers, Bouke Huurnink, Dennis C. Koelma, Michiel van Liempt, Ork de Rooij, Koen E. A. van de Sande, Frank J. Seinstra, Arnold W. M. Smeulders, Andrew H. C. Thean, Cor J. Veenman, and Marcel Worring, "The MediaMill TRECVID 2006 Semantic Video Search Engine," in Proceedings of the 4th TRECVID Workshop, Gaithersburg, USA, 2006.
    @INPROCEEDINGS{SnoekTRECVID06,
      author = {Cees G. M. Snoek and Jan C. van Gemert and Theo Gevers and Bouke Huurnink and Dennis C. Koelma and Michiel van Liempt and Ork de Rooij and Koen E. A. van de Sande and Frank J. Seinstra and Arnold W. M. Smeulders and Andrew H. C. Thean and Cor J. Veenman and Marcel Worring},
      title = {The {MediaMill} {TRECVID} 2006 Semantic Video Search Engine},
      booktitle = {Proceedings of the 4th TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2006},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2006-final.pdf},
      abstract = { In this paper we describe our TRECVID 2006 experiments. The MediaMill team participated in two tasks: concept detection and search. For concept detection we use the MediaMill Challenge as experimental platform. The MediaMill Challenge divides the generic video indexing problem into a visual-only, textual-only, early fusion, late fusion, and combined analysis experiment. We provide a baseline implementation for each experiment together with baseline results, which we made available for the TRECVID community. The Challenge package was downloaded more than 80 times and we anticipate that it has been used by several teams for their 2006 submission. Our Challenge experiments focus specifically on visual-only analysis of video (run id: B\_MM). We extract image features, on global, regional, and keypoint level, which we combine with various supervised learners. A late fusion approach of visual-only analysis methods using geometric mean was our most successful run. With this run we conquer the Challenge baseline by more than 50\%. Our concept detection experiments have resulted in the best score for three concepts: i.e. \emph{desert}, \emph{flag us}, and \emph{charts}. What is more, using LSCOM annotations, our visual-only approach generalizes well to a set of 491 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concept detectors based on text matching and ontology querying. The suggestion engine is evaluated as part of the automatic search task (run id: A-MM) and forms the entry point for our interactive search experiments (run id: A-MM). Here we experiment with query by object matching and two browsers for interactive exploration: the CrossBrowser and the novel NovaBrowser. It was found that the NovaBrowser is able to produce the same results as the CrossBrowser, but with less user interaction. Similar to previous years our best interactive search runs yield top performance, ranking 2nd and 6th overall. Again a lot has been learned during this year's TRECVID campaign, we highlight the most important lessons at the end of this paper. }
    }
  19. Cees G. M. Snoek, Jan C. van Gemert, Jan-Mark Geusebroek, Bouke Huurnink, Dennis C. Koelma, Giang P. Nguyen, Ork de Rooij, Frank J. Seinstra, Arnold W. M. Smeulders, Cor J. Veenman, and Marcel Worring, "The MediaMill TRECVID 2005 Semantic Video Search Engine," in Proceedings of the 3rd TRECVID Workshop, Gaithersburg, USA, 2005.
    @INPROCEEDINGS{SnoekTRECVID05,
      author = {Cees G. M. Snoek and Jan C. van Gemert and Jan-Mark Geusebroek and Bouke Huurnink and Dennis C. Koelma and Giang P. Nguyen and Ork de Rooij and Frank J. Seinstra and Arnold W. M. Smeulders and Cor J. Veenman and Marcel Worring},
      title = {The {MediaMill} {TRECVID} 2005 Semantic Video Search Engine},
      booktitle = {Proceedings of the 3rd TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2005},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/UvA-MM_TRECVID2005.pdf},
      abstract = { In this paper we describe our TRECVID 2005 experiments. The UvA-MediaMill team participated in four tasks. For the detection of camera work (runid: A\_CAM) we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information. Experiments indicate that average precision results increase drastically, especially for pan (+51\%) and tilt (+28\%). For concept detection we propose a generic approach using our semantic pathfinder. Most important novelty compared to last years system is the improved visual analysis using proto-concepts based on Wiccest features. In addition, the path selection mechanism was extended. Based on the semantic pathfinder architecture we are currently able to detect an unprecedented lexicon of 101 semantic concepts in a generic fashion. We performed a large set of experiments (runid: B\_vA). The results show that an optimal strategy for generic multimedia analysis is one that learns from the training set on a per-concept basis which tactic to follow. Experiments also indicate that our visual analysis approach is highly promising. The lexicon of 101 semantic concepts forms the basis for our search experiments (runid: B\_2\_A-MM). We participated in automatic, manual (using only visual information), and interactive search. The lexicon-driven retrieval paradigm aids substantially in all search tasks. When coupled with interaction, exploiting several novel browsing schemes of our semantic video search engine, results are excellent. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. We exploited the technology developed for the above tasks to explore the BBC rushes. Most intriguing result is that from the lexicon of 101 visual-only models trained for news data 25 concepts perform reasonably well on BBC data also. }
    }
  20. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra, "The MediaMill TRECVID 2004 Semantic Video Search Engine," in Proceedings of the 2nd TRECVID Workshop, Gaithersburg, USA, 2004.
    @INPROCEEDINGS{SnoekTRECVID04,
      author = {Cees G. M. Snoek and Marcel Worring and Jan-Mark Geusebroek and Dennis C. Koelma and Frank J. Seinstra},
      title = {The {MediaMill} {TRECVID} 2004 Semantic Video Search Engine},
      booktitle = {Proceedings of the 2nd TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2004},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/UvA-MM_TRECVID2004.pdf},
      abstract = { This year the UvA-MediaMill team participated in the Feature Extraction and Search Task. We developed a generic approach for semantic concept classification using the semantic value chain. The semantic value chain extracts concepts from video documents based on three consecutive analysis links, named the content link, the style link, and the context link. Various experiments within the analysis links were performed, showing amongst others the merit of processing beyond key frames, the value of style elements, and the importance of learning semantic context. For all experiments a lexicon of 32 concepts was exploited, 10 of which are part of the Feature Extraction Task. Top three system-based ranking in 8 out of the 10 benchmark concepts indicates that our approach is very promising. Apart from this, the lexicon of 32 concepts proved very useful in an interactive search scenario with our semantic video search engine, where we obtained the highest mean average precision of all participants. }
    }
  21. Alexander Hauptmann, Robert V. Baron, Ming-Yu Chen, Michael Christel, Pinar Duygulu, Chang Huang, Rong Jin, Wei-Hao Lin, Dorbin Ng, Neema Moraveji, Norman Papernick, Cees G. M. Snoek, George Tzanetakis, Jun Yang, Rong Yan, and Howard D. Wactlar, "Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video," in Proceedings of the 1st TRECVID Workshop, Gaithersburg, USA, 2003.
    @INPROCEEDINGS{HauptmannTRECVID03,
      author = {Alexander Hauptmann and Robert V. Baron and Ming-Yu Chen and Michael Christel and Pinar Duygulu and Chang Huang and Rong Jin and Wei-Hao Lin and Dorbin Ng and Neema Moraveji and Norman Papernick and Cees G. M. Snoek and George Tzanetakis and Jun Yang and Rong Yan and Howard D. Wactlar},
      title = {Informedia at {TRECVID 2003}: Analyzing and Searching Broadcast News Video},
      booktitle = {Proceedings of the 1st TRECVID Workshop},
      pages = {},
      month = {November},
      year = {2003},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/TREC03Informedia.pdf},
      abstract = {}
    }
  22. Jeroen Vendrig, Jurgen den Hartog, David van Leeuwen, Ioannis Patras, Stephan Raaijmakers, Jeroen van Rest, Cees G. M. Snoek, and Marcel Worring, "TREC Feature Extraction by Active Learning," in Proceedings of the 11th Text Retrieval Conference, Gaithersburg, USA, 2002.
    @INPROCEEDINGS{VendrigTREC02,
      author = {Jeroen Vendrig and Jurgen den Hartog and David van Leeuwen and Ioannis Patras and Stephan Raaijmakers and Jeroen van Rest and Cees G. M. Snoek and Marcel Worring},
      title = {{TREC} Feature Extraction by Active Learning},
      booktitle = {Proceedings of the 11th Text Retrieval Conference},
      month = {November},
      year = {2002},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/trec2002.pdf},
      abstract = {}
    }
  23. Jan Baan, Alex van Ballegooij, Jan-Mark Geusebroek, Djoerd Hiemstra, Jurgen den Hartog, Johan List, Cees G. M. Snoek, Ioannis Patras, Stephan Raaijmakers, Leon Todoran, Jeroen Vendrig, Arjen de Vries, Thijs Westerveld, and Marcel Worring, "Lazy Users and Automatic Video Retrieval Tools in (the) Lowlands," in Proceedings of the 10th Text Retrieval Conference, Gaithersburg, USA, 2001.
    @INPROCEEDINGS{BaanTREC01,
      author = {Jan Baan and Alex van Ballegooij and Jan-Mark Geusebroek and Djoerd Hiemstra and Jurgen den Hartog and Johan List and Cees G. M. Snoek and Ioannis Patras and Stephan Raaijmakers and Leon Todoran and Jeroen Vendrig and Arjen de Vries and Thijs Westerveld and Marcel Worring},
      title = {Lazy Users and Automatic Video Retrieval Tools in (the) Lowlands},
      booktitle = {Proceedings of the 10th Text Retrieval Conference},
      month = {November},
      year = {2001},
      address = {Gaithersburg, USA},
      pdf = {http://isis-data.science.uva.nl/cgmsnoek/pub/lowlands01.pdf},
      abstract = {}
    }
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
