2017

Pascal Mettes, Cees G M Snoek, Shih-Fu Chang: Localizing Actions from Video Labels and Pseudo-Annotations. BMVC, London, UK, 2017.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-pseudo-annotations-bmvc2017.pdf
Abstract: The goal of this paper is to determine the spatio-temporal location of actions in video. Where training from hard-to-obtain box annotations is the norm, we propose an intuitive and effective algorithm that localizes actions from their class label only. We are inspired by recent work showing that unsupervised action proposals selected with human point-supervision perform as well as using expensive box annotations. Rather than asking users to provide point supervision, we propose fully automatic visual cues that replace manual point annotations. We call the cues pseudo-annotations, introduce five of them, and propose a correlation metric for automatically selecting and combining them. Thorough evaluation on challenging action localization datasets shows that we reach results comparable to results with full box supervision. We also show that pseudo-annotations can be leveraged during testing to improve weakly- and strongly-supervised localizers.

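As a rough illustration of combining automatic cues, the sketch below weights each pseudo-annotation cue by its average correlation with the other cues before fusing their proposal scores. It is a hypothetical stand-in, not the paper's selection algorithm, and the toy cue scores are randomly generated.

```python
import numpy as np

def combine_cues(cue_scores):
    """Weight each pseudo-annotation cue by its mean correlation with the
    other cues, then return the weighted combination of proposal scores.

    cue_scores: array of shape (num_cues, num_proposals), one row of
    proposal scores per automatic cue (hypothetical toy input).
    """
    corr = np.corrcoef(cue_scores)              # pairwise cue correlations
    np.fill_diagonal(corr, 0.0)                 # ignore self-correlation
    weights = corr.mean(axis=1).clip(min=0.0)   # cues agreeing with the rest get weight
    weights /= weights.sum() + 1e-12
    return weights @ cue_scores                 # combined proposal scores

# toy example: 3 cues scoring 5 action proposals
rng = np.random.default_rng(0)
cue_scores = rng.random((3, 5))
print(combine_cues(cue_scores))
```
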
Mihir Jain, Jan C van Gemert, Hervé Jégou, Patrick Bouthemy, Cees G M Snoek: Tubelets: Unsupervised Action Proposals from Spatiotemporal Super-voxels. International Journal of Computer Vision, 124 (3), pp. 287–311, 2017.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/jain-tubelets-ijcv.pdf
Abstract: This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an approach to generate action proposals from spatiotemporal super-voxels in an unsupervised manner; we call them Tubelets. Second, along with the static features from individual frames our approach advantageously exploits motion. We introduce independent motion evidence as a feature to characterize how the action deviates from the background and explicitly incorporate such motion information in various stages of the proposal generation. Finally, we introduce spatiotemporal refinement of Tubelets, for more precise localization of actions, and pruning to keep the number of Tubelets limited. We demonstrate the suitability of our approach by extensive experiments for action proposal quality and action localization on three public datasets: UCF Sports, MSR-II and UCF101. For action proposal quality, our unsupervised proposals beat all other existing approaches on the three datasets. For action localization, we show top performance on both the trimmed videos of UCF Sports and UCF101 as well as the untrimmed videos of MSR-II.

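To make the selective-search analogy concrete, here is a minimal, hypothetical sketch of grouping adjacent super-voxels by feature similarity, where every intermediate group stands in for a Tubelet-style proposal; the descriptors, adjacency and merging rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tubelet_style_grouping(features, edges):
    """Greedy, selective-search-style grouping of spatiotemporal super-voxels.
    features: {voxel_id: np.array descriptor}; edges: set of frozenset pairs of
    adjacent super-voxels. Every merge yields one candidate group."""
    feats, edges = dict(features), set(edges)
    members = {v: {v} for v in feats}
    proposals = []
    cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

    next_id = max(feats) + 1
    while edges:
        a, b = max(edges, key=lambda e: cos(*(feats[v] for v in e)))
        members[next_id] = members.pop(a) | members.pop(b)
        feats[next_id] = (feats.pop(a) + feats.pop(b)) / 2.0   # crude merged descriptor
        proposals.append(set(members[next_id]))
        # rewire adjacency: anything touching a or b now touches the merged group
        edges = {frozenset({next_id if v in (a, b) else v for v in e})
                 for e in edges if e != frozenset({a, b})}
        edges = {e for e in edges if len(e) == 2}
        next_id += 1
    return proposals

# toy example: four super-voxels on a line, two appearance clusters
f = {0: np.array([1., 0.]), 1: np.array([.9, .1]),
     2: np.array([0., 1.]), 3: np.array([.1, .9])}
adj = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3)]}
print(tubelet_style_grouping(f, adj))
```
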
Jingkuan Song, Hervé Jégou, Cees Snoek, Qi Tian, Nicu Sebe: Guest Editorial: Large-Scale Multimedia Data Retrieval, Classification, and Understanding. IEEE Transactions on Multimedia, 19 (9), pp. 1965–1967, 2017.
Abstract: The papers in this special section focus on multimedia data retrieval and classification via large-scale systems. Today, large collections of multimedia data are created explosively in different fields and have attracted increasing interest in the multimedia research area. Large-scale multimedia data provide unprecedented opportunities to address many challenging research problems, e.g., enabling generic visual classification to bridge the well-known semantic gap by exploring large-scale data, offering a promising possibility for in-depth multimedia understanding, as well as discerning patterns and making better decisions by analyzing the large pool of data. Therefore, techniques for large-scale multimedia retrieval, classification, and understanding are highly desired. Simultaneously, the explosion of multimedia data creates an urgent need for more sophisticated and robust models and algorithms to retrieve, classify, and understand these data. Another interesting challenge is how traditional machine learning algorithms can be scaled up to millions or even billions of items with thousands of dimensions. This motivated the community to design parallel and distributed machine learning platforms, exploiting GPUs as well as developing practical algorithms. It is also important to exploit the commonalities and differences between tasks, e.g., image retrieval and classification have much in common, while different indexing methods evolve in a mutually supporting way.

Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G M Snoek, Arnold W M Smeulders: Tracking by Natural Language Specification. CVPR, Honolulu, USA, 2017.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/li-tracking-language-cvpr2017.pdf
Abstract: This paper strives to track a target object in a video. Rather than specifying the target in the first frame of a video by a bounding box, we propose to track the object based on a natural language specification of the target, which provides a more natural human-machine interaction as well as a means to improve tracking results. We define three variants of tracking by language specification: one relying on lingual target specification only, one relying on visual target specification based on language, and one leveraging their joint capacity. To show the potential of tracking by natural language specification we extend two popular tracking datasets with lingual descriptions and report experiments. Finally, we also sketch new tracking scenarios in surveillance and other live video streams that become feasible with a lingual specification of the target.

Thomas Mensink, Thomas Jongstra, Pascal Mettes, Cees G M Snoek: Music-Guided Video Summarization using Quadratic Assignments. ICMR, Bucharest, Romania, 2017.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-music-video-summarization-icmr2017.pdf, http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-music-video-summarization-icmr2017.mp4
Abstract: This paper aims to automatically generate a summary of an unedited video, guided by an externally provided music-track. The tempo, energy and beats in the music determine the choices and cuts in the video summarization. To solve this challenging task, we model video summarization as a quadratic assignment problem. We assign frames to the summary, using rewards based on frame interestingness, plot coherency, audio-visual match, and cut properties. Experimentally we validate our approach on the SumMe dataset. The results show that our music guided summaries are more appealing, and even outperform the current state-of-the-art summarization methods when evaluated on the F1 measure of precision and recall.

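A minimal sketch of what a quadratic-assignment objective for this task could look like, assuming toy per-segment interestingness, segment-to-music match and pairwise coherency scores; the brute-force search below is only viable for this tiny example and is not the paper's solver.

```python
import itertools
import numpy as np

def summary_score(assign, interest, av_match, coherency):
    """Toy quadratic-assignment objective for music-guided summarization.
    assign[j] = index of the video segment placed in music slot j.
    Linear terms reward interesting segments that match their music slot;
    the quadratic term rewards coherent consecutive segment pairs."""
    linear = sum(interest[s] + av_match[s, j] for j, s in enumerate(assign))
    quad = sum(coherency[assign[j], assign[j + 1]] for j in range(len(assign) - 1))
    return linear + quad

# toy instance: 5 candidate video segments, 3 music slots (all values hypothetical)
rng = np.random.default_rng(1)
interest = rng.random(5)          # per-segment interestingness
av_match = rng.random((5, 3))     # segment-to-music-slot audio-visual match
coherency = rng.random((5, 5))    # plot coherency between consecutive segments

best = max(itertools.permutations(range(5), 3),
           key=lambda a: summary_score(a, interest, av_match, coherency))
print("selected segments per music slot:", best)
```
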
Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Ranjay Khrisna, Victor Escorcia, Kenji Hata, Shyamal Buch: ActivityNet Challenge 2017 Summary. arXiv:1710.08011, 2017.
Abstract: The ActivityNet Large Scale Activity Recognition Challenge 2017 summary: results and challenge participants' papers.

2016

Rama Kovvuri, Ram Nevatia, Cees G M Snoek: Segment-based Models for Event Detection and Recounting. ICPR, Cancun, Mexico, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/kovvuri-segment-models-icpr2016.pdf
Abstract: We present a novel approach towards web video classification and recounting that uses video segments to model an event. This approach overcomes the limitations faced by the classical video-level models such as modeling semantics, identifying informative segments in a video and background segment suppression. We posit that segment-based models are able to identify both the frequently-occurring and rarer patterns in an event effectively, despite being trained on only a fraction of the training data. Our framework employs a discriminative approach to optimize our models in distributed and data-driven fashion while maintaining semantic interpretability. We evaluate the effectiveness of our approach on the challenging TRECVID MEDTest 2014 dataset. We demonstrate improvements in recounting and classification, particularly in events characterized by inherent intra-class variations.

Pascal Mettes, Jan C van Gemert, Cees G M Snoek: No Spare Parts: Sharing Part Detectors for Image Categorization. Computer Vision and Image Understanding, 152, pp. 131–141, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-spare-parts-cviu.pdf
Abstract: This work aims for image categorization by learning a representation of discriminative parts. Different from most existing part-based methods, we argue that parts are naturally shared between image categories and should be modeled as such. We motivate our approach with a quantitative and qualitative analysis by backtracking where selected parts come from. Our analysis shows that in addition to the category parts defining the category, the parts coming from the background context and parts from other image categories improve categorization performance. Part selection should not be done separately for each category, but instead be shared and optimized over all categories. To incorporate part sharing between categories, we present an algorithm based on AdaBoost to optimize part sharing and selection, as well as fusion with the global image representation. With a single algorithm and without the need for task-specific optimization, we achieve results competitive to the state-of-the-art on object, scene, and action categories, further improving over deep convolutional neural networks and alternative part representations.

Cees G M Snoek, Jianfeng Dong, Xirong Li, Xiaoxu Wang, Qijie Wei, Weiyu Lan, Efstratios Gavves, Noureldien Hussein, Dennis C Koelma, Arnold W M Smeulders: University of Amsterdam and Renmin University at TRECVID 2016: Searching Video, Detecting Events and Describing Video. TRECVID, Gaithersburg, USA, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2016-final.pdf
Abstract: In this paper we summarize our TRECVID 2016 video recognition experiments. We participated in three tasks: video search, event detection and video description. Here we describe the tasks on event detection and video description. For event detection we explore semantic representations based on VideoStory and an ImageNet Shuffle for both zero-shot and few-example regimes. For the showcase task on video description we experiment with a deep network that predicts a visual representation from a natural language description, and use this space for the sentence matching. For generative description we enhance a neural image captioning model with Early Embedding and Late Reranking. The 2016 edition of the TRECVID benchmark has been a fruitful participation for our joint-team, resulting in the best overall result for zero- and few-example event detection as well as video description by matching and in generative mode.

Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, Cees G M Snoek: Early Embedding and Late Reranking for Video Captioning. MM, Amsterdam, The Netherlands, 2016. (Multimedia Grand Challenge winner)
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/dong-captioning-mm2016.pdf
Abstract: This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for re-scoring generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules add a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the blind test by the organizers. Our system is ranked at the 4th place in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.

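The late-reranking step can be illustrated with a small sketch that re-orders candidate captions by cosine similarity to a video representation; the bag-of-tags embedding used here is a hypothetical stand-in for the learned encoders in the paper.

```python
import numpy as np

VOCAB = ["dog", "ball", "park", "man", "guitar", "stage"]   # hypothetical tag vocabulary

def embed(text):
    """Toy bag-of-tags embedding; a learned sentence/video encoder would go here."""
    words = text.lower().split()
    return np.array([words.count(t) for t in VOCAB], dtype=float)

def rerank(captions, video_embedding):
    """Late reranking: re-score generated captions by relevance to the video."""
    def relevance(caption):
        c = embed(caption)
        denom = np.linalg.norm(c) * np.linalg.norm(video_embedding) + 1e-12
        return float(c @ video_embedding) / denom
    return sorted(captions, key=relevance, reverse=True)

video = embed("dog ball park")        # stand-in for a visual representation
candidates = ["a man plays guitar on stage",
              "a dog chases a ball in the park",
              "a dog runs in the park"]
print(rerank(candidates, video))
```
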
Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees G M Snoek, Tinne Tuytelaars: Online Action Detection. ECCV, Amsterdam, The Netherlands, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/geest-online-action-eccv2016.pdf, https://homes.esat.kuleuven.be/psi-archive/rdegeest/TVSeries.html
Abstract: In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.

Pascal Mettes, Jan C van Gemert, Cees G M Snoek: Spot On: Action Localization from Pointly-Supervised Proposals. ECCV, Amsterdam, The Netherlands, 2016. (Oral presentation, top 1.8%)
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-pointly-eccv2016.pdf, http://isis-data.science.uva.nl/mettes/hollywood2tubes.tar.gz
Abstract: We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier demanding carefully annotated box annotations at train time. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at tinyurl.com/hollywood2tubes.

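The proposal-point overlap idea admits a compact illustration: score a spatio-temporal proposal by the fraction of annotated points it covers. The sketch below uses hypothetical boxes and clicks and leaves out the Multiple Instance Learning objective.

```python
def point_overlap(proposal, points):
    """Fraction of annotated points covered by a proposal.
    proposal: {frame: (x1, y1, x2, y2)} boxes; points: {frame: (x, y)} clicks.
    A toy stand-in for the proposal-point overlap used to guide training."""
    hits = 0
    for frame, (px, py) in points.items():
        box = proposal.get(frame)
        if box and box[0] <= px <= box[2] and box[1] <= py <= box[3]:
            hits += 1
    return hits / max(len(points), 1)

# toy example: one proposal over three frames, clicks on two of them
proposal = {0: (10, 10, 60, 80), 1: (12, 12, 62, 82), 2: (14, 14, 64, 84)}
points = {0: (30, 40), 2: (100, 100)}   # second click misses the boxes
print(point_overlap(proposal, points))  # -> 0.5
```
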
Spencer Cappallo, Thomas Mensink, Cees G M Snoek: Video Stream Retrieval of Unseen Queries using Semantic Memory. BMVC, York, UK, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-videostream-bmvc2016.pdf
Abstract: Retrieval of live, user-broadcast video streams is an under-addressed and increasingly relevant challenge. The on-line nature of the problem requires temporal evaluation and the unforeseeable scope of potential queries motivates an approach which can accommodate arbitrary search queries. To account for the breadth of possible queries, we adopt a no-example approach to query retrieval, which uses a query's semantic relatedness to pre-trained concept classifiers. To adapt to shifting video content, we propose memory pooling and memory welling methods that favor recent information over long past content. We identify two stream retrieval tasks, instantaneous retrieval at any particular time and continuous retrieval over a prolonged duration, and propose means for evaluating them. Three large scale video datasets are adapted to the challenge of stream retrieval. We report results for our search methods on the new stream retrieval tasks, as well as demonstrate their efficacy in a traditional, non-streaming video task.

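A minimal sketch of memory-style pooling over a live stream: per-frame concept scores are pooled with an exponential decay that favours recent content, and an unseen query is scored by its semantic relatedness to the concepts. The decay constant and the relatedness vector are assumptions for illustration, not the paper's welling formulation.

```python
import numpy as np

def stream_retrieval_scores(frame_scores, query_relatedness, decay=0.9):
    """Memory-pooling-style scoring of a live stream for an unseen query.
    frame_scores: iterable of per-frame concept-detector score vectors.
    query_relatedness: semantic relatedness of the query to each concept
    (e.g., word-embedding similarities); here just a given vector.
    The pooled memory favours recent content via exponential decay."""
    memory = None
    for scores in frame_scores:
        scores = np.asarray(scores, dtype=float)
        memory = scores if memory is None else np.maximum(decay * memory, scores)
        yield float(memory @ query_relatedness)   # instantaneous retrieval score

# toy stream over 3 concepts ("dog", "concert", "cooking"), query related to "concert"
stream = [[0.1, 0.0, 0.2], [0.0, 0.9, 0.1], [0.0, 0.2, 0.1], [0.0, 0.1, 0.0]]
query = np.array([0.05, 0.9, 0.05])
print(list(stream_retrieval_scores(stream, query)))
```
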
Masoud Mazloom, Xirong Li, Cees G M Snoek: TagBook: A Semantic Video Representation without Supervision for Event Detection. IEEE Transactions on Multimedia, 18 (7), pp. 1378–1388, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-tagbook-tmm.pdf
Abstract: We consider the problem of event detection in video for scenarios where only few, or even zero examples are available for training. For this challenging setting, the prevailing solutions in the literature rely on a semantic video representation obtained from thousands of pre-trained concept detectors. Different from existing work, we propose a new semantic video representation that is based on freely available social tagged videos only, without the need for training any intermediate concept detectors. We introduce a simple algorithm that propagates tags from a video's nearest neighbors, similar in spirit to the ones used for image retrieval, but redesign it for video event detection by including video source set refinement and varying the video tag assignment. We call our approach TagBook and study its construction, descriptiveness and detection performance on the TRECVID 2013 and 2014 multimedia event detection datasets and the Columbia Consumer Video dataset. Despite its simple nature, the proposed TagBook video representation is remarkably effective for few-example and zero-example event detection, even outperforming very recent state-of-the-art alternatives building on supervised representations.

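The tag-propagation step can be sketched in a few lines: collect the tags of a video's nearest neighbours, weighted by visual similarity. The features, tags and weighting below are toy assumptions; the paper adds source-set refinement and varied tag assignment on top of this basic scheme.

```python
import numpy as np

def tagbook(query_feature, corpus_features, corpus_tags, k=3):
    """Propagate tags from a video's k nearest neighbours (toy TagBook-style
    representation). corpus_features: (n, d) array; corpus_tags: list of tag
    lists for the n socially tagged videos. Returns {tag: accumulated weight}."""
    q = np.asarray(query_feature, dtype=float)
    dists = np.linalg.norm(corpus_features - q, axis=1)
    neighbours = np.argsort(dists)[:k]
    votes = {}
    for i in neighbours:
        weight = 1.0 / (1.0 + dists[i])          # closer neighbours vote harder
        for tag in corpus_tags[i]:
            votes[tag] = votes.get(tag, 0.0) + weight
    return dict(sorted(votes.items(), key=lambda kv: -kv[1]))

# toy corpus of 4 tagged videos with 2-d features
feats = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
tags = [["dog", "park"], ["dog", "ball"], ["concert"], ["concert", "stage"]]
print(tagbook([0.85, 0.15], feats, tags, k=2))
```

Distance-based voting is just one choice; any monotone similarity weighting works in this toy setting.
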
Svetlana Kordumova, Thomas Mensink, Cees G M Snoek: Pooling Objects for Recognizing Scenes without Examples. ICMR, New York, USA, 2016. (Best paper award)
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-pooling-objects-icmr2016.pdf
Abstract: In this paper we aim to recognize scenes in images without using any scene images as training data. Different from attribute based approaches, we do not carefully select the training classes to match the unseen scene classes. Instead, we propose pooling over ten thousand off-the-shelf object classifiers. To steer the knowledge transfer between objects and scenes we learn a semantic embedding with the aid of a large social multimedia corpus. Our key contributions are: we are the first to investigate pooling over ten thousand object classifiers to recognize scenes without examples; we explore the ontological hierarchy of objects and analyze the influence of object classifiers from different hierarchy levels; we exploit object positions in scene images and we demonstrate a new scene retrieval scenario with complex queries. Finally, we outperform attribute representations on two challenging scene datasets, SUNAttributes and Places2.

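A minimal sketch of the pooling idea, assuming hypothetical word embeddings and object classifier responses: an unseen scene class is scored by pooling object scores weighted by scene-object semantic similarity.

```python
import numpy as np

# hypothetical word embeddings for scene labels and object labels
EMB = {"beach": np.array([0.9, 0.1, 0.0]), "kitchen": np.array([0.0, 0.2, 0.9]),
       "sand":  np.array([0.8, 0.2, 0.1]), "wave":    np.array([0.7, 0.3, 0.0]),
       "stove": np.array([0.0, 0.1, 0.9]), "pan":     np.array([0.1, 0.1, 0.8])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def zero_shot_scene_score(scene, object_scores):
    """Score an unseen scene class by pooling off-the-shelf object classifier
    responses, weighted by scene-object semantic similarity."""
    return sum(cosine(EMB[scene], EMB[obj]) * score
               for obj, score in object_scores.items())

# object classifier responses for one image (hypothetical)
objects = {"sand": 0.8, "wave": 0.6, "stove": 0.05, "pan": 0.1}
for scene in ("beach", "kitchen"):
    print(scene, round(zero_shot_scene_score(scene, objects), 3))
```
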
Pascal Mettes, Dennis Koelma, Cees G M Snoek: The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection. ICMR, New York, USA, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/mettest-imagenetshuffle-icmr2016.pdf, https://staff.fnwi.uva.nl/p.s.m.mettes/codedata.html
Abstract: This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, who all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pre-trained with our reorganized hierarchy i) improves over standard pre-training, ii) is complementary among different reorganizations, iii) maintains the benefits of fusion with other modalities, and iv) leads to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/imagenetshuffle.

Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G M Snoek, Alberto Del Bimbo: Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval. ACM Computing Surveys, 49 (1), pp. 14:1–39, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/li-survey-csur.pdf, https://github.com/li-xirong/jingwei
Abstract: Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems (i.e., image tag assignment, refinement, and tag-based image retrieval) is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, that is, estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this article introduces a two-dimensional taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and difference, and recognize their merits and limitations. For a head-to-head comparison with the state of the art, a new experimental protocol is presented, with training sets containing 10,000, 100,000, and 1 million images, and an evaluation on three test sets, contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future.

Henri Bal, Dick Epema, Cees de Laat, Rob van Nieuwpoort, John Romein, Frank Seinstra, Cees Snoek, Harry Wijshoff: A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term. IEEE Computer, 49 (5), pp. 54–63, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/bal-das-computer.pdf
Abstract: The Dutch Advanced School for Computing and Imaging has built five generations of a 200-node distributed system over nearly two decades while remaining aligned with the shifting computer science research agenda. The system has supported years of award-winning research, underlining the benefits of investing in a smaller-scale, tailored design.

Luming Zhang, Rongrong Ji, Zhen Yi, Weisi Lin, Cees G M Snoek: Special issue on weakly supervised learning. Journal of Visual Communication and Image Representation, 37, pp. 1–2, 2016.

Arnav Agharwal, Rama Kovvuri, Ram Nevatia, Cees G M Snoek: Tag-based Video Retrieval by Embedding Semantic Content in a Continuous Word Space. WACV, pp. 1–8, Lake Placid, USA, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/agharwal-continuous-wacv2016.pdf
Abstract: Content-based event retrieval in unconstrained web videos, based on query tags, is a hard problem due to large intra-class variances, and limited vocabulary and accuracy of the video concept detectors, creating a "semantic query gap". We present a technique to overcome this gap by using continuous word space representations to explicitly compute query and detector concept similarity. This not only allows for fast query-video similarity computation with implicit query expansion, but leads to a compact video representation, which allows implementation of a real-time retrieval system that can fit several thousand videos in a few hundred megabytes of memory. We evaluate the effectiveness of our representation on the challenging NIST MEDTest 2014 dataset.

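The query-detector matching can be sketched as follows, with hypothetical word vectors, detector names and detector-score vectors: query tags are mapped onto detector concepts through word-space similarity (implicit query expansion), and videos represented by compact detector-score vectors are ranked with a single matrix-vector product.

```python
import numpy as np

# hypothetical word vectors for query terms and detector concept names
W = {"birthday": np.array([0.9, 0.1]), "party": np.array([0.8, 0.3]),
     "cake":     np.array([0.7, 0.2]), "balloon": np.array([0.6, 0.4]),
     "engine":   np.array([0.1, 0.9]), "repair":  np.array([0.2, 0.8])}
DETECTORS = ["cake", "balloon", "engine", "repair"]

def query_weights(query_tags):
    """Map query tags onto detector concepts via continuous word-space
    similarity (implicit query expansion)."""
    def sim(a, b):
        return float(W[a] @ W[b] / (np.linalg.norm(W[a]) * np.linalg.norm(W[b])))
    return np.array([max(sim(t, c) for t in query_tags) for c in DETECTORS])

# compact video representation: one detector-score vector per video (hypothetical)
videos = np.array([[0.9, 0.7, 0.0, 0.1],    # video 0: birthday-ish content
                   [0.1, 0.0, 0.8, 0.9]])   # video 1: car-repair-ish content
scores = videos @ query_weights(["birthday", "party"])
print(np.argsort(-scores))   # ranked video indices for the query
```
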
Jitao Sang, Yue Gao, Bing-kun Bao, Cees G M Snoek, Qionghai Dai: Recent advances in social multimedia big data mining and applications. Multimedia Systems, 22 (1), pp. 1–3, 2016.
Abstract: In the past decade, social media contributes significantly to the arrival of the Big Data era. Big Data has not only provided new solutions for social media mining and applications, but brought about a paradigm shift to many fields of data analytics. This special issue solicits recent related attempts in the multimedia community. We believe that the enclosed papers in this special issue provide a unique opportunity for multidisciplinary works connecting both the social media and big data contexts to multimedia computing.

Svetlana Kordumova, Jan C van Gemert, Cees G M Snoek: Exploring the Long Tail of Social Media Tags. International Conference on Multimedia Modelling, Miami, USA, 2016.
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-longtail-mmm2016.pdf
Abstract: There are millions of users who tag multimedia content, generating a large vocabulary of tags. Some tags are frequent, while other tags are rarely used, following a long tail distribution. For frequent tags, most of the multimedia methods that aim to automatically understand audio-visual content give excellent results. It is not clear, however, how these methods will perform on rare tags. In this paper we investigate what social tags constitute the long tail and how they perform on two multimedia retrieval scenarios, tag relevance and detector learning. We show common valuable tags within the long tail, and by augmenting them with semantic knowledge, the performance of tag relevance and detector learning improves substantially.

George Awad, Cees G M Snoek, Alan F Smeaton, Georges Quénot: TRECVid Semantic Indexing of Video: A 6-year Retrospective. ITE Transactions on Media Technology and Applications, 4 (3), pp. 187–208, 2016. (ITE Niwa-Takayanagi Award)
Links: http://isis-data.science.uva.nl/cgmsnoek/pub/awad-trecvid-retrospective-ite.pdf
Abstract: Semantic indexing, or assigning semantic tags to video samples, is a key component for content-based access to video documents and collections. The Semantic Indexing task has been run at TRECVid from 2010 to 2015 with the support of NIST and the Quaero project. As with the previous High-Level Feature detection task which ran from 2002 to 2009, the semantic indexing task aims at evaluating methods and systems for detecting visual, auditory or multi-modal concepts in video shots. In addition to the main semantic indexing task, four secondary tasks were proposed, namely the "localization" task, the "concept pair" task, the "no annotation" task, and the "progress" task. It attracted over 40 research teams during its running period. The task was conducted using a total of 1,400 hours of video data drawn from Internet Archive videos with Creative Commons licenses gathered by NIST. 200 hours of new test data was made available each year, plus 200 more as development data in 2010. The number of target concepts to be detected started from 130 in 2010 and was extended to 346 in 2011. Both the increase in the volume of video data and in the number of target concepts favored the development of generic and scalable methods. Over 8 million shot×concept direct annotations plus over 20 million indirect ones were produced by the participants and the Quaero project on a total of 800 hours of development data. Significant progress was accomplished during the period, as accurately measured in the context of the progress task but also from some of the participants' contrast experiments. This paper describes the data, protocol and metrics used for the main and the secondary tasks, the results obtained and the main approaches used by participants.

2015 |
|
![]() | Efstratios Gavves, Thomas Mensink, Tatiana Tommasi, Cees G M Snoek, Tinne Tuytelaars: Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks. ICCV, Santiago, Chile, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{GavvesICCV15, title = {Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks}, author = {Efstratios Gavves and Thomas Mensink and Tatiana Tommasi and Cees G M Snoek and Tinne Tuytelaars}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-zero-shot-priors-iccv2015.pdf}, year = {2015}, date = {2015-12-01}, booktitle = {ICCV}, address = {Santiago, Chile}, abstract = {How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to the existing classifiers. By revisiting the dual formulation of adaptive SVM, we reveal two basic conditions to choose greedily only the most relevant samples to be annotated. On this basis we propose an effective active learning algorithm which learns the best possible target classification model with minimum human labeling effort. Extensive experiments on two challenging datasets show the value of our approach compared to the state-of-the-art active learning methodologies, as well as its potential to reuse past datasets with minimal effort for future tasks.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to the existing classifiers. By revisiting the dual formulation of adaptive SVM, we reveal two basic conditions to choose greedily only the most relevant samples to be annotated. On this basis we propose an effective active learning algorithm which learns the best possible target classification model with minimum human labeling effort. Extensive experiments on two challenging datasets show the value of our approach compared to the state-of-the-art active learning methodologies, as well as its potential to reuse past datasets with minimal effort for future tasks. |
![]() | Mihir Jain, Jan C van Gemert, Thomas Mensink, Cees G M Snoek: Objects2action: Classifying and localizing actions without any video example. ICCV, Santiago, Chile, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{JainICCV15, title = {Objects2action: Classifying and localizing actions without any video example}, author = {Mihir Jain and Jan C van Gemert and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-objects2action-iccv2015.pdf}, year = {2015}, date = {2015-12-01}, booktitle = {ICCV}, address = {Santiago, Chile}, abstract = {The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach. |
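The central computation described above can be sketched as follows: object and action labels live in a shared word-embedding space, action-to-object affinities follow from embedding similarity, and the action scores of an unseen video are a weighted combination of its object detector scores. This is a sketch of the idea only; the embeddings and object scores below are random stand-ins, and the paper additionally handles multi-word descriptions and uses a convex combination.

```python
# Sketch (not the authors' code): zero-shot action scoring from object scores
# via word-embedding affinities. All vectors below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_objects, n_actions = 50, 1000, 10

object_emb = rng.normal(size=(n_objects, embed_dim))   # skip-gram vectors (assumed)
action_emb = rng.normal(size=(n_actions, embed_dim))

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Affinity between every action and object, computed once from the embeddings.
affinity = cosine(action_emb, object_emb)              # (n_actions, n_objects)

# Keep only the most responsive objects per action, as the abstract suggests.
top_k = 100
for row in affinity:
    row[np.argsort(row)[:-top_k]] = 0.0

# Object classifier scores for one unseen video (placeholder values).
video_object_scores = rng.random(n_objects)

# Weighted combination of affinities (the paper uses a convex combination).
action_scores = affinity @ video_object_scores
print("predicted action:", int(np.argmax(action_scores)))
```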
![]() | Cees G M Snoek, Spencer Cappallo, Daniel Fontijne, David Julian, Dennis C Koelma, Pascal Mettes, Koen E A van de Sande, Anthony Sarah, Harro Stokman, R Blythe Towal: Qualcomm Research and University of Amsterdam at TRECVID 2015: Recognizing Concepts, Objects, and Events in Video. TRECVID, Gaithersburg, USA, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{SnoekTRECVID15, title = {Qualcomm Research and University of Amsterdam at TRECVID 2015: Recognizing Concepts, Objects, and Events in Video}, author = {Cees G M Snoek and Spencer Cappallo and Daniel Fontijne and David Julian and Dennis C Koelma and Pascal Mettes and Koen E A van de Sande and Anthony Sarah and Harro Stokman and R Blythe Towal}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2015-final.pdf}, year = {2015}, date = {2015-11-01}, booktitle = {TRECVID}, address = {Gaithersburg, USA}, abstract = {In this paper we summarize our TRECVID 2015 video recognition experiments. We participated in three tasks: concept detection, object localization, and event recognition, where Qualcomm Research focused on concept detection and object localization and the University of Amsterdam focused on event detection. For concept detection we start from the very deep networks that excelled in the ImageNet 2014 competition and redesign them for the purpose of video recognition, emphasizing on training data augmentation, permutation, and dropout as well as video fine-tuning. Our entry in the localization task is based on classifying a limited number of boxes in each frame using deep learning features. The boxes are proposed by an improved version of selective search. At the core of our multimedia event detection system is an Inception-style deep convolutional neural network that is trained on the full ImageNet hierarchy with 22k categories. We propose several operations that combine and generalize the ImageNet categories to form a desirable set of (super-)categories, while still being able to train a reliable model. The 2015 edition of the TRECVID benchmark has been a fruitful participation for our team, resulting in the best overall result for concept detection, object localization and event detection.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } In this paper we summarize our TRECVID 2015 video recognition experiments. We participated in three tasks: concept detection, object localization, and event recognition, where Qualcomm Research focused on concept detection and object localization and the University of Amsterdam focused on event detection. For concept detection we start from the very deep networks that excelled in the ImageNet 2014 competition and redesign them for the purpose of video recognition, emphasizing on training data augmentation, permutation, and dropout as well as video fine-tuning. Our entry in the localization task is based on classifying a limited number of boxes in each frame using deep learning features. The boxes are proposed by an improved version of selective search. At the core of our multimedia event detection system is an Inception-style deep convolutional neural network that is trained on the full ImageNet hierarchy with 22k categories. We propose several operations that combine and generalize the ImageNet categories to form a desirable set of (super-)categories, while still being able to train a reliable model. 
The 2015 edition of the TRECVID benchmark has been a fruitful participation for our team, resulting in the best overall result for concept detection, object localization and event detection. |
![]() | Spencer Cappallo, Thomas Mensink, Cees G M Snoek: Image2Emoji: Zero-shot Emoji Prediction for Visual Media. MM, Brisbane, Australia, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{CappalloMM15, title = {Image2Emoji: Zero-shot Emoji Prediction for Visual Media}, author = {Spencer Cappallo and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-image2emoji-mm2015.pdf}, year = {2015}, date = {2015-10-01}, booktitle = {MM}, address = {Brisbane, Australia}, abstract = {We present Image2Emoji, a multi-modal approach for generating emoji labels for an image in a zero-shot manner. Different from existing zero-shot image-to-text approaches, we exploit both image and textual media to learn a semantic embedding for the new task of emoji prediction. We propose that the widespread adoption of emoji suggests a semantic universality which is well-suited for interaction with visual media. We quantify the efficacy of our proposed model on the MSCOCO dataset, and demonstrate the value of visual, textual and multi-modal prediction of emoji. We conclude the paper with three examples of the application potential of emoji in the context of multimedia retrieval.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } We present Image2Emoji, a multi-modal approach for generating emoji labels for an image in a zero-shot manner. Different from existing zero-shot image-to-text approaches, we exploit both image and textual media to learn a semantic embedding for the new task of emoji prediction. We propose that the widespread adoption of emoji suggests a semantic universality which is well-suited for interaction with visual media. We quantify the efficacy of our proposed model on the MSCOCO dataset, and demonstrate the value of visual, textual and multi-modal prediction of emoji. We conclude the paper with three examples of the application potential of emoji in the context of multimedia retrieval. |
![]() | Spencer Cappallo, Thomas Mensink, Cees G M Snoek: Query-by-Emoji Video Search. MM, Brisbane, Australia, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{CappalloACM15, title = {Query-by-Emoji Video Search}, author = {Spencer Cappallo and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-emoji2video-mm2015.pdf}, year = {2015}, date = {2015-10-01}, booktitle = {MM}, address = {Brisbane, Australia}, abstract = {This technical demo presents Emoji2Video, a query-by-emoji interface for exploring video collections. Ideogram-based video search and representation presents an opportunity for an intuitive, visual interface and concise non-textual summary of video contents, in a form factor that is ideal for small screens. The demo allows users to build search strings comprised of ideograms which are used to query a large dataset of YouTube videos. The system returns a list of the top-ranking videos for the user query along with an emoji summary of the video contents so that users may make an informed decision whether to view a video or refine their search terms. The ranking of the videos is done in a zero-shot, multi-modal manner that employs an embedding space to exploit semantic relationships between user-selected ideograms and the video's visual and textual content.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This technical demo presents Emoji2Video, a query-by-emoji interface for exploring video collections. Ideogram-based video search and representation presents an opportunity for an intuitive, visual interface and concise non-textual summary of video contents, in a form factor that is ideal for small screens. The demo allows users to build search strings comprised of ideograms which are used to query a large dataset of YouTube videos. The system returns a list of the top-ranking videos for the user query along with an emoji summary of the video contents so that users may make an informed decision whether to view a video or refine their search terms. The ranking of the videos is done in a zero-shot, multi-modal manner that employs an embedding space to exploit semantic relationships between user-selected ideograms and the video's visual and textual content. |
![]() | Jan van Gemert, Mihir Jain, Ella Gati, Cees G M Snoek: APT: Action localization proposals from dense trajectories. BMVC, Swansea, UK, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{GemertBMVC15, title = {APT: Action localization proposals from dense trajectories}, author = {Jan van Gemert and Mihir Jain and Ella Gati and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/gemert-apt-proposals-bmvc2015-corrected.pdf https://github.com/jvgemert/apt}, year = {2015}, date = {2015-09-01}, booktitle = {BMVC}, address = {Swansea, UK}, abstract = {This paper is on action localization in video with the aid of spatio-temporal proposals. To alleviate the computational expensive segmentation step of existing proposals, we propose bypassing the segmentations completely by generating proposals directly from the dense trajectories used to represent videos during classification. Our Action localization Proposals from dense Trajectories (APT) use an efficient proposal generation algorithm to handle the high number of trajectories in a video. Our spatio-temporal proposals are faster than current methods and outperform the localization and classification accuracy of current proposals on the UCF Sports, UCF 101, and MSR-II video datasets. Corrected version: we fixed a mistake in our UCF-101 ground truth. Numbers are different; conclusions are unchanged.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper is on action localization in video with the aid of spatio-temporal proposals. To alleviate the computational expensive segmentation step of existing proposals, we propose bypassing the segmentations completely by generating proposals directly from the dense trajectories used to represent videos during classification. Our Action localization Proposals from dense Trajectories (APT) use an efficient proposal generation algorithm to handle the high number of trajectories in a video. Our spatio-temporal proposals are faster than current methods and outperform the localization and classification accuracy of current proposals on the UCF Sports, UCF 101, and MSR-II video datasets. Corrected version: we fixed a mistake in our UCF-101 ground truth. Numbers are different; conclusions are unchanged. |
![]() | Markus Nagel, Thomas Mensink, Cees G M Snoek: Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams. BMVC, Swansea, UK, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{NagelBMVC15, title = {Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams}, author = {Markus Nagel and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/nagel-event-fisher-bmvc2015.pdf}, year = {2015}, date = {2015-09-01}, booktitle = {BMVC}, address = {Swansea, UK}, abstract = {In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student's-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student's-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets. |
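For reference, a bare-bones Fisher-vector encoding with a diagonal Gaussian mixture, the baseline that the Event Fisher Vector generalizes, might look like the sketch below. It keeps only the gradients with respect to the means, the features are random placeholders, and the paper's Student's-t and HMM variants are not shown.

```python
# Rough sketch of a Fisher-vector style encoding of a set of frame features
# with a Gaussian mixture. Data and sizes are placeholders, not the paper's.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 64))   # e.g. per-frame features of one stream (assumed)
train = rng.normal(size=(2000, 64))   # features to fit the mixture on

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(train)

def fisher_vector(x, gmm):
    n = x.shape[0]
    q = gmm.predict_proba(x)                               # (n, K) soft assignments
    parts = []
    for k in range(gmm.n_components):
        diff = (x - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g_mu = (q[:, k, None] * diff).sum(axis=0) / (n * np.sqrt(gmm.weights_[k]))
        parts.append(g_mu)                                 # gradient w.r.t. the means only
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                 # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)               # L2 normalization

print(fisher_vector(frames, gmm).shape)                    # (n_components * 64,)
```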
![]() | Spencer Cappallo, Thomas Mensink, Cees G M Snoek: Latent Factors of Visual Popularity Prediction. ICMR, Shanghai, China, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{CappalloICMR15, title = {Latent Factors of Visual Popularity Prediction}, author = {Spencer Cappallo and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/cappallo-visual-popularity-icmr2015.pdf}, year = {2015}, date = {2015-06-01}, booktitle = {ICMR}, address = {Shanghai, China}, abstract = {Predicting the popularity of an image on social networks based solely on its visual content is a difficult problem. One image may become widely distributed and repeatedly shared, while another similar image may be totally overlooked. We aim to gain insight into how visual content affects image popularity. We propose a latent ranking approach that takes into account not only the distinctive visual cues in popular images, but also those in unpopular images. This method is evaluated on two existing datasets collected from photo-sharing websites, as well as a new proposed dataset of images from the microblogging website Twitter. Our experiments investigate factors of the ranking model, the level of user engagement in scoring popularity, and whether the discovered senses are meaningful. The proposed approach yields state of the art results, and allows for insight into the semantics of image popularity on social networks.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Predicting the popularity of an image on social networks based solely on its visual content is a difficult problem. One image may become widely distributed and repeatedly shared, while another similar image may be totally overlooked. We aim to gain insight into how visual content affects image popularity. We propose a latent ranking approach that takes into account not only the distinctive visual cues in popular images, but also those in unpopular images. This method is evaluated on two existing datasets collected from photo-sharing websites, as well as a new proposed dataset of images from the microblogging website Twitter. Our experiments investigate factors of the ranking model, the level of user engagement in scoring popularity, and whether the discovered senses are meaningful. The proposed approach yields state of the art results, and allows for insight into the semantics of image popularity on social networks. |
![]() | Amirhossein Habibian, Thomas Mensink, Cees G M Snoek: Discovering Semantic Vocabularies for Cross-Media Retrieval. ICMR, Shanghai, China, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{HabibianICMR15, title = {Discovering Semantic Vocabularies for Cross-Media Retrieval}, author = {Amirhossein Habibian and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-semantic-vocabularies-icmr2015.pdf}, year = {2015}, date = {2015-06-01}, booktitle = {ICMR}, address = {Shanghai, China}, abstract = {This paper proposes a data-driven approach for cross-media retrieval by automatically learning its underlying semantic vocabulary. Different from the existing semantic vocabularies, which are manually pre-defined and annotated, we automatically discover the vocabulary concepts and their annotations from multimedia collections. To this end, we apply a probabilistic topic model on the text available in the collection to extract its semantic structure. Moreover, we propose a learning to rank framework, to effectively learn the concept classifiers from the extracted annotations. We evaluate the discovered semantic vocabulary for cross-media retrieval on three datasets of image/text and video/text pairs. Our experiments demonstrate that the discovered vocabulary does not require any manual labeling to outperform three recent alternatives for cross-media retrieval.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper proposes a data-driven approach for cross-media retrieval by automatically learning its underlying semantic vocabulary. Different from the existing semantic vocabularies, which are manually pre-defined and annotated, we automatically discover the vocabulary concepts and their annotations from multimedia collections. To this end, we apply a probabilistic topic model on the text available in the collection to extract its semantic structure. Moreover, we propose a learning to rank framework, to effectively learn the concept classifiers from the extracted annotations. We evaluate the discovered semantic vocabulary for cross-media retrieval on three datasets of image/text and video/text pairs. Our experiments demonstrate that the discovered vocabulary does not require any manual labeling to outperform three recent alternatives for cross-media retrieval. |
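A minimal sketch of the discovery idea, under assumptions: run a topic model over the text side of the collection, treat the topics as vocabulary concepts, and train one visual predictor per topic on the soft topic assignments. The paper uses a learning-to-rank objective for the concept classifiers; plain ridge regression and toy data are used here only for illustration.

```python
# Sketch under assumptions: LDA topics as discovered vocabulary concepts,
# then one visual predictor per topic. Not the paper's exact pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Ridge

texts = [
    "dog running on the beach with a ball",
    "birthday cake with candles and balloons",
    "dog catching a frisbee in the park",
    "wedding cake and a bride cutting it",
]
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(len(texts), 128))   # paired image/video features (assumed)

vec = CountVectorizer(stop_words="english")
X_text = vec.fit_transform(texts)

n_topics = 2
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(X_text)              # soft topic labels per document

# One regressor per discovered concept, predicting its topic score from visuals.
concept_models = [Ridge(alpha=1.0).fit(visual_feats, doc_topics[:, t])
                  for t in range(n_topics)]

# Cross-media retrieval direction: score a new image against every concept.
new_image = rng.normal(size=(1, 128))
scores = np.array([m.predict(new_image)[0] for m in concept_models])
print("concept scores:", scores)
```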
![]() | Masoud Mazloom, Amirhossein Habibian, Dong Liu, Cees G M Snoek, Shih-Fu Chang: Encoding Concept Prototypes for Video Event Detection and Summarization. ICMR, Shanghai, China, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{MazloomICMR15, title = {Encoding Concept Prototypes for Video Event Detection and Summarization}, author = {Masoud Mazloom and Amirhossein Habibian and Dong Liu and Cees G M Snoek and Shih-Fu Chang}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-concept-prototypes-icmr2015.pdf}, year = {2015}, date = {2015-06-01}, booktitle = {ICMR}, address = {Shanghai, China}, abstract = {This paper proposes a new semantic video representation for few and zero example event detection and unsupervised video event summarization. Different from existing works, which obtain a semantic representation by training concepts over images or entire video clips, we propose an algorithm that learns a set of relevant frames as the concept prototypes from web video examples, without the need for frame-level annotations, and use them for representing an event video. We formulate the problem of learning the concept prototypes as seeking the frames closest to the densest region in the feature space of video frames from both positive and negative training videos of a target concept. We study the behavior of our video event representation based on concept prototypes by performing three experiments on challenging web videos from the TRECVID 2013 multimedia event detection task and the MED-summaries dataset. Our experiments establish that i) Event detection accuracy increases when mapping each video into concept prototype space. ii) Zero-example event detection increases by analyzing each frame of a video individually in concept prototype space, rather than considering the holistic videos. iii) Unsupervised video event summarization using concept prototypes is more accurate than using video-level concept detectors.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper proposes a new semantic video representation for few and zero example event detection and unsupervised video event summarization. Different from existing works, which obtain a semantic representation by training concepts over images or entire video clips, we propose an algorithm that learns a set of relevant frames as the concept prototypes from web video examples, without the need for frame-level annotations, and use them for representing an event video. We formulate the problem of learning the concept prototypes as seeking the frames closest to the densest region in the feature space of video frames from both positive and negative training videos of a target concept. We study the behavior of our video event representation based on concept prototypes by performing three experiments on challenging web videos from the TRECVID 2013 multimedia event detection task and the MED-summaries dataset. Our experiments establish that i) Event detection accuracy increases when mapping each video into concept prototype space. ii) Zero-example event detection increases by analyzing each frame of a video individually in concept prototype space, rather than considering the holistic videos. iii) Unsupervised video event summarization using concept prototypes is more accurate than using video-level concept detectors. |
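The prototype selection step described above can be approximated as below: estimate the density of the frame-feature space, take the frame nearest the densest region as the concept prototype, and represent a video by its similarity to the prototypes. The features are random placeholders and the kernel density estimator is an assumed stand-in for the paper's procedure.

```python
# Sketch only: pick a concept "prototype" frame as the frame in the densest
# region of feature space, then score a video by similarity to the prototype.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 32))   # frames from training videos of one concept (toy)

kde = KernelDensity(bandwidth=1.0).fit(frames)
log_density = kde.score_samples(frames)          # density estimate at each frame
prototype = frames[np.argmax(log_density)]       # frame closest to the densest region

# Represent a test video by its (max) similarity to this concept prototype.
video_frames = rng.normal(size=(120, 32))
similarity = video_frames @ prototype / (
    np.linalg.norm(video_frames, axis=1) * np.linalg.norm(prototype))
print("video score for this concept:", similarity.max())
```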
![]() | Pascal Mettes, Jan C van Gemert, Spencer Cappallo, Thomas Mensink, Cees G M Snoek: Bag-of-Fragments: Selecting and encoding video fragments for event detection and recounting. ICMR, Shanghai, China, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{MettesICMR15, title = {Bag-of-Fragments: Selecting and encoding video fragments for event detection and recounting}, author = {Pascal Mettes and Jan C van Gemert and Spencer Cappallo and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/mettes-bag-of-fragments-icmr2015.pdf}, year = {2015}, date = {2015-06-01}, booktitle = {ICMR}, address = {Shanghai, China}, abstract = {The goal of this paper is event detection and recounting using a representation of concept detector scores. Different from existing work, which encodes videos by averaging concept scores over all frames, we propose to encode videos using fragments that are discriminatively learned per event. Our bag-of-fragments split a video into semantically coherent fragment proposals. From training video proposals we show how to select the most discriminative fragment for an event. An encoding of a video is in turn generated by matching and pooling these discriminative fragments to the fragment proposals of the video. The bag-of-fragments forms an effective encoding for event detection and is able to provide a precise temporally localized event recounting. Furthermore, we show how bag-of-fragments can be extended to deal with irrelevant concepts in the event recounting. Experiments on challenging web videos show that i) our modest number of fragment proposals give a high sub-event recall, ii) bag-of-fragments is complementary to global averaging and provides better event detection, iii) bag-of-fragments with concept filtering yields a desirable event recounting. We conclude that fragments matter for video event detection and recounting.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The goal of this paper is event detection and recounting using a representation of concept detector scores. Different from existing work, which encodes videos by averaging concept scores over all frames, we propose to encode videos using fragments that are discriminatively learned per event. Our bag-of-fragments split a video into semantically coherent fragment proposals. From training video proposals we show how to select the most discriminative fragment for an event. An encoding of a video is in turn generated by matching and pooling these discriminative fragments to the fragment proposals of the video. The bag-of-fragments forms an effective encoding for event detection and is able to provide a precise temporally localized event recounting. Furthermore, we show how bag-of-fragments can be extended to deal with irrelevant concepts in the event recounting. Experiments on challenging web videos show that i) our modest number of fragment proposals give a high sub-event recall, ii) bag-of-fragments is complementary to global averaging and provides better event detection, iii) bag-of-fragments with concept filtering yields a desirable event recounting. We conclude that fragments matter for video event detection and recounting. |
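A compact sketch of the encoding step, not the authors' code: match each event's selected discriminative fragment against a video's fragment proposals and max-pool the similarities, which also points at the proposal that supports the event and so enables the recounting. The discriminative fragment selection itself is omitted, and all vectors are stand-ins for concept-score representations.

```python
# Minimal sketch of bag-of-fragments style matching and pooling on toy data.
import numpy as np

rng = np.random.default_rng(0)
n_events, dim = 5, 300                           # concept-score dimensionality (assumed)
event_fragments = rng.random((n_events, dim))    # one selected fragment per event (toy)
video_proposals = rng.random((40, dim))          # fragment proposals of a test video (toy)

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

# Cosine similarity between each event fragment and every proposal,
# then max-pool over proposals to obtain one value per event.
sims = l2norm(event_fragments) @ l2norm(video_proposals).T   # (n_events, n_proposals)
encoding = sims.max(axis=1)
best_proposal = sims.argmax(axis=1)   # which fragment supports each event (recounting)
print(encoding, best_proposal)
```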
![]() | Mihir Jain, Jan C van Gemert, Cees G M Snoek: What do 15,000 object categories tell us about classifying and localizing actions?. CVPR, Boston, MA, USA, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{JainCVPR15, title = {What do 15,000 object categories tell us about classifying and localizing actions?}, author = {Mihir Jain and Jan C van Gemert and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-objects-actions-cvpr2015.pdf}, year = {2015}, date = {2015-06-01}, booktitle = {CVPR}, address = {Boston, MA, USA}, abstract = {This paper contributes to automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are i) the first in-depth study of encoding objects for actions, ii) we show that objects matter for actions, and are often semantically relevant as well. iii) We establish that actions have object preferences. Rather than using all objects, selection is advantageous for action recognition. iv) We reveal that object-action relations are generic, which allows to transferring these relationships from the one domain to the other. And, v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper contributes to automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are i) the first in-depth study of encoding objects for actions, ii) we show that objects matter for actions, and are often semantically relevant as well. iii) We establish that actions have object preferences. Rather than using all objects, selection is advantageous for action recognition. iv) We reveal that object-action relations are generic, which allows to transferring these relationships from the one domain to the other. And, v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization. |
![]() | Mihir Jain, Jan van Gemert, Pascal Mettes, Cees G M Snoek: University of Amsterdam at THUMOS Challenge 2015. CVPR workshop, Boston, USA, 2015. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{JainTHUMOS15, title = {University of Amsterdam at THUMOS Challenge 2015}, author = {Mihir Jain and Jan van Gemert and Pascal Mettes and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-THUMOS2015-final.pdf}, year = {2015}, date = {2015-06-01}, booktitle = {CVPR workshop}, address = {Boston, USA}, abstract = {This notebook paper describes our approach for the action classification task of the THUMOS 2015 benchmark challenge. We use two types of representations to capture motion and appearance. For a local motion description we employ HOG, HOF and MBH features, computed along the improved dense trajectories. The motion features are encoded into a fixed-length representation using Fisher vectors. For the appearance features, we employ a pre-trained GoogLeNet convolutional network on video frames. VLAD is used to encode the appearance features into a fixed-length representation. All actions are classified with a one-vs-rest linear SVM.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This notebook paper describes our approach for the action classification task of the THUMOS 2015 benchmark challenge. We use two types of representations to capture motion and appearance. For a local motion description we employ HOG, HOF and MBH features, computed along the improved dense trajectories. The motion features are encoded into a fixed-length representation using Fisher vectors. For the appearance features, we employ a pre-trained GoogLeNet convolutional network on video frames. VLAD is used to encode the appearance features into a fixed-length representation. All actions are classified with a one-vs-rest linear SVM. |
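The classification stage of this pipeline reduces to concatenating the motion and appearance encodings and training one-vs-rest linear SVMs, roughly as sketched below. The feature matrices are placeholders standing in for the Fisher-vector and VLAD encodings named in the abstract.

```python
# Sketch of the final classification stage with placeholder features.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_videos, n_classes = 300, 20
motion_fv = rng.normal(size=(n_videos, 512))         # stand-in for IDT Fisher vectors
appearance_vlad = rng.normal(size=(n_videos, 256))   # stand-in for GoogLeNet + VLAD
X = np.hstack([motion_fv, appearance_vlad])
y = rng.integers(0, n_classes, size=n_videos)

clf = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)
print("train accuracy:", clf.score(X, y))
```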
Meng Wang, Ke Lu, Gang Hua, Cees G M Snoek: Guest editorial: selected papers from ICIMCS 2013. Multimedia Systems, 21 (2), pp. 131–132, 2015. (Type: Journal Article | BibTeX) @article{WangMS15, title = {Guest editorial: selected papers from ICIMCS 2013}, author = {Meng Wang and Ke Lu and Gang Hua and Cees G M Snoek}, year = {2015}, date = {2015-03-01}, journal = {Multimedia Systems}, volume = {21}, number = {2}, pages = {131--132}, keywords = {}, pubstate = {published}, tppubtype = {article} } | |
![]() | Svetlana Kordumova, Xirong Li, Cees G M Snoek: Best Practices for Learning Video Concept Detectors from Social Media Examples. Multimedia Tools and Applications, 74 (4), pp. 1291–1315, 2015. (Type: Journal Article | Abstract | Links | BibTeX) @article{KordumovaMMTA15, title = {Best Practices for Learning Video Concept Detectors from Social Media Examples}, author = {Svetlana Kordumova and Xirong Li and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/kordumova-practices-mmta.pdf}, year = {2015}, date = {2015-02-01}, journal = {Multimedia Tools and Applications}, volume = {74}, number = {4}, pages = {1291--1315}, abstract = {Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that key questions, such as how to learn effective video concept detectors from social media examples, remain open. As an initial attempt to answer these questions, we conduct an experimental study using a video search engine which is capable of learning concept detectors from social media examples, be it socially tagged videos or socially tagged images. Within the video search engine we investigate three strategies for positive example selection, three negative example selection strategies and three learning strategies. The performance is evaluated on the challenging TRECVID 2012 benchmark consisting of 600 h of Internet video. From the experiments we derive four best practices: (1) tagged images are a better source for learning video concepts than tagged videos, (2) selecting tag relevant positive training examples is always beneficial, (3) selecting relevant negative examples is advantageous and should be treated differently for video and image sources, and (4) learning concept detectors with selected relevant training data before learning is better than incorporating the relevance during the learning process. The best practices within our video search engine lead to state-of-the-art performance in the TRECVID 2013 benchmark for concept detection without manually provided annotations.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that key questions, such as how to learn effective video concept detectors from social media examples, remain open. As an initial attempt to answer these questions, we conduct an experimental study using a video search engine which is capable of learning concept detectors from social media examples, be it socially tagged videos or socially tagged images. Within the video search engine we investigate three strategies for positive example selection, three negative example selection strategies and three learning strategies. The performance is evaluated on the challenging TRECVID 2012 benchmark consisting of 600 h of Internet video. From the experiments we derive four best practices: (1) tagged images are a better source for learning video concepts than tagged videos, (2) selecting tag relevant positive training examples is always beneficial, (3) selecting relevant negative examples is advantageous and should be treated differently for video and image sources, and (4) learning concept detectors with selected relevant training data before learning is better than incorporating the relevance during the learning process. The best practices within our video search engine lead to state-of-the-art performance in the TRECVID 2013 benchmark for concept detection without manually provided annotations. |
![]() | Efstratios Gavves, Basura Fernando, Cees G M Snoek, Arnold W M Smeulders, Tinne Tuytelaars: Local Alignments for Fine-Grained Categorization. International Journal of Computer Vision, 111 (2), pp. 191–212, 2015. (Type: Journal Article | Abstract | Links | BibTeX) @article{GavvesIJCV15, title = {Local Alignments for Fine-Grained Categorization}, author = {Efstratios Gavves and Basura Fernando and Cees G M Snoek and Arnold W M Smeulders and Tinne Tuytelaars}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/gavves-finegrained-ijcv.pdf}, year = {2015}, date = {2015-01-01}, journal = {International Journal of Computer Vision}, volume = {111}, number = {2}, pages = {191--212}, abstract = {The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape. Then, one may proceed to the differential classification by examining the corresponding regions of the alignments. More specifically, the alignments are used to transfer part annotations from training images to unseen images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We further argue that for the distinction of sub-classes, distribution-based features like color Fisher vectors are better suited for describing localized appearance of fine-grained categories than popular matching oriented intensity features, like HOG. They allow capturing the subtle local differences between subclasses, while at the same time being robust to misalignments between distinctive details. We evaluate the local alignments on the CUB-2011 and on the Stanford Dogs datasets, composed of 200 and 120, visually very hard to distinguish bird and dog species. In our experiments we study and show the benefit of the color Fisher vector parameterization, the influence of the alignment partitioning, and the significance of object segmentation on fine-grained categorization. We, furthermore, show that by using object detectors as voters to generate object confidence saliency maps, we arrive at fully unsupervised, yet highly accurate fine-grained categorization. The proposed local alignments set a new state-of-the-art on both the fine-grained birds and dogs datasets, even without any human intervention. What is more, the local alignments reveal what appearance details are most decisive per fine-grained object category.}, keywords = {}, pubstate = {published}, tppubtype = {article} } The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape. Then, one may proceed to the differential classification by examining the corresponding regions of the alignments. More specifically, the alignments are used to transfer part annotations from training images to unseen images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We further argue that for the distinction of sub-classes, distribution-based features like color Fisher vectors are better suited for describing localized appearance of fine-grained categories than popular matching oriented intensity features, like HOG. 
They allow capturing the subtle local differences between subclasses, while at the same time being robust to misalignments between distinctive details. We evaluate the local alignments on the CUB-2011 and on the Stanford Dogs datasets, composed of 200 and 120, visually very hard to distinguish bird and dog species. In our experiments we study and show the benefit of the color Fisher vector parameterization, the influence of the alignment partitioning, and the significance of object segmentation on fine-grained categorization. We, furthermore, show that by using object detectors as voters to generate object confidence saliency maps, we arrive at fully unsupervised, yet highly accurate fine-grained categorization. The proposed local alignments set a new state-of-the-art on both the fine-grained birds and dogs datasets, even without any human intervention. What is more, the local alignments reveal what appearance details are most decisive per fine-grained object category. |
Amirhossein Habibian, Cees G M Snoek: Vocabularies for Video Event Detection. Webster, John G (Ed.): Wiley Encyclopedia of Electrical and Electronics Engineering, John Wiley & Sons, Inc., 2015. (Type: Book Chapter | Abstract | BibTeX) @inbook{HabibianEEEE15, title = {Vocabularies for Video Event Detection}, author = {Amirhossein Habibian and Cees G M Snoek}, editor = {John G Webster}, year = {2015}, date = {2015-01-01}, booktitle = {Wiley Encyclopedia of Electrical and Electronics Engineering}, publisher = {John Wiley & Sons, Inc.}, abstract = {In general, event detection systems can be characterized by two main components: video vocabulary and event modeling. The video vocabulary converts the raw video data into a feature vector, which contains the multimodal information considered useful for detecting the event of interest. The event modeling component learns a model for each event detection. This article discusses both low- and high-level vocabularies for video event detection. In addition, representative benchmarks for evaluating event detection are also presented.}, keywords = {}, pubstate = {published}, tppubtype = {inbook} } In general, event detection systems can be characterized by two main components: video vocabulary and event modeling. The video vocabulary converts the raw video data into a feature vector, which contains the multimodal information considered useful for detecting the event of interest. The event modeling component learns a model for each event detection. This article discusses both low- and high-level vocabularies for video event detection. In addition, representative benchmarks for evaluating event detection are also presented. | |
2014 |
|
![]() | Masoud Mazloom, Efstratios Gavves, Cees G M Snoek: Conceptlets: Selective Semantics for Classifying Video Events. IEEE Transactions on Multimedia, 16 (8), pp. 2214–2228, 2014. (Type: Journal Article | Abstract | Links | BibTeX) @article{MazloomTMM14, title = {Conceptlets: Selective Semantics for Classifying Video Events}, author = {Masoud Mazloom and Efstratios Gavves and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/mazloom-conceptlets-tmm.pdf}, year = {2014}, date = {2014-12-01}, journal = {IEEE Transactions on Multimedia}, volume = {16}, number = {8}, pages = {2214--2228}, abstract = {An emerging trend in video event classification is to learn an event from a bank of concept detector scores. Different from existing work, which simply relies on a bank containing all available detectors, we propose in this paper an algorithm that learns from examples what concepts in a bank are most informative per event, which we call the conceptlet. We model finding the conceptlet out of a large set of concept detectors as an importance sampling problem. Our proposed approximate algorithm finds the optimal conceptlet using a cross-entropy optimization. We study the behavior of video event classification based on conceptlets by performing four experiments on challenging internet video from the 2010 and 2012 TRECVID multimedia event detection tasks and Columbia's consumer video dataset. Starting from a concept bank of more than a thousand precomputed detectors, our experiments establish there are (sets of) individual concept detectors that are more discriminative and appear to be more descriptive for a particular event than others, event classification using an automatically obtained conceptlet is more robust than using all available concepts, and conceptlets obtained with our cross-entropy algorithm are better than conceptlets from state-of-the-art feature selection algorithms. What is more, the conceptlets make sense for the events of interest, without being programmed to do so.}, keywords = {}, pubstate = {published}, tppubtype = {article} } An emerging trend in video event classification is to learn an event from a bank of concept detector scores. Different from existing work, which simply relies on a bank containing all available detectors, we propose in this paper an algorithm that learns from examples what concepts in a bank are most informative per event, which we call the conceptlet. We model finding the conceptlet out of a large set of concept detectors as an importance sampling problem. Our proposed approximate algorithm finds the optimal conceptlet using a cross-entropy optimization. We study the behavior of video event classification based on conceptlets by performing four experiments on challenging internet video from the 2010 and 2012 TRECVID multimedia event detection tasks and Columbia's consumer video dataset. Starting from a concept bank of more than a thousand precomputed detectors, our experiments establish there are (sets of) individual concept detectors that are more discriminative and appear to be more descriptive for a particular event than others, event classification using an automatically obtained conceptlet is more robust than using all available concepts, and conceptlets obtained with our cross-entropy algorithm are better than conceptlets from state-of-the-art feature selection algorithms. What is more, the conceptlets make sense for the events of interest, without being programmed to do so. |
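The cross-entropy search for a conceptlet can be sketched as follows, under assumptions: keep Bernoulli inclusion probabilities over the concept bank, sample candidate subsets, score each with a simple event classifier, and move the probabilities toward the best-scoring subsets. The data, the classifier, and the hyperparameters below are illustrative, not those of the paper.

```python
# Illustrative sketch (not the authors' implementation) of cross-entropy
# selection of an informative concept subset ("conceptlet") on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_videos, n_concepts = 200, 100
scores = rng.random((n_videos, n_concepts))   # bank of concept detector scores (toy)
labels = rng.integers(0, 2, size=n_videos)    # event vs. rest (toy)

p = np.full(n_concepts, 0.2)                  # Bernoulli inclusion probabilities
n_samples, elite_frac, smoothing = 30, 0.2, 0.7

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = LogisticRegression(max_iter=200)
    return cross_val_score(clf, scores[:, mask], labels, cv=3).mean()

for _ in range(10):                           # cross-entropy iterations
    masks = rng.random((n_samples, n_concepts)) < p
    fits = np.array([fitness(m) for m in masks])
    elite = masks[np.argsort(fits)[-int(elite_frac * n_samples):]]
    p = smoothing * elite.mean(axis=0) + (1 - smoothing) * p   # move toward elites

conceptlet = np.where(p > 0.5)[0]
print("selected concepts:", conceptlet)
```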
![]() | Amirhossein Habibian, Thomas Mensink, Cees G M Snoek: VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events. MM, pp. 17–26, Orlando, Florida, USA, 2014, (Best paper award). (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{HabibianMM14, title = {VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events}, author = {Amirhossein Habibian and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-videostory-mm2014.pdf}, year = {2014}, date = {2014-11-01}, booktitle = {MM}, pages = {17--26}, address = {Orlando, Florida, USA}, abstract = {This paper proposes a new video representation for few-example event recognition and translation. Different from existing representations, which rely on either low-level features, or pre-specified attributes, we propose to learn an embedding from videos and their descriptions. In our embedding, which we call VideoStory, correlated term labels are combined if their combination improves the video classifier prediction. Our proposed algorithm prevents the combination of correlated terms which are visually dissimilar by optimizing a joint-objective balancing descriptiveness and predictability. The algorithm learns from textual descriptions of video content, which we obtain for free from the web by a simple spidering procedure. We use our VideoStory representation for few-example recognition of events on more than 65K challenging web videos from the NIST TRECVID event detection task and the Columbia Consumer Video collection. Our experiments establish that i) VideoStory outperforms an embedding without joint-objective and alternatives without any embedding, ii) The varying quality of input video descriptions from the web is compensated by harvesting more data, iii) VideoStory sets a new state-of-the-art for few-example event recognition, outperforming very recent attribute and low-level motion encodings. What is more, VideoStory translates a previously unseen video to its most likely description from visual content only.}, note = {Best paper award}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper proposes a new video representation for few-example event recognition and translation. Different from existing representations, which rely on either low-level features, or pre-specified attributes, we propose to learn an embedding from videos and their descriptions. In our embedding, which we call VideoStory, correlated term labels are combined if their combination improves the video classifier prediction. Our proposed algorithm prevents the combination of correlated terms which are visually dissimilar by optimizing a joint-objective balancing descriptiveness and predictability. The algorithm learns from textual descriptions of video content, which we obtain for free from the web by a simple spidering procedure. We use our VideoStory representation for few-example recognition of events on more than 65K challenging web videos from the NIST TRECVID event detection task and the Columbia Consumer Video collection. Our experiments establish that i) VideoStory outperforms an embedding without joint-objective and alternatives without any embedding, ii) The varying quality of input video descriptions from the web is compensated by harvesting more data, iii) VideoStory sets a new state-of-the-art for few-example event recognition, outperforming very recent attribute and low-level motion encodings. 
What is more, VideoStory translates a previously unseen video to its most likely description from visual content only. |
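One way to read the joint objective is as a latent "story" matrix that both reconstructs the description terms and stays predictable from the video features; the sketch below solves such a formulation with plain alternating least squares. This is an assumed simplification for illustration, not the paper's objective or optimizer.

```python
# Compact sketch, under assumptions, of a VideoStory-style joint embedding:
# latent story S reconstructs term matrix Y while being predictable from
# video features X. Solved with alternating least squares on toy data.
import numpy as np

rng = np.random.default_rng(0)
n_videos, n_terms, d, k, lam, eps = 500, 2000, 256, 64, 1.0, 1e-3
Y = (rng.random((n_videos, n_terms)) < 0.01).astype(float)   # term occurrences (toy)
X = rng.normal(size=(n_videos, d))                           # video features (toy)

S = rng.normal(size=(n_videos, k))
for _ in range(20):
    # descriptiveness: predict the terms from the latent story
    A = np.linalg.solve(S.T @ S + eps * np.eye(k), S.T @ Y)   # (k, n_terms)
    # predictability: predict the latent story from the video features
    W = np.linalg.solve(X.T @ X + eps * np.eye(d), X.T @ S)   # (d, k)
    # latent story balancing both terms; the system matrix is symmetric
    S = np.linalg.solve(A @ A.T + lam * np.eye(k),
                        (Y @ A.T + lam * X @ W).T).T

# A new video is embedded by W and can be classified or described in S space.
new_video = rng.normal(size=(1, d))
print((new_video @ W).shape)
```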
![]() | Cees G M Snoek, Koen E A van de Sande, Daniel Fontijne, Spencer Cappallo, Jan van Gemert, Amirhossein Habibian, Thomas Mensink, Pascal Mettes, Ran Tao, Dennis C Koelma, Arnold W M Smeulders: MediaMill at TRECVID 2014: Searching Concepts, Objects, Instances and Events in Video. TRECVID, Orlando USA, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{SnoekTRECVID14, title = {MediaMill at TRECVID 2014: Searching Concepts, Objects, Instances and Events in Video}, author = {Cees G M Snoek and Koen E A van de Sande and Daniel Fontijne and Spencer Cappallo and Jan van Gemert and Amirhossein Habibian and Thomas Mensink and Pascal Mettes and Ran Tao and Dennis C Koelma and Arnold W M Smeulders}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/mediamill-TRECVID2014-final.pdf}, year = {2014}, date = {2014-11-01}, booktitle = {TRECVID}, address = {Orlando USA}, abstract = {In this paper we summarize our TRECVID 2014 video retrieval experiments. The MediaMill team participated in five tasks: concept detection, object localization, instance search, event recognition and recounting. We experimented with concept detection using deep learning and color difference coding, object localization using FLAIR, instance search by one example, event recognition based on VideoStory, and event recounting using COSTA. Our experiments focus on establishing the video retrieval value of these innovations. The 2014 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the best result for concept detection and object localization.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } In this paper we summarize our TRECVID 2014 video retrieval experiments. The MediaMill team participated in five tasks: concept detection, object localization, instance search, event recognition and recounting. We experimented with concept detection using deep learning and color difference coding, object localization using FLAIR, instance search by one example, event recognition based on VideoStory, and event recounting using COSTA. Our experiments focus on establishing the video retrieval value of these innovations. The 2014 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the best result for concept detection and object localization. |
![]() | Robert C Bolles, Brian J Burns, James A Herson, Gregory K Myers, Julien van Hout, Wen Wang, Julie Wong, Eric Yeh, Amirhossein Habibian, Dennis C Koelma, Thomas Mensink, Arnold W M Smeulders, Cees G M Snoek, Arnav Agharwal, Song Cao, Kan Chen, Rama Kovvuri, Ram Nevatia, Pramod Sharma: The 2014 SESAME Multimedia Event Detection and Recounting System. TRECVID, Orlando USA, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{SesameTRECVID14, title = {The 2014 SESAME Multimedia Event Detection and Recounting System}, author = {Robert C Bolles and Brian J Burns and James A Herson and Gregory K Myers and Julien van Hout and Wen Wang and Julie Wong and Eric Yeh and Amirhossein Habibian and Dennis C Koelma and Thomas Mensink and Arnold W M Smeulders and Cees G M Snoek and Arnav Agharwal and Song Cao and Kan Chen and Rama Kovvuri and Ram Nevatia and Pramod Sharma}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/Sesame-TRECVID2014-final.pdf}, year = {2014}, date = {2014-11-01}, booktitle = {TRECVID}, address = {Orlando USA}, abstract = {The SESAME (video SEarch with Speed and Accuracy for Multimedia Events) team submitted six runs as a full participant in the Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) evaluations. The SESAME system combines low-level visual, audio, and motion features; high-level semantic concepts for visual objects, scenes, persons, sounds, and actions; automatic speech recognition (ASR); and video optical character recognition (OCR). These three types of features and five types of concepts were used in eight event classifiers. One of the event classifiers, VideoStory, is a new approach that exploits the relationship between semantic concepts and imagery in a large training corpus. The SESAME system uses a total of over 18,000 concepts. We combined the event-detection results for these classifiers using a log-likelihood ratio (LLR) late-fusion method, which uses logistic regression to learn combination weights for event-detection scores from multiple classifiers originating from different data types. The SESAME system generated event recountings based on visual and action concepts, and on concepts recognized by ASR and OCR. Training data included the MED Research dataset, ImageNet, a video dataset from YouTube, the UCF101 and HMDB51 action datasets, the NIST SIN dataset, and Wikipedia. The components that contributed most significantly to event-detection performance were the low- and high-level visual features, low-level motion features, and VideoStory. The LLR late-fusion method significantly improved performance over the best individual classifier for 100Ex and 010Ex. For the Semantic Query (SQ), equal fusion weights, instead of the LLR method, were used in fusion due to the absence of training data.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The SESAME (video SEarch with Speed and Accuracy for Multimedia Events) team submitted six runs as a full participant in the Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) evaluations. The SESAME system combines low-level visual, audio, and motion features; high-level semantic concepts for visual objects, scenes, persons, sounds, and actions; automatic speech recognition (ASR); and video optical character recognition (OCR). These three types of features and five types of concepts were used in eight event classifiers. 
One of the event classifiers, VideoStory, is a new approach that exploits the relationship between semantic concepts and imagery in a large training corpus. The SESAME system uses a total of over 18,000 concepts. We combined the event-detection results for these classifiers using a log-likelihood ratio (LLR) late-fusion method, which uses logistic regression to learn combination weights for event-detection scores from multiple classifiers originating from different data types. The SESAME system generated event recountings based on visual and action concepts, and on concepts recognized by ASR and OCR. Training data included the MED Research dataset, ImageNet, a video dataset from YouTube, the UCF101 and HMDB51 action datasets, the NIST SIN dataset, and Wikipedia. The components that contributed most significantly to event-detection performance were the low- and high-level visual features, low-level motion features, and VideoStory. The LLR late-fusion method significantly improved performance over the best individual classifier for 100Ex and 010Ex. For the Semantic Query (SQ), equal fusion weights, instead of the LLR method, were used in fusion due to the absence of training data. |
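A minimal sketch of the late-fusion step described in this abstract, where logistic regression learns combination weights over per-classifier event-detection scores. The data, dimensions, and the `fuse_scores` helper are illustrative stand-ins, not the SESAME system's code.

```python
# Late fusion with logistic regression, in the spirit of the LLR fusion above:
# per-classifier event scores are stacked as features and a logistic regressor
# learns their combination weights. Synthetic data; names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Development data: rows are videos, columns are scores from different
# event classifiers (e.g. visual, motion, ASR- and OCR-based).
n_videos, n_classifiers = 200, 8
scores = rng.normal(size=(n_videos, n_classifiers))
labels = rng.integers(0, 2, size=n_videos)          # 1 = event present

# Learn per-classifier combination weights on the held-out scores.
fusion = LogisticRegression(max_iter=1000).fit(scores, labels)

def fuse_scores(test_scores):
    """Return a single fused event-detection score per test video."""
    return fusion.decision_function(test_scores)

fused = fuse_scores(rng.normal(size=(5, n_classifiers)))
print(fused.shape)  # (5,)
```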
![]() | Zhenyang Li, Efstratios Gavves, Thomas Mensink, Cees G M Snoek: Attributes Make Sense on Segmented Objects. ECCV, Zürich, Switzerland, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{LiECCV14, title = {Attributes Make Sense on Segmented Objects}, author = {Zhenyang Li and Efstratios Gavves and Thomas Mensink and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/li-object-level-attributes-eccv2014.pdf}, year = {2014}, date = {2014-09-01}, booktitle = {ECCV}, address = {Zürich, Switzerland}, abstract = {In this paper we aim for object classification and segmentation by attributes. Where existing work considers attributes either for the global image or for the parts of the object, we propose, as our first novelty, to learn and extract attributes on segments containing the entire object. Object-level attributes suffer less from accidental content around the object and accidental image conditions such as partial occlusions, scale changes and viewpoint changes. As our second novelty, we propose joint learning for simultaneous object classification and segment proposal ranking, solely on the basis of attributes. This naturally brings us to our third novelty: object-level attributes for zero-shot, where we use attribute descriptions of unseen classes for localizing their instances in new images and classifying them accordingly. Results on the Caltech UCSD Birds, Leeds Butterflies, and an a-Pascal subset demonstrate that i) extracting attributes on oracle object-level brings substantial benefits ii) our joint learning model leads to accurate attribute-based classification and segmentation, approaching the oracle results and iii) object-level attributes also allow for zero-shot classification and segmentation. We conclude that attributes make sense on segmented objects.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } In this paper we aim for object classification and segmentation by attributes. Where existing work considers attributes either for the global image or for the parts of the object, we propose, as our first novelty, to learn and extract attributes on segments containing the entire object. Object-level attributes suffer less from accidental content around the object and accidental image conditions such as partial occlusions, scale changes and viewpoint changes. As our second novelty, we propose joint learning for simultaneous object classification and segment proposal ranking, solely on the basis of attributes. This naturally brings us to our third novelty: object-level attributes for zero-shot, where we use attribute descriptions of unseen classes for localizing their instances in new images and classifying them accordingly. Results on the Caltech UCSD Birds, Leeds Butterflies, and an a-Pascal subset demonstrate that i) extracting attributes on oracle object-level brings substantial benefits ii) our joint learning model leads to accurate attribute-based classification and segmentation, approaching the oracle results and iii) object-level attributes also allow for zero-shot classification and segmentation. We conclude that attributes make sense on segmented objects. |
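As a rough illustration of attribute-based zero-shot scoring on an object segment, the sketch below predicts attribute evidence for a segment feature and matches it against attribute descriptions of unseen classes. This is a generic, simplified scheme, not the paper's joint classification-and-ranking model; all data and names are hypothetical.

```python
# Simplified attribute-based zero-shot classification on an object segment:
# predict attribute scores for the segment, then score each unseen class by
# how well those scores match its attribute description. Generic sketch only,
# not the joint learning model of the paper above.
import numpy as np

rng = np.random.default_rng(0)

n_attributes, feat_dim = 64, 512
attribute_classifiers = rng.normal(size=(n_attributes, feat_dim))  # one linear model per attribute

# Binary attribute descriptions of unseen classes (rows), e.g. from field guides.
unseen_class_descriptions = rng.integers(0, 2, size=(3, n_attributes)).astype(float)

def zero_shot_scores(segment_feature):
    """Score unseen classes for one segment feature vector."""
    attr_scores = attribute_classifiers @ segment_feature   # per-attribute evidence
    attr_scores = np.tanh(attr_scores)                       # squash to a comparable range
    return unseen_class_descriptions @ attr_scores           # match against descriptions

segment_feature = rng.normal(size=feat_dim)
print(zero_shot_scores(segment_feature))  # one score per unseen class
```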
![]() | Mihir Jain, Jan van Gemert, Cees G M Snoek: University of Amsterdam at THUMOS Challenge 2014. ECCV workshop, Zürich, Switzerland, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{JainTHUMOS14, title = {University of Amsterdam at THUMOS Challenge 2014}, author = {Mihir Jain and Jan van Gemert and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-THUMOS2014-final.pdf}, year = {2014}, date = {2014-09-01}, booktitle = {ECCV workshop}, address = {Zürich, Switzerland}, abstract = {This notebook paper describes our approach for the action classification task of the THUMOS Challenge 2014. We investigate and exploit the action-object relationship by capturing both motion and related objects. As local descriptors we use HOG, HOF and MBH computed along the improved dense trajectories. For video encoding we rely on Fisher vector. In addition, we employ deep net features learned from object attributes to capture action context. All actions are classified with a one-versus-rest linear SVM.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This notebook paper describes our approach for the action classification task of the THUMOS Challenge 2014. We investigate and exploit the action-object relationship by capturing both motion and related objects. As local descriptors we use HOG, HOF and MBH computed along the improved dense trajectories. For video encoding we rely on Fisher vector. In addition, we employ deep net features learned from object attributes to capture action context. All actions are classified with a one-versus-rest linear SVM. |
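The general pipeline this notebook paper relies on (local descriptors, Fisher vector encoding, one-versus-rest linear SVM) can be sketched as follows. Only the first-order (mean) gradients of the Fisher vector are shown, and random data replaces the HOG/HOF/MBH trajectory descriptors; the sketch is illustrative, not the authors' implementation.

```python
# Fisher-vector + one-vs-rest linear SVM pipeline sketch. Only the first-order
# (mean) part of the Fisher vector is computed; descriptor extraction along
# improved dense trajectories is replaced by random data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
K, D = 16, 32                                    # GMM components, descriptor dimension

gmm = GaussianMixture(n_components=K, covariance_type='diag', random_state=0)
gmm.fit(rng.normal(size=(2000, D)))              # stand-in for a large descriptor sample

def fisher_vector(descriptors):
    """First-order Fisher vector of local descriptors (T, D) -> (K*D,)."""
    q = gmm.predict_proba(descriptors)                          # (T, K) posteriors
    diff = descriptors[:, None, :] - gmm.means_[None, :, :]     # (T, K, D)
    fv = (q[:, :, None] * diff / np.sqrt(gmm.covariances_)[None]).sum(axis=0)
    fv /= descriptors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    fv = fv.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                      # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                    # l2 normalization

# Encode a few toy videos and train one-vs-rest linear SVMs on action labels.
X = np.stack([fisher_vector(rng.normal(size=(300, D))) for _ in range(20)])
y = rng.integers(0, 4, size=20)
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict(X[:3]))
```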
![]() | Amirhossein Habibian, Cees G M Snoek: Recommendations for Recognizing Video Events by Concept Vocabularies. Computer Vision and Image Understanding, 124 , pp. 110–122, 2014. (Type: Journal Article | Abstract | Links | BibTeX) @article{HabibianCVIU14, title = {Recommendations for Recognizing Video Events by Concept Vocabularies}, author = {Amirhossein Habibian and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/habibian-recommendations-cviu.pdf}, year = {2014}, date = {2014-07-01}, journal = {Computer Vision and Image Understanding}, volume = {124}, pages = {110--122}, abstract = {Representing videos using vocabularies composed of concept detectors appears promising for generic event recognition. While many have recently shown the benefits of concept vocabularies for recognition, studying the characteristics of a universal concept vocabulary suited for representing events is ignored. In this paper, we study how to create an effective vocabulary for arbitrary-event recognition in web video. We consider five research questions related to the number, the type, the specificity, the quality and the normalization of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1346 concept detectors trained on publicly available annotations, two large arbitrary web video datasets and a common event recognition pipeline allow us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should (i) contain more than 200 concepts, (ii) be diverse by covering object, action, scene, people, animal and attribute concepts, (iii) include both general and specific concepts, (iv) increase the number of concepts rather than improve the quality of the individual detectors, and (v) contain detectors that are appropriately normalized. We consider the recommendations for recognizing video events by concept vocabularies the most important contribution of the paper, as they provide guidelines for future work.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Representing videos using vocabularies composed of concept detectors appears promising for generic event recognition. While many have recently shown the benefits of concept vocabularies for recognition, studying the characteristics of a universal concept vocabulary suited for representing events is ignored. In this paper, we study how to create an effective vocabulary for arbitrary-event recognition in web video. We consider five research questions related to the number, the type, the specificity, the quality and the normalization of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1346 concept detectors trained on publicly available annotations, two large arbitrary web video datasets and a common event recognition pipeline allow us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should (i) contain more than 200 concepts, (ii) be diverse by covering object, action, scene, people, animal and attribute concepts, (iii) include both general and specific concepts, (iv) increase the number of concepts rather than improve the quality of the individual detectors, and (v) contain detectors that are appropriately normalized. 
We consider the recommendations for recognizing video events by concept vocabularies the most important contribution of the paper, as they provide guidelines for future work. |
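A minimal sketch of the concept-vocabulary representation this study analyzes: a video becomes a vector of concept-detector scores, normalized per recommendation (v), on which a linear event classifier is trained. Detector outputs are simulated; the variable names are illustrative.

```python
# Event recognition on a concept-vocabulary representation: each video is a
# vector of normalized concept-detector scores, fed to a linear event classifier.
# Detector outputs are simulated here.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_videos, n_concepts = 100, 300                    # > 200 concepts, per recommendation (i)

# Pretend these are averaged per-video outputs of pre-trained concept detectors.
raw_scores = rng.normal(size=(n_videos, n_concepts))

# Recommendation (v): normalize detector scores so they are comparable,
# here with a simple per-concept z-score.
concept_vectors = (raw_scores - raw_scores.mean(0)) / (raw_scores.std(0) + 1e-12)

event_labels = rng.integers(0, 2, size=n_videos)   # e.g. one event class vs. rest
event_classifier = LinearSVC().fit(concept_vectors, event_labels)
print(event_classifier.decision_function(concept_vectors[:3]))
```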
![]() | Mihir Jain, Jan C van Gemert, Hervé Jégou, Patrick Bouthemy, Cees G M Snoek: Action Localization by Tubelets from Motion. CVPR, Columbus, Ohio, USA, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{JainCVPR14, title = {Action Localization by Tubelets from Motion}, author = {Mihir Jain and Jan C van Gemert and Hervé Jégou and Patrick Bouthemy and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/jain-tubelets-cvpr2014.pdf}, year = {2014}, date = {2014-06-01}, booktitle = {CVPR}, address = {Columbus, Ohio, USA}, abstract = {This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in the context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-the-art on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in the context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-the-art on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences. |
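To make the tubelet representation concrete, the tiny sketch below converts a spatiotemporal super-voxel, given as a binary (T, H, W) mask, into a sequence of per-frame bounding boxes. The method's actual super-voxel merging and motion-based selection criterion are not shown; this only illustrates the output data structure.

```python
# A spatiotemporal super-voxel (binary mask over T frames) becomes a tubelet,
# i.e. a sequence of per-frame bounding boxes. Merging of super-voxels and the
# motion criterion of the method itself are not shown.
import numpy as np

def mask_to_tubelet(mask):
    """Return [(frame, x_min, y_min, x_max, y_max), ...] for a 3D binary mask."""
    tubelet = []
    for t, frame in enumerate(mask):
        ys, xs = np.nonzero(frame)
        if len(xs) == 0:                      # super-voxel absent in this frame
            continue
        tubelet.append((t, xs.min(), ys.min(), xs.max(), ys.max()))
    return tubelet

# Toy super-voxel: a small square moving to the right over 3 frames.
mask = np.zeros((3, 10, 10), dtype=bool)
for t in range(3):
    mask[t, 4:7, 2 + t:5 + t] = True
print(mask_to_tubelet(mask))
```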
![]() | Thomas Mensink, Efstratios Gavves, Cees G M Snoek: COSTA: Co-Occurrence Statistics for Zero-Shot Classification. CVPR, Columbus, Ohio, USA, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{MensinkCVPR14, title = {COSTA: Co-Occurrence Statistics for Zero-Shot Classification}, author = {Thomas Mensink and Efstratios Gavves and Cees G M Snoek}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/mensink-co-occurrence-cvpr2014.pdf}, year = {2014}, date = {2014-06-01}, booktitle = {CVPR}, address = {Columbus, Ohio, USA}, abstract = {In this paper we aim for zero-shot classification, that is visual recognition of an unseen class by using knowledge transfer from known classes. Our main contribution is COSTA, which exploits co-occurrences of visual concepts in images for knowledge transfer. These inter-dependencies arise naturally between concepts, and are easy to obtain from existing annotations or web-search hit counts. We estimate a classifier for a new label, as a weighted combination of related classes, using the co-occurrences to define the weight. We propose various metrics to leverage these co-occurrences, and a regression model for learning a weight for each related class. We also show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three multi-labeled datasets reveal that our proposed zero-shot methods, are approaching and occasionally outperforming fully supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } In this paper we aim for zero-shot classification, that is visual recognition of an unseen class by using knowledge transfer from known classes. Our main contribution is COSTA, which exploits co-occurrences of visual concepts in images for knowledge transfer. These inter-dependencies arise naturally between concepts, and are easy to obtain from existing annotations or web-search hit counts. We estimate a classifier for a new label, as a weighted combination of related classes, using the co-occurrences to define the weight. We propose various metrics to leverage these co-occurrences, and a regression model for learning a weight for each related class. We also show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three multi-labeled datasets reveal that our proposed zero-shot methods, are approaching and occasionally outperforming fully supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification. |
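The core COSTA idea, building a classifier for an unseen label as a co-occurrence-weighted combination of known-class classifiers, fits in a few lines. The weighting below is a simple normalized co-occurrence row with toy numbers, not necessarily the exact metric or regression model proposed in the paper.

```python
# COSTA-style zero-shot classifier: a weighted combination of known-class
# classifiers, with weights derived from label co-occurrence statistics.
# Toy data; the weighting is a simplification of the paper's metrics.
import numpy as np

rng = np.random.default_rng(0)
n_known, feat_dim = 10, 128

W_known = rng.normal(size=(n_known, feat_dim))       # one linear classifier per known class

# Co-occurrence of the unseen label with each known label, e.g. from existing
# multi-label annotations or web-search hit counts (toy numbers here).
cooccurrence = rng.random(n_known)
weights = cooccurrence / cooccurrence.sum()

w_unseen = weights @ W_known                         # zero-shot classifier for the new label

x = rng.normal(size=feat_dim)                        # image feature
print(float(w_unseen @ x))                           # zero-shot score for the unseen label
```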
![]() | Koen E A van de Sande, Cees G M Snoek, Arnold W M Smeulders: Fisher and VLAD with FLAIR. CVPR, Columbus, Ohio, USA, 2014. (Type: Inproceedings | Abstract | Links | BibTeX) @inproceedings{SandeCVPR14, title = {Fisher and VLAD with FLAIR}, author = {Koen E A van de Sande and Cees G M Snoek and Arnold W M Smeulders}, url = {http://isis-data.science.uva.nl/cgmsnoek/pub/sande-flair-cvpr2014.pdf}, year = {2014}, date = {2014-06-01}, booktitle = {CVPR}, address = {Columbus, Ohio, USA}, abstract = {A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLADs difference coding, even with l2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The results are a 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLADs difference coding, even with l2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The results are a 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge. |
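A small sketch of the integral-image idea behind FLAIR, shown for a plain bag-of-words encoding: with one integral image per codeword, the histogram of any box needs only four lookups per codeword instead of a pass over the box. The sparse storage and the exact VLAD/Fisher variants of FLAIR are not shown.

```python
# One integral image per codeword lets the bag-of-words histogram of an
# arbitrary box be read off with four corner lookups per codeword.
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 60, 80, 5                                   # image size, codebook size

# Per-pixel hard codeword assignments of densely sampled descriptors.
assignments = rng.integers(0, K, size=(H, W))
one_hot = np.eye(K)[assignments]                      # (H, W, K)

# One integral image per codeword, zero-padded so box sums index cleanly.
integral = np.zeros((H + 1, W + 1, K))
integral[1:, 1:] = one_hot.cumsum(0).cumsum(1)

def box_histogram(y0, x0, y1, x1):
    """Bag-of-words histogram of box [y0:y1, x0:x1) in O(K), not O(box area)."""
    return (integral[y1, x1] - integral[y0, x1]
            - integral[y1, x0] + integral[y0, x0])

# The fast histogram matches a brute-force count over the box.
h_fast = box_histogram(10, 20, 40, 70)
h_slow = one_hot[10:40, 20:70].sum((0, 1))
print(np.allclose(h_fast, h_slow))                    # True
```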
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.