ISOMER

The ICMR2014 paper ISOMER: Informative Segment Observations for Multimedia Event Recounting by Chen Sun, Brian Burns, Ram Nevatia, Cees G. M. Snoek, Bob Bolles, Greg Myers, Wen Wang and Eric Yeh is now available. This paper describes a system for multimedia event detection and recounting. The goal is to detect a high-level event class in unconstrained web videos and generate an event-oriented summary for display to users. For this purpose, we detect informative segments and collect observations for them, leading to our ISOMER system. We combine a large collection of both low-level and semantic-level visual and audio features for event detection. For event recounting, we propose a novel approach to identify event-oriented discriminative video segments and their descriptions with a linear SVM event classifier. User-friendly concepts including objects, actions, scenes, speech and optical character recognition are used in generating descriptions. We also develop several mapping and filtering strategies to cope with noisy concept detectors. Our system performed competitively in the TRECVID 2013 Multimedia Event Detection task with nearly 100,000 videos and was the highest performer in the TRECVID 2013 Multimedia Event Recounting task.
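To make the recounting idea concrete, here is a minimal sketch of how a linear event classifier can rank segments and name the concepts that drive the score. It is not the ISOMER implementation: the function name `recount`, the concept names, the weights and the segment features are all hypothetical, and it only assumes that each segment is described by concept detector scores and that the event model is linear.

```python
def recount(segments, weights, bias, concept_names, top_segments=1, top_concepts=2):
    """Rank segments by linear classifier score and list the top concepts.

    segments: dict mapping a segment id to its list of concept scores
    weights, bias: parameters of an (assumed, already trained) linear event model
    """
    scored = []
    for segment_id, features in segments.items():
        # With a linear model the score is a sum of per-concept contributions.
        contributions = [w * x for w, x in zip(weights, features)]
        scored.append((sum(contributions) + bias, segment_id, contributions))
    scored.sort(key=lambda item: item[0], reverse=True)

    result = []
    for score, segment_id, contributions in scored[:top_segments]:
        order = sorted(range(len(contributions)),
                       key=lambda i: contributions[i], reverse=True)
        concepts = [concept_names[i] for i in order[:top_concepts]]
        result.append((segment_id, score, concepts))
    return result


# Hypothetical data: two segments described by three concept detectors.
example_segments = {"s1": [0.9, 0.1, 0.0], "s2": [0.1, 0.2, 0.8]}
example = recount(example_segments, [1.0, 0.5, -0.5], 0.0,
                  ["cake", "people", "road"])
```

Because the score is a plain sum, the same per-concept contributions that select a segment can be reused as its description.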

Four papers have been accepted by the leading IEEE Conference on Computer Vision and Pattern Recognition (CVPR). This is a new Dutch record. CVPR is the only conference in Google Scholar's top 100 most cited sources; all other entries are journals. The list starts with Nature, followed by many major journals from other fields such as PLoS One at 36, Nature Neuroscience at 73, and Astronomy and Astrophysics at 99. It is no surprise that the only conference is in computer science, as progress in this field is fast.

The accepted papers are:

  • Locality in Generic Instance Search from One Example. Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W.M. Smeulders.
  • Fisher and VLAD with FLAIR. Koen E.A. van de Sande, Cees G.M. Snoek, and Arnold W.M. Smeulders.
  • Co-Occurrence Statistics for Zero-Shot Classification. Thomas Mensink, Efstratios Gavves, and Cees G.M. Snoek.
  • Action Localization by Tubelets from Motion. Mihir Jain, Jan C. van Gemert, Patrick Bouthemy, Hervé Jégou and Cees G.M. Snoek.

The papers will be presented during the IEEE Conference on Computer Vision and Pattern Recognition, from 24-27 June 2014 in Columbus, Ohio, USA.

Arnold Smeulders, Laurens van der Maaten and I are organizing a new Ph.D. course on Computer Vision by Learning. The first edition will take place from March 25 to March 31, in Amsterdam. This ASCI course is especially meant for Ph.D. students who have basic familiarity with computer vision, image processing, and pattern recognition and want to upgrade their knowledge and machinery to the state of the art, with direct utility in their own research. The course focuses on the challenges of computer vision by learning. We address the theoretical foundations of machine learning in conjunction with computer vision and present algorithms that achieve state-of-the-art performance while maintaining efficient execution with minimal supervision. We explain and emphasize machine learning for vision tasks like concept detection with deep learning, fine-grained categorization using kernel pooling, semantic segmentation with conditional random fields, object tracking by structured SVMs, event recognition by random forests and retrieval from a single image by metric learning. We give an overview of the latest developments and future trends in the field on the basis of several recent challenges, including the TRECVID and ImageNet competitions, the leading competitions for visual search engines based on computer vision by learning, and we indicate how to obtain improvements in the near future. The course will close with an invited tutorial by the renowned Prof. Shih-Fu Chang from Columbia University, USA.

UvA-Euvision Team Presents at ImageNet Workshop

Amidst fierce competition the UvA-Euvision team participated in the new ImageNet object detection task, where the goal is to tell what object is in an image and where it is located. The organizers defined 200 basic-level categories for this task (e.g. accordion, airplane, ant, antelope and apple). The categories were carefully chosen considering different factors such as object scale, level of image clutter, average number of object instances, and several others.

The number of categories won by the University of Amsterdam – Euvision Technologies team is 130, out of 200.

The purpose of the workshop is to present the methods and results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2013. Challenge participants with the most successful and innovative entries are invited to present, and the UvA-Euvision team is amongst them.

The ImageNet 2013 Detection Task

To summarize our participation, for task 1, the ILSVRC2013 detection task on 200 classes, we submit two runs. Our runs utilize a new way of efficient encoding. The method is currently under submission, therefore we cannot include identifying details on this part. The submission utilizes selective search (Uijlings et al., IJCV 2013) to create many candidate boxes per image. These boxes are represented by extracting densely sampled color SIFT descriptors (van de Sande et al., PAMI 2010) at multiple scales. The box is then encoded with our new efficient coding. The method is faster than bag-of-words with hard assignment and outperforms it in terms of accuracy. Each box is encoded with a multi-level spatial pyramid. Training follows a standard negative mining procedure based on previous work. The first run is context-free: the 200 models are trained independently of one another. The second run utilizes a convolutional network, trained on the DET dataset, to compute a prior for the presence of an object in the image.
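Since the new encoding itself is not disclosed, the sketch below only illustrates the baseline it is compared against: bag-of-words with hard assignment, pooled over a multi-level spatial pyramid inside one candidate box. The function names, the toy codebook and the pyramid levels (1x1 plus 2x2) are assumptions chosen for illustration, not the settings of the submitted runs.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Hard assignment: index of the closest codeword (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode_box(points, descriptors, box, codebook, levels=(1, 2)):
    """Concatenate one codeword histogram per pyramid cell inside `box`.

    points: (x, y) location of each densely sampled descriptor
    box: (x0, y0, x1, y1) candidate box, e.g. from a proposal method
    levels: grid sizes; (1, 2) means a 1x1 grid plus a 2x2 grid
    """
    x0, y0, x1, y1 = box
    k = len(codebook)
    histograms = [[0.0] * (k * g * g) for g in levels]
    for (x, y), descriptor in zip(points, descriptors):
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # descriptor lies outside the box
        word = nearest_codeword(descriptor, codebook)
        for histogram, g in zip(histograms, levels):
            col = min(int((x - x0) / (x1 - x0) * g), g - 1)
            row = min(int((y - y0) / (y1 - y0) * g), g - 1)
            histogram[(row * g + col) * k + word] += 1.0
    encoding = [v for histogram in histograms for v in histogram]
    norm = math.sqrt(sum(v * v for v in encoding))
    return [v / norm for v in encoding] if norm > 0 else encoding


# Toy example: two codewords, three descriptors, one of them outside the box.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_encoding = encode_box([(1, 1), (3, 3), (10, 10)],
                          [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]],
                          (0, 0, 4, 4), toy_codebook)
```

The encoding of a box is then fed to the per-class detector; every box needs its own pass over the descriptors it contains, which is the cost an efficient coding would aim to reduce.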

The ImageNet 2013 Classification Task

For task 2, the ILSVRC2013 classification task on 1,000 classes, we submit two runs. Our showcase run performs all evaluations of the test set on an iPhone 5s at a rate of 2 images per second, whereas on the iPhone 4 it processes 1 image per 10 seconds. The results in the main run are based on the fusion of convolutional networks. The networks are comparable to the networks that won this task last year (Krizhevsky et al., NIPS 2012), where our networks have 76M free parameters. The parameters have been trained for 300 epochs on a single GPU. For training in both runs we have used the ImageNet 1,000 dataset. No (pre-)training on other datasets has been performed.
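The text does not say how the network outputs are fused. As one plausible reading, the sketch below averages the class probabilities of several networks and returns the top-ranked labels, which is also the form in which a top-5 result is usually reported. The function name `fuse_and_rank` and the toy numbers are hypothetical.

```python
def fuse_and_rank(per_network_probs, labels, top_k=5):
    """Average class probabilities over networks and return the top_k labels.

    per_network_probs: one probability list per network, all in label order
    Averaging is an assumption; the runs may have used another fusion rule.
    """
    num_networks = len(per_network_probs)
    fused = [sum(column) / num_networks for column in zip(*per_network_probs)]
    ranked = sorted(zip(labels, fused), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy example with two networks and three classes.
top_two = fuse_and_rank([[0.7, 0.2, 0.1], [0.5, 0.1, 0.4]],
                        ["cat", "dog", "car"], top_k=2)
```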

Demo on iPhone Available

At the ILSVRC2013 workshop we will release an app in the App Store performing instant interactive photo classification (take a picture, see the top 5 ImageNet scores). This app uses the same engine as our Impala app that is already available at: https://itunes.apple.com/us/app/impala/id736620048 . The Impala user interface was designed to give the experience that the iPhone works for you, but it can still be optimized. The current results reflect the match of the training data with the personal data on the iPhone.

December 7 in Sydney, Australia

The ImageNet workshop is held December 7 in Sydney. The workshop is organized in conjunction with the International Conference on Computer Vision.

AUTOMATIC IMAGE CLASSIFICATION ON YOUR PHONE

Impala by Euvision is the first app in the world that automatically sorts the photos on your phone. You do not have to manually label each and every one of them. Impala “looks” into your images and videos and recognizes what they are about.

The ACM Multimedia’13 paper on “Querying for Video Events by Semantic Signatures from Few Examples” by Masoud Mazloom, Amirhossein Habibian and Cees Snoek is now available. We aim to query web video for complex events using only a handful of video query examples, where the standard approach learns a ranker from hundreds of examples. We consider a semantic signature representation, consisting of off-the-shelf concept detectors, to capture the variance in semantic appearance of events. Since it is unknown what similarity metric and query fusion to use in such an event retrieval setting, we perform three experiments on unconstrained web videos from the TRECVID event detection task. They reveal that: retrieval with semantic signatures using normalized correlation as similarity metric outperforms a low-level bag-of-words alternative, multiple queries are best combined using late fusion with an average operator, and event retrieval is preferred over event classification when fewer than eight positive video examples are available.
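The retrieval recipe above (normalized correlation as similarity, late fusion of multiple queries with an average operator) can be sketched in a few lines. This is an illustration, not the paper's code: it assumes that "normalized correlation" means the zero-mean correlation coefficient between two signatures, and the function names and toy signatures are hypothetical.

```python
import math


def normalized_correlation(a, b):
    """Zero-mean normalized correlation between two semantic signatures."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(centered_a, centered_b))
    denominator = (math.sqrt(sum(x * x for x in centered_a))
                   * math.sqrt(sum(y * y for y in centered_b)))
    return numerator / denominator if denominator > 0 else 0.0


def rank_videos(query_signatures, collection):
    """Late fusion: score each video per query, then average the scores.

    collection: dict mapping a video id to its semantic signature
    Returns the video ids sorted from most to least similar.
    """
    scores = {}
    for video_id, signature in collection.items():
        similarities = [normalized_correlation(query, signature)
                        for query in query_signatures]
        scores[video_id] = sum(similarities) / len(similarities)
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: two query videos, two collection videos, three concepts.
ranking = rank_videos([[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
                      {"a": [0.85, 0.15, 0.15], "b": [0.1, 0.9, 0.3]})
```

In late fusion each query example is matched on its own and only the resulting scores are combined, in contrast to merging the query signatures into one before matching.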

Video2Sentence
The ACM Multimedia’13 demonstrator paper on “Video2Sentence and Vice Versa” by Amirhossein Habibian and Cees Snoek is now available. In this technical demonstration, we showcase a multimedia search engine that retrieves a video from a sentence, or a sentence from a video. The key novelty is our machine translation capability that exploits a cross-media representation for both the visual and textual modality using concept vocabularies. We will demonstrate the translations using arbitrary web videos and sentences related to everyday events. What is more, we will provide an automatically generated explanation, in terms of concept detectors, on why a particular video or sentence has been retrieved as the most likely translation.

Codemaps
The ICCV13 paper entitled “Codemaps Segment, Classify and Search Objects Locally” by Zhenyang Li, Efstratios Gavves, Koen van de Sande, Cees Snoek, and Arnold Smeulders is now also available. In this paper we aim for segmentation and classification of objects. We propose codemaps that are a joint formulation of the classification score and the local neighborhood it belongs to in the image. We obtain the codemap by reordering the encoding, pooling and classification steps over lattice elements. Unlike existing linear decompositions, which emphasize only the efficiency benefits for localized search, we make three novel contributions. As a preliminary, we provide a theoretical generalization of the sufficient mathematical conditions under which image encodings and classification become locally decomposable. As first novelty we introduce ℓ2 normalization for arbitrarily shaped image regions, which is fast enough for semantic segmentation using our Fisher codemaps. Second, using the same lattice across images, we propose kernel pooling which embeds nonlinearities into codemaps for object classification by explicit or approximate feature mappings. Results demonstrate that ℓ2-normalized Fisher codemaps improve the state-of-the-art in semantic segmentation for PASCAL VOC. For object classification the addition of nonlinearities brings us on par with the state-of-the-art, but is 3x faster. Because of the codemaps’ inherent efficiency, we can reach significant speed-ups for localized search as well. We exploit the efficiency gain for our third novelty: object segment retrieval using a single query image only.
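The reordering of pooling and classification can be illustrated with a small sketch. It assumes sum pooling over lattice cells and a linear classifier, and it computes the ℓ2 norm of an arbitrary region from a precomputed table of cell inner products. That table is one simple way to get the norm without re-pooling and is an assumption made for illustration, not necessarily the paper's formulation. All names and numbers are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def precompute(cell_encodings, weights):
    """Per-cell classifier scores and the table of cell inner products."""
    cell_scores = [dot(weights, cell) for cell in cell_encodings]
    gram = [[dot(ci, cj) for cj in cell_encodings] for ci in cell_encodings]
    return cell_scores, gram


def region_score(region, cell_scores, gram):
    """Score of an arbitrary set of cells using only precomputed values."""
    raw = sum(cell_scores[i] for i in region)
    squared_norm = sum(gram[i][j] for i in region for j in region)
    return raw / math.sqrt(squared_norm) if squared_norm > 0 else 0.0


def region_score_direct(region, cell_encodings, weights):
    """Reference: pool the cells, l2-normalize, then classify."""
    pooled = [sum(values) for values in zip(*(cell_encodings[i] for i in region))]
    norm = math.sqrt(dot(pooled, pooled))
    return dot(weights, pooled) / norm if norm > 0 else 0.0


# Toy lattice with three cells and a two-dimensional encoding.
toy_cells = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
toy_weights = [0.5, -1.0]
toy_scores, toy_gram = precompute(toy_cells, toy_weights)
```

Both routes give the same score for any region because the dot product distributes over the sum of cell encodings; the precomputed route never touches the high-dimensional encodings again, which is where the speed-up for localized search comes from.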

The ICCV13 paper entitled “Fine-Grained Categorization by Alignments” by Efstratios Gavves, Basura Fernando, Cees Snoek, Arnold Smeulders, and Tinne Tuytelaars is now available. The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.

The ACM Multimedia 2013 paper “Classifying Tag Relevance with Relevant Positive and Negative Examples” by Xirong Li and Cees Snoek is now available. Image tag relevance estimation aims to automatically determine whether what people label about images is factually present in the pictorial content. Different from previous works, which either use only positive examples of a given tag or use positive and random negative examples, we argue the importance of relevant positive and relevant negative examples for tag relevance estimation. We propose a system that selects positive and negative examples, deemed most relevant with respect to the given tag, from crowd-annotated images. While applying models for many tags could be cumbersome, our system trains efficient ensembles of Support Vector Machines per tag, enabling fast classification. Experiments on two benchmark sets show that the proposed system compares favorably against five present-day methods. Given extracted visual features, for each image our system can process up to 3,787 tags per second. The new system is both effective and efficient for tag relevance estimation.