The paper Social Negative Bootstrapping for Visual Categorization was presented by Xirong Li at ACM’s International Conference on Multimedia Retrieval and is now available for download. To learn classifiers for many visual categories, obtaining labeled training examples in an efficient way is crucial. Since a classifier tends to misclassify negative examples which are visually similar to positive examples, inclusion of such informative negatives should be stressed in the learning process. However, they are unlikely to be hit by random sampling, the de facto standard in the literature. In this paper, we go beyond random sampling by introducing a novel social negative bootstrapping approach. Given a visual category and a few positive examples, the proposed approach adaptively and iteratively harvests informative negatives from a large collection of socially tagged images. To label negative examples without human interaction, we design an effective virtual labeling procedure based on simple tag reasoning. Virtual labeling, in combination with adaptive sampling, enables us to select the most misclassified negatives as the informative samples. Learning from the positive set and the informative negative sets results in visual classifiers with higher accuracy. Experiments on two present-day image benchmarks, employing 650K virtually labeled negative examples, show the viability of the proposed approach. On a popular visual categorization benchmark our precision at 20 increases by 34%, compared to baselines trained on randomly sampled negatives. We achieve more accurate visual categorization without the need of manually labeling any negatives.
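For readers curious how such a loop might look in practice, below is a minimal Python sketch of the bootstrapping idea: virtually label negatives by tag reasoning, then repeatedly train, score a fresh random batch of virtual negatives, and keep the most misclassified ones. The function names, the `train_classifier` callable, the per-image `classifier.score` value, the image-pool layout, and the sampling sizes are all illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the social negative bootstrapping loop.
# train_classifier, classifier.score, and the pool layout are assumptions.
import random

def virtual_labeling(pool, category, synonyms):
    """Keep only images whose social tags mention neither the category
    nor any of its synonyms; treat those as virtually labeled negatives."""
    banned = {category, *synonyms}
    return [img for img in pool if banned.isdisjoint(img["tags"])]

def social_negative_bootstrap(positives, pool, category, synonyms,
                              train_classifier, iterations=10,
                              candidates_per_iter=10000, selected_per_iter=100):
    negatives_pool = virtual_labeling(pool, category, synonyms)
    informative_negatives = random.sample(negatives_pool, selected_per_iter)
    classifier = train_classifier(positives, informative_negatives)
    for _ in range(iterations):
        # Adaptive sampling: score a random batch of virtual negatives and
        # keep the ones the current classifier scores highest, i.e. the
        # most misclassified, hence most informative, negatives.
        batch = random.sample(negatives_pool, candidates_per_iter)
        batch.sort(key=classifier.score, reverse=True)
        informative_negatives += batch[:selected_per_iter]
        classifier = train_classifier(positives, informative_negatives)
    return classifier
```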

As of today, my website will be accessible from the personalized URL: http://www.CeesSnoek.info

Empowering Visual Categorization with the GPU

The paper “Empowering Visual Categorization with the GPU” by Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek is now officially published in IEEE Transactions on Multimedia. In this paper, we analyze the bag-of-words model for visual categorization, the most powerful method in the literature, in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to 1) keep categorization accuracy intact, 2) decompose the problem, and 3) give the same numerical results. In the experiments on large-scale datasets, it is shown that, by using a parallel implementation on the GeForce GTX260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29%.
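As an aside for implementers: the property that makes the quantization bottleneck GPU-friendly is that nearest-word assignment decomposes into dense matrix arithmetic. The NumPy sketch below illustrates this decomposition on the CPU with made-up array sizes; it is an illustrative example of the general technique, not the CUDA kernels from the paper.

```python
# Illustrative sketch: nearest-word assignment written as dense matrix math.
import numpy as np

def quantize(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word.

    descriptors: (n, d) array of local descriptors (e.g. SIFT).
    codebook:    (k, d) array of visual words.
    Returns the index of the nearest word for each descriptor.
    """
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the dominant cost is the
    # (n, k) dot product, exactly the kind of dense operation a GPU
    # (or a GPU array library) executes efficiently and reproducibly.
    d2 = (np.sum(descriptors ** 2, axis=1, keepdims=True)
          - 2.0 * descriptors @ codebook.T
          + np.sum(codebook ** 2, axis=1))
    return np.argmin(d2, axis=1)

# Example: 5000 descriptors of dimension 128 against a 4000-word codebook.
rng = np.random.default_rng(0)
words = quantize(rng.standard_normal((5000, 128)),
                 rng.standard_normal((4000, 128)))
```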

We make video search engines. With these search engines we participate in international competitions, often with excellent results. While good progress has been achieved over the past years, the video search engines are not precise enough yet. We have been invited by SRI International and the University of Southern California to join the US ALADDIN program, whose goal is to develop a precise and efficient video search engine able to retrieve specific events involving people interacting with other people and objects. The ambitious goal of our project is to arrive at a video search engine capable of automatically retrieving complex events with high precision.

Within the project we have open positions for:

We will start reviewing applications on 20 December 2010 and hope to make a decision soon after that, but applications will continue to be accepted until all positions are filled.

For questions contact: Dr. Cees Snoek at cgmsnoek AT uva DOT nl


The forthcoming ACM Multimedia 2010 paper on Landmark Image Retrieval Using Visual Synonyms by Efstratios Gavves and Cees Snoek is now available. In this paper, we consider the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which performs assignment of words based solely on closeness in descriptor space, we focus on identifying pairs of independent, distant words – the visual synonyms – that are still likely to host image patches with similar appearance. To study this problem, we focus on landmark images, where we can examine whether image geometry is an appropriate vehicle for detecting visual synonyms. We propose an algorithm for the extraction of visual synonyms in landmark images. To show the merit of visual synonyms, we perform two experiments. We examine closeness of synonyms in descriptor space and we show a first application of visual synonyms in a landmark image retrieval setting. Using visual synonyms, we perform on par with the state-of-the-art, but with six times fewer visual words.
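As a rough illustration of the idea, the Python sketch below mines candidate synonyms by counting how often two different visual words host patches that project to approximately the same location under the estimated homography between matched landmark images. The data layout, thresholds, and function names are hypothetical and not the exact algorithm from the paper.

```python
# Hypothetical sketch of mining visual synonyms from geometrically
# aligned landmark image pairs; structures and thresholds are illustrative.
from collections import Counter
import numpy as np

def candidate_synonyms(aligned_pairs, distance_threshold=5.0, min_count=10):
    """aligned_pairs: iterable of (patches_a, patches_b, homography), where
    each patch is (x, y, word_id) and the homography maps image A to image B."""
    pair_counts = Counter()
    for patches_a, patches_b, H in aligned_pairs:
        for xa, ya, wa in patches_a:
            # Project the patch location from image A into image B.
            p = H @ np.array([xa, ya, 1.0])
            xb_proj, yb_proj = p[0] / p[2], p[1] / p[2]
            for xb, yb, wb in patches_b:
                close = np.hypot(xb - xb_proj, yb - yb_proj) < distance_threshold
                if close and wa != wb:
                    pair_counts[tuple(sorted((wa, wb)))] += 1
    # Word pairs that repeatedly host patches at the same physical spot
    # are candidate visual synonyms.
    return [pair for pair, count in pair_counts.items() if count >= min_count]
```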

And another ACM Multimedia 2010 paper titled: Crowdsourcing Rock N’ Roll Multimedia Retrieval by Cees Snoek, Bauke Freiburg, Johan Oomen, and Roeland Ordelman is also available online.

Crowdsourcing music video

In this technical demonstration, we showcase a multimedia search engine that facilitates semantic access to archival rock n’ roll concert video. The key novelty is the crowdsourcing mechanism, which relies on online users to improve, extend, and share automatically detected results in video fragments using an advanced timeline-based video player. The user feedback serves as valuable input to further improve automated multimedia retrieval results, such as automatically detected concepts and automatically transcribed interviews. The search engine has been operational online to harvest valuable feedback from rock n’ roll enthusiasts.

The ACM Multimedia 2010 paper entitled Keep Moving! Revisiting Thumbnails for Mobile Video Retrieval by Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin is available online.

Motivated by the increasing popularity of video on handheld devices and the resulting importance of effective video retrieval, this paper revisits the relevance of thumbnails in a mobile video retrieval setting. In particular, we quantified the usage of static and dynamic thumbnails for interactive video retrieval on a handheld device. Contrary to the widespread belief that screens of handheld devices are unsuited for visualizing multiple (small) thumbnails simultaneously, our results suggest that users are quite able to handle and assess multiple thumbnails, especially when they show moving images. This result suggests promising avenues for future research on the design of, and interaction with, advanced video retrieval interfaces on mobile devices. Although the limited screen real estate of handheld devices allows for less advanced video retrieval interfaces than those common for the desktop, they can still be much more complex than one would assume, especially when they rely on moving images. Therefore, when designing mobile video retrieval interfaces we recommend: keep moving!


The June issue of IEEE Computer Magazine features an article by myself and Arnold Smeulders titled “Visual-Concept Search Solved?”, which is available for download here. Interpreting the visual signal that enters the brain is an amazingly complex task, deeply rooted in life experience. Approximately half the brain is engaged in assigning a meaning to the incoming image, starting with the categorization of all visual concepts in the scene. Nevertheless, during the past five years, the field of computer vision has made considerable progress. It has done so not on the basis of precise modeling of all encountered objects and scenes—that task would be too complex and exhaustive to execute—but on the basis of combining rich, sensory-invariant descriptions of all patches in the scene into semantic classes learned from a limited number of examples. Research has reached the point where one part of the community suggests visual search is practically solved and progress has only been incremental, while another part argues that current solutions are weak and generalize poorly. We’ve done an experiment to shed light on the issue. Contrary to the widespread belief that visual-search progress is incremental and detectors generalize poorly, our experiment shows that progress has doubled on both counts in just three years. These results suggest that machine understanding of images is within reach.


Credit: PhD comics.