Trying to find a specific video clip on the internet can be a time-consuming affair. In many cases, videos will not have been tagged with an adequate description. Computer scientist Cees Snoek is working to make video retrieval a lot easier. He is currently developing a search engine capable of recognizing specific images. ‘I’m working to translate pixels into text.’
Cees Snoek, a staff member at the Institute for Computer Science’s Intelligent Systems Lab Amsterdam (ISLA), has spent the past few years developing a computer-based video recognition system. His efforts so far have certainly been successful: Snoek and his colleagues have consistently placed first in the annual competition in which all universities and major commercial players in the sector participate. So just how does their MediaMill video search engine work?
Four thousand distinguishing features
In order to recognize an object or setting in a photograph or video, a computer needs to know what it is looking for. This is why novel video search engines need a large number of training examples. Snoek feeds the search engine a huge quantity of video fragments that can be linked to a specific search query. The search engine then assesses each image in terms of approximately 4,000 distinguishing features, such as variations in color, texture, and shape. Based on this analysis, the search engine learns which combinations of distinguishing features are characteristic of the search query entered by the user.
The statistical model derived from this analysis, known as a concept detector, can then be utilized to search an enormous database for other images corresponding to this model. Watch the Semantic Pathfinder video.
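The article does not spell out which learner MediaMill uses, but the general idea of a concept detector — a statistical model that separates positive from negative examples in feature space, then scores unseen shots — can be sketched with made-up data. Everything below is illustrative: the tiny 8-dimensional feature vectors stand in for the ~4,000 real features, and a deliberately simple nearest-centroid model stands in for the real classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the ~4,000 low-level features (color, texture, shape
# statistics per shot); only 8 dimensions here to keep the sketch small.
N_FEATURES = 8

def extract_features(shot, rng):
    """Stand-in for real feature extraction: positive shots cluster
    around one point in feature space, negatives around another."""
    center = 1.0 if shot["label"] == "boat" else -1.0
    return center + 0.3 * rng.standard_normal(N_FEATURES)

# Labelled training fragments, as fed to the search engine
train = ([{"label": "boat"} for _ in range(50)]
         + [{"label": "not boat"} for _ in range(50)])
X = np.array([extract_features(s, rng) for s in train])
y = np.array([s["label"] == "boat" for s in train])

# The "concept detector": a statistical summary of each class
pos_centroid = X[y].mean(axis=0)
neg_centroid = X[~y].mean(axis=0)

def detect(features):
    """Score a shot: positive means closer to the 'boat' centroid."""
    return (np.linalg.norm(features - neg_centroid)
            - np.linalg.norm(features - pos_centroid))

# Apply the detector to unseen shots and rank them by score
unseen = [{"label": "boat"}, {"label": "not boat"}]
scores = [detect(extract_features(s, rng)) for s in unseen]
```

In the real system the detector is applied to every shot in an enormous database, and the highest-scoring shots are returned as search results.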
Images that correlate with the model are then presented to the user by means of what is known as the CrossBrowser. The vertical axis shows the video fragments identified by the system, while the horizontal axis displays the timeline of the video clip each fragment comes from. This feature is extremely useful, as each video clip consists of a large number of individual shots: when the image search engine finds a suitable result, the shots immediately preceding and following it also tend to match the search criteria.
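The CrossBrowser itself is an interface, but the layout it renders is easy to sketch: a ranked list of hits down one axis, and for each hit a strip of neighbouring shots from the same clip along the other. The shot index and hit list below are invented purely for illustration.

```python
# Hypothetical shot index: video id -> ordered list of shot ids
videos = {
    "news_01": [f"news_01/shot{i}" for i in range(10)],
    "sports_02": [f"sports_02/shot{i}" for i in range(10)],
}

# Ranked detector hits (the vertical axis of the CrossBrowser)
hits = [("news_01", 4), ("sports_02", 7)]

def timeline_strip(video_id, shot_idx, context=2):
    """Horizontal axis: the hit plus `context` shots on either side,
    since neighbouring shots often match the query as well."""
    shots = videos[video_id]
    lo = max(0, shot_idx - context)
    hi = min(len(shots), shot_idx + context + 1)
    return shots[lo:hi]

# One timeline strip per ranked hit
cross = [timeline_strip(v, i) for v, i in hits]
```

A user scanning one strip can thus confirm a hit and immediately collect the adjacent shots of the same scene.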
Snoek offers a demonstration. He enters the relatively simple search query ‘boat’. The program successfully identifies a large number of boats from the enormous dataset. The search results also include an offshore drilling platform and a car negotiating a flooded road. ‘As you can see, it has still made a few mistakes’, Snoek admits. ‘The software mainly focuses on texture and color, whereas people tend to focus on shape. The software hasn’t been developed to the point where it can apply this aspect as much as we’d like. On the whole, though, it is quite successful in picking out the right images.’
Search engine competitions
‘The ability to search on the basis of images rather than having to depend on textual tags is incredibly useful’, Snoek explains. This is clearly borne out by widespread interest in the problem: in addition to the ISLA search engine, some 50 teams from various research institutes, universities and companies are currently working to develop video search engines. There is even an annual competition.
Participants all use the same test set, for example an enormous quantity of video material from the Netherlands Institute for Sound and Vision archive. The objective is to carry out a specific search query as quickly and accurately as possible, such as identifying fragments that feature a kitchen. Snoek’s approach works as follows. To begin with, he labels all shots from the set as either ‘kitchen’ or ‘not kitchen’. He then divides the set into two parts, a training set and a test set. He uses the training set to learn which of the roughly 4,000 distinguishing features are characteristic of kitchen shots. This yields a concept detector, which is then applied to the test set. Finally, he verifies the model’s accuracy: in what percentage of cases did the system actually identify a kitchen rather than, say, a bathroom? He also assesses how often the computer failed to identify images of a kitchen. Watch the VideOlympics showcase video.
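The two verification questions at the end of that procedure correspond to the standard precision and recall measures. A minimal sketch, using invented ground-truth labels and detector predictions (1 = kitchen, 0 = not kitchen) rather than real competition data:

```python
# Ground truth for ten test shots, and the hypothetical detector's
# predictions on the same shots (both lists are made up).
truth       = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

pairs = list(zip(truth, predictions))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # kitchens found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # kitchens missed

# Of the shots flagged as 'kitchen', what fraction really were?
precision = tp / (tp + fp)
# Of the real kitchen shots, what fraction did the detector find?
recall = tp / (tp + fn)
```

Here the detector flags one bathroom as a kitchen and misses one real kitchen, giving a precision and recall of 0.8 each.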
A time-consuming task
The process of labelling images from a training set is incredibly labour-intensive. Snoek spent the mornings and evenings of an entire summer month labelling pictures, arriving at a total of 101 individual categories. He then assessed the correlation between the number of learning examples and the system’s performance. As it turned out, the system had relatively little difficulty finding ‘boat’ on the basis of a small number of examples. ‘Mobile phone’, however, proved to be a lot more difficult. ‘The main problem is the image background. A boat is basically a hole in the water. Mobile phones, on the other hand, can be used anywhere and are more difficult to identify.’
Snoek and his group have proven extremely successful in recognising images. However, Snoek admits their success in certain areas of the competition cannot be attributed solely to the quality of the software. ‘If you’ve spent entire summers tagging video material, as I have, you become extremely adept at recognising images; you’re trained for the job. In other words, your success in retrieval also depends on the person behind the keyboard.’
However, this learning method has its drawbacks. ‘It is becoming increasingly clear to us that learning examples derived from one dataset are not necessarily effective when applied to another dataset. For example, definitions derived from consumer photos on Flickr are often difficult to apply to images from the Institute for Sound and Vision archive.’ As yet, Snoek has not been able to determine exactly why this is the case. ‘It may have something to do with the fact that material taken from television is often filmed from the same camera positions, and lit in a specific way, while images on Flickr are much less consistent. In addition, the software tends to focus on the entire image, while humans tend to filter out a specific shape in the foreground. We’re still working to find out exactly how this process works.’
Early December will see the launch of a website showcasing the achievements of Snoek and his researchers. Visitors can search the site for images from Pinkpop music festival television broadcasts. ‘Due to copyright issues, the search function will be limited to recordings of Dutch artists.’ Users can provide feedback on the search engine’s effectiveness, benefiting both parties: users can use the site to search for video material, while at the same time helping Snoek to improve his search engine. The video search engine is available at: http://www.hollandsglorieoppinkpop.nl/
Previously I was supported by a STW VENI grant and a Fulbright scholarship.
My work as a post-doc has been sponsored by the ICES/KIS MultimediaN project. My PhD research was sponsored by the ICES/KIS MIA project and TNO.
Text largely based on interview by Edda Heinsman.