The CVPR 2019 paper Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on n-Spheres by Shuai Liao, Stratis Gavves and Cees Snoek is now available. Many computer vision challenges require continuous outputs, but tend to be solved by discrete classification. The reason is classification’s natural containment within a probability n-simplex, as defined by the popular softmax activation function. Regular regression lacks such a closed geometry, leading to unstable training and convergence to suboptimal local minima. Starting from this insight we revisit regression in convolutional neural networks. We observe that many continuous output problems in computer vision are naturally contained in closed geometrical manifolds, like the Euler angles in viewpoint estimation or the normals in surface normal estimation. A natural framework for posing such continuous output problems is the n-sphere, a naturally closed geometric manifold defined in the R^{(n+1)} space. By introducing a spherical exponential mapping on n-spheres at the regression output, we obtain well-behaved gradients, leading to stable training. We show how our spherical regression can be utilized for several computer vision challenges, specifically viewpoint estimation, surface normal estimation and 3D rotation estimation. For all these problems our experiments demonstrate the benefit of spherical regression. All paper resources are available at
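To make the idea of a spherical exponential mapping concrete, here is a minimal sketch in plain Python. It assumes the mapping exponentiates each raw network output and then divides by the L2 norm of the exponentiated vector, so the result lies on the unit n-sphere, in the same way softmax places its output on the simplex. The function name `spherical_exp` is hypothetical, and the abstract above does not spell out the exact form, so treat this as an illustration rather than the paper's implementation.

```python
import math

def spherical_exp(raw):
    """Map raw outputs to a point on the unit sphere (assumed form: exp, then L2-normalize)."""
    # Subtracting the maximum keeps exp() from overflowing; the shift cancels in the ratio.
    m = max(raw)
    exps = [math.exp(o - m) for o in raw]
    # L2 norm of the exponentiated vector (softmax would use the plain sum here).
    norm = math.sqrt(sum(e * e for e in exps))
    return [e / norm for e in exps]

p = spherical_exp([0.5, -1.0, 2.0])
squared_norm = sum(x * x for x in p)  # equals 1 up to floating-point rounding
```

Every coordinate produced this way is strictly positive, so the sketch only covers magnitudes; how the sign of each coordinate is recovered is outside its scope.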

The CVPR 2019 paper Dance with Flow: Two-in-One Stream Action Detection by Jiaojiao Zhao and Cees Snoek is now available. The goal of this paper is to detect the spatio-temporal extent of an action. The two-stream detection network based on RGB and flow provides state-of-the-art accuracy at the expense of a large model-size and heavy computation. We propose to embed RGB and optical-flow into a single two-in-one stream network with new layers. A motion condition layer extracts motion information from flow images, which is leveraged by the motion modulation layer to generate transformation parameters for modulating the low-level RGB features. The method is easily embedded in existing appearance- or two-stream action detection networks, and trained end-to-end. Experiments demonstrate that leveraging the motion condition to modulate RGB features improves detection accuracy. With only half the computation and parameters of the state-of-the-art two-stream methods, our two-in-one stream still achieves impressive results on UCF101-24, UCFSports and J-HMDB.
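The modulation step described above can be sketched as a simple affine transform of the RGB features. This sketch assumes the "transformation parameters" are a per-channel scale and shift derived from the flow input; the abstract does not state their exact shape, and the names `modulate`, `scale` and `shift` are hypothetical. Features are plain nested lists (one inner list per channel) to keep the example dependency-free.

```python
def modulate(rgb_features, scale, shift):
    """Apply an assumed per-channel affine transform: out[c][i] = scale[c] * rgb[c][i] + shift[c]."""
    return [
        [s * value + b for value in channel]
        for channel, s, b in zip(rgb_features, scale, shift)
    ]

# Two channels with two values each; in a real network the scale and shift
# would be produced from the flow-based motion condition, not written by hand.
rgb = [[1.0, 2.0], [3.0, 4.0]]
modulated = modulate(rgb, scale=[2.0, 0.5], shift=[0.0, 1.0])
```

With a scale of one and a shift of zero the RGB features pass through unchanged, which is one reason this kind of conditioning is easy to add to an existing appearance stream.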

The paper “Pointly-Supervised Action Localization” by Pascal Mettes and Cees Snoek has been published in the International Journal of Computer Vision. The paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals and incorporate them all into a new objective of a Multiple Instance Learning optimization. During inference, we introduce pseudo-points, visual cues from videos, that automatically guide the selection of spatio-temporal proposals. We outline five spatial and one temporal pseudo-point, as well as a measure to best leverage pseudo-points at test time. Experimental evaluation on three action localization datasets shows our pointly-supervised approach (i) is as effective as traditional box-supervision at a fraction of the annotation cost, (ii) is robust to sparse and noisy point annotations, (iii) benefits from pseudo-points during inference, and (iv) outperforms recent weakly-supervised alternatives. This leads us to conclude that points provide a viable alternative to boxes for action localization.
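As a rough illustration of how point annotations can score spatio-temporal proposals, the sketch below uses a deliberately simplified overlap: the fraction of annotated frames whose point falls inside the proposal's box. This is a stand-in and not the overlap measure defined in the paper. The data layout (dictionaries keyed by frame index) and the function names are assumptions made for this example only.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside the box (x1, y1, x2, y2), borders included."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def point_overlap(points, tube):
    """Fraction of annotated frames whose point falls inside the tube's box in that frame."""
    if not points:
        return 0.0
    hits = sum(
        1 for frame, pt in points.items()
        if frame in tube and point_in_box(pt, tube[frame])
    )
    return hits / len(points)

def best_proposal(points, tubes):
    """Index of the proposal that agrees best with the point annotations."""
    return max(range(len(tubes)), key=lambda i: point_overlap(points, tubes[i]))

# Sparse annotations: one point in frame 0 and one in frame 10.
points = {0: (5, 5), 10: (6, 6)}
tube_a = {0: (0, 0, 10, 10), 10: (0, 0, 10, 10)}
tube_b = {0: (20, 20, 30, 30), 10: (0, 0, 10, 10)}
chosen = best_proposal(points, [tube_a, tube_b])
```

In a Multiple Instance Learning setting, a score of this kind would decide which proposal in a training video is treated as the positive instance.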

This summer Qualcomm, the world leader in mobile chip design, and the University of Amsterdam, home to a world-leading computer science department, have started a joint research lab in Amsterdam, the Netherlands, to combine the best of academic and industrial research. Leading the lab are Profs. Max Welling (machine learning), Arnold Smeulders (computer vision analysis), and Cees Snoek (image categorization). The lab will pursue world-class research on computer vision and machine learning. We are looking for 3 postdoctoral researchers and 8 PhD candidates in Computer Vision and Deep Learning.

This week, Barcelona hosts the IEEE International Conference on Computer Vision. Judging from the paper titles, the focus will be on learning to recognize objects in images.

We make video search engines. With these search engines we participate in international competitions, often with excellent results. While good progress has been achieved over the past years, video search engines are not yet precise enough. We have been invited by SRI International and the University of Southern California to join the US ALADDIN program, whose goal is to develop a precise and efficient video search engine able to retrieve specific events involving people interacting with other people and objects. The ambitious goal of our project is to arrive at a video search engine capable of automatically retrieving complex events with high precision.

Within the project we have open positions for:

We will start reviewing applications on 20 December 2010 and hope to make a decision soon after that, but applications will continue to be accepted until all positions are filled.

For questions contact: Dr. Cees Snoek at cgmsnoek AT uva DOT nl

Our Pinkpop video search engine is generating some media attention, including coverage in national newspapers and on television. It is likely that more concert footage will be added in the coming weeks, so stay tuned at:

I am quite excited that the technology is finally finding its way to a broad audience on very interesting video assets, see for example the concert of Moke on Pinkpop 2008. This is my best Sinterklaas present in years.