By 2022 there will be 45 billion cameras in the world, many of them tiny, connected and live streaming 24/7. Self-driving cars, drones and service robots are just three manifestations. The cameras may even capture video data beyond the visual spectrum. The embedded systems in which these cameras are integrated come with CPU, GPU and DSP processors powerful enough to run video understanding algorithms founded on computer vision and deep learning. This is an invitation to move away from traditional video understanding domains, like broadcast news, television archives and social media, and instead to emphasize hitherto non-mainstream video domains like surveillance, healthcare and robotics, where viewpoints are new, labeled examples are scarce and real-time spatiotemporal understanding is crucial. All these developments open up exciting research avenues for video understanding.
The new tenure tracker is expected to contribute to fundamental research in video understanding via edge computing. We anticipate that video understanding combined with embedded cameras will drive the next wave of innovation in this field. The tenure tracker is expected to have a keen interest and expertise in this area, with a focus on efficient video understanding algorithms for inference and training on edge computing hardware.
The tenure tracker is expected to acquire their own independent funding from sources such as the national funding agency NWO (e.g. VIDI), EU funding via H2020 (e.g. ERC starting grant) and industry. In terms of teaching, the tenure tracker will contribute to strengthening the curriculum of the Bachelor and Master AI and related programs such as Computer Science. The teaching load is around 30%. The tenure tracker is expected to contribute to valorization, both by engaging with the media and by applying state-of-the-art research tools to applications in society. UvA spinoffs that bring technology to industry and society are highly encouraged. Finally, the tenure tracker is expected to help with ISIS lab management.
Full vacancy and application requirements here.
The primary goal of the two PhD projects is to perform cutting edge research in computer vision and deep learning to automatically detect activities in a multi-camera streaming video environment. Activities will be enriched by person and object detection to arrive at precise descriptions. Relevant research questions are: How can we automatically detect and track activities, as well as persons and objects, in cluttered scenes? What are the underlying mechanisms for spatiotemporal reasoning in video that improve activity detection? How can we enable activity detection and scene understanding from overlapping and non-overlapping (first-person) camera viewpoints? The work is executed as part of the IARPA DIVA research program, together with SRI International, the University of Michigan and the University of Washington. The positions are based at the University of Amsterdam and we expect that candidates are willing to attend yearly project meetings in the USA.
More info at: http://bit.ly/diva2-uva
We are witnessing a revolution in machine learning with the reinvigorated usage of neural networks in deep learning, which promises a solution to cognitive tasks that are easy for humans to perform but hard to describe formally. It is intended to allow computers to acquire knowledge directly from data, without the need for humans to specify it, and to model the inherent problem as a layered composition of simpler concepts, making it possible to express complex problems with elementary operators. By not relying on handcrafted features or hard-coded knowledge, and by showing the ability to regress intricate objective functions, deep learning methods are now employed in a broad spectrum of applications from image classification to speech recognition. Deep learning achieves exceptional power and flexibility by learning to represent the task as a nested hierarchy of layers, with more abstract representations computed in terms of less abstract ones. The current resurgence is a result of breakthroughs in efficient layer-wise training, the availability of big datasets, and faster computers. Thanks to the simplified training of very deep architectures, today we can provide these algorithms with the resources they need to succeed.
A number of challenges are being raised and pursued. For instance, many deep learning algorithms have been designed to tackle supervised learning problems for a wide variety of tasks; how to reliably solve unsupervised learning problems with a similar degree of success is an important issue to address. Another key research area is working successfully with smaller datasets, focusing on how we can take advantage of large quantities of unlabeled examples together with a few labeled samples. Deep agents may play a more significant role in hybrid decision systems where other machine learning techniques are used to address the reasoning, bridging the gap between data and application decisions. We expect deep learning to be applied to increasingly multi-modal problems with more structure in the data, opening application domains in robotics and data mining.
This special issue of the high-impact IEEE Signal Processing Magazine seeks to provide a venue accessible to a wide and diverse audience to survey recent R&D advances in learning, including deep learning and beyond. Interested authors are asked to prepare a white paper first, based on the instructions and schedule outlined below.
Topics of Interest include (but are not limited to):
- Advanced deep learning techniques for supervised learning
- Deep learning for unsupervised & semi-supervised learning
- Online, reinforcement, incremental learning by deep models
- Domain adaptation and transfer learning with deep networks
- Deep learning for spatiotemporal data and dynamic systems
- Visualization of deep features
- New zero- and one-shot learning techniques
- Advanced hashing and retrieval methods
- Software and specialized hardware for deep learning
- Novel applications and experimental activities
White papers are required, and full articles are invited based on the review of white papers. The white paper format is up to 4 pages in length, including proposed article title, motivation and significance of the topic, an outline of the proposed paper, and representative references; an author list, contact information and short bios should also be included. Articles submitted to this issue must be of a tutorial and overview/survey nature, written in a style accessible to a broad audience, and have significant relevance to the scope of the special issue. Submissions should not have been published or be under review elsewhere, and should be made online at http://mc.manuscriptcentral.com/sps-ieee. For submission guidelines, visit http://signalprocessingsociety.org/publicationsresources/
- Prof. Fatih Porikli, Australian National University, firstname.lastname@example.org
- Dr. Shiguang Shan, Chinese Academy of Sciences, email@example.com
- Prof. Cees Snoek, University of Amsterdam, firstname.lastname@example.org
- Dr. Rahul Sukthankar, Google, email@example.com
- Prof. Xiaogang Wang, Chinese University of Hong Kong, firstname.lastname@example.org
The ECCV 2016 paper Spot On: Action Localization from Pointly-Supervised Proposals by Pascal Mettes, Jan van Gemert and Cees Snoek is now available. We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier demanding carefully annotated box annotations at train time. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at tinyurl.com/hollywood2tubes.
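As a rough illustration of how point annotations can be matched against action proposals, the sketch below scores a proposal by the fraction of annotated points that fall inside its boxes. This is a simplified stand-in, not the overlap measure defined in the paper; the names `point_overlap` and `best_proposal` and the tube representation (a mapping from frame index to box) are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed format
Point = Tuple[float, float]               # (x, y)
Tube = Dict[int, Box]                     # frame index -> box of the proposal


def point_overlap(tube: Tube, points: Dict[int, Point]) -> float:
    """Fraction of annotated points lying inside the proposal's box
    on the same frame. Hypothetical measure, for illustration only."""
    if not points:
        return 0.0
    hits = 0
    for frame, (x, y) in points.items():
        box = tube.get(frame)
        if box is None:
            continue  # the proposal does not cover this annotated frame
        x1, y1, x2, y2 = box
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return hits / len(points)


def best_proposal(tubes: List[Tube], points: Dict[int, Point]) -> int:
    """Index of the proposal that agrees best with the point annotations."""
    scores = [point_overlap(t, points) for t in tubes]
    return max(range(len(tubes)), key=lambda i: scores[i])
```

In a Multiple Instance Learning setting, a score of this kind could guide which proposal in a video is treated as the positive instance during training; how the paper combines it with the classifier objective is not shown here.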
The ECCV 2016 paper Online Action Detection by Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees G. M. Snoek and Tinne Tuytelaars is now available. In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.
The BMVC2016 paper Video Stream Retrieval of Unseen Queries using Semantic Memory by Spencer Cappallo, Thomas Mensink and Cees Snoek is now available. Retrieval of live, user-broadcast video streams is an under-addressed and increasingly relevant challenge. The on-line nature of the problem requires temporal evaluation and the unforeseeable scope of potential queries motivates an approach which can accommodate arbitrary search queries. To account for the breadth of possible queries, we adopt a no-example approach to query retrieval, which uses a query’s semantic relatedness to pre-trained concept classifiers. To adapt to shifting video content, we propose memory pooling and memory welling methods that favor recent information over long past content. We identify two stream retrieval tasks, instantaneous retrieval at any particular time and continuous retrieval over a prolonged duration, and propose means for evaluating them. Three large scale video datasets are adapted to the challenge of stream retrieval. We report results for our search methods on the new stream retrieval tasks, as well as demonstrate their efficacy in a traditional, non-streaming video task.
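The two ideas in this abstract, scoring a stream without examples and favoring recent content, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's memory pooling or memory welling: the exponential decay rule, the `decay` parameter, and the function names `query_score` and `decayed_pooling` are all hypothetical, and the relatedness values are assumed to come from some external semantic embedding.

```python
from typing import Dict, List


def query_score(concept_scores: Dict[str, float],
                relatedness: Dict[str, float]) -> float:
    """Score one frame for a query as the relatedness-weighted sum of
    pre-trained concept classifier outputs (no query examples needed)."""
    return sum(relatedness.get(c, 0.0) * s for c, s in concept_scores.items())


def decayed_pooling(frame_scores: List[float], decay: float = 0.5) -> List[float]:
    """Running stream score that weights recent frames more than old ones:
    memory_t = decay * memory_{t-1} + (1 - decay) * score_t.
    An assumed exponential-decay rule, chosen for illustration."""
    memory = 0.0
    pooled = []
    for s in frame_scores:
        memory = decay * memory + (1.0 - decay) * s
        pooled.append(memory)
    return pooled
```

With a rule like this, a stream that matched the query long ago gradually drops in rank, while one that matches now rises quickly; the `decay` value controls how fast older content is forgotten.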
The best paper of ICMR2016 entitled “Pooling Objects for Recognizing Scenes without Examples” by Svetlana Kordumova, Thomas Mensink and Cees Snoek is now available. In this paper we aim to recognize scenes in images without using any scene images as training data. Different from attribute based approaches, we do not carefully select the training classes to match the unseen scene classes. Instead, we propose a pooling over ten thousand off-the-shelf object classifiers. To steer the knowledge transfer between objects and scenes we learn a semantic embedding with the aid of a large social multimedia corpus. Our key contributions are: we are the first to investigate pooling over ten thousand object classifiers to recognize scenes without examples; we explore the ontological hierarchy of objects and analyze the influence of object classifiers from different hierarchy levels; we exploit object positions in scene images; and we demonstrate a new scene retrieval scenario with complex queries. Finally, we outperform attribute representations on two challenging scene datasets, SUNAttributes and Places2.
The ICMR2016 paper “The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection” by Pascal Mettes, Dennis Koelma and Cees Snoek is now available. This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, which all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pre-trained with our reorganized hierarchy i) improve over standard pre-training, ii) are complementary among different reorganizations, iii) maintain the benefits of fusion with other modalities, and iv) lead to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/imagenetshuffle.