Vision Projects

Non-Parametric Video Segmentation

Sudderth and Jordan have proposed a successful non-parametric image segmentation engine that imposes a hierarchical Pitman-Yor prior on the data and trains the probabilistic model through variational learning. We propose to extend this framework to the video domain by incorporating temporal and optical-flow information. The project tackles both the scarcity of segmented, annotated video datasets and the computational demands of training, the latter via GPU computation and distributed computing (e.g., EC2, Hadoop clusters).
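
As a concrete illustration, below is a minimal sketch of the stick-breaking construction that underlies Pitman-Yor priors; the discount and concentration values and the truncation level are illustrative assumptions, not the settings of the original work.

```python
import numpy as np

def pitman_yor_stick_breaking(discount, concentration, num_atoms, rng=None):
    """Sample mixture weights from a truncated Pitman-Yor process via
    stick breaking. discount = 0 recovers the Dirichlet process; larger
    discounts give heavier-tailed segment-size distributions."""
    rng = rng or np.random.default_rng()
    weights = np.empty(num_atoms)
    remaining = 1.0
    for k in range(num_atoms):
        # k-th stick proportion: V_k ~ Beta(1 - d, alpha + (k + 1) * d)
        v = rng.beta(1.0 - discount, concentration + (k + 1) * discount)
        weights[k] = v * remaining
        remaining *= 1.0 - v
    weights[-1] += remaining   # fold leftover mass so weights sum to one
    return weights

# Illustrative draw: segment weights with a heavier tail than a DP would give
print(pitman_yor_stick_breaking(discount=0.5, concentration=1.0, num_atoms=50))
```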

Sufficient Dimensionality Reduction for Sequence Classification

The Sufficient Dimensionality Reduction (SDR) framework seeks a latent subspace that captures as much information as possible about the output labels from the covariates, formalized via conditional independence. As a particular instance of SDR, the kernel dimensionality reduction (KDR) algorithm achieves this conditional independence by minimizing the trace of the cross-covariance operator. We propose to extend the existing framework from static data to sequential data through kernel design and dynamic time warping.
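
To make the objective concrete, the sketch below evaluates a KDR-style trace criterion for a candidate projection using centered Gaussian-kernel Gram matrices; the kernel choice, bandwidth, and regularizer are illustrative assumptions. The sequential extension would swap the Gaussian kernel on static vectors for a sequence kernel, e.g., one built on dynamic time warping.

```python
import numpy as np

def centered_gram(Z, sigma):
    """Centered Gaussian-kernel Gram matrix over the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma**2))
    n = len(Z)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return H @ K @ H

def kdr_objective(B, X, Y, sigma=1.0, eps=1e-3):
    """Trace objective for a candidate subspace B (columns = directions).
    X: (n, d) covariates; Y: (n, q) responses, e.g., one-hot labels.
    Smaller values mean X @ B retains more information about Y."""
    n = len(X)
    Gx = centered_gram(X @ B, sigma)
    Gy = centered_gram(Y, sigma)
    # trace(Gy @ inv(Gx + n*eps*I)), computed via a linear solve
    return np.trace(np.linalg.solve(Gx + n * eps * np.eye(n), Gy))
```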

Probabilistic Models for Multi-View Learning and Distributed Feature Selection

Many problems in machine learning involve datasets composed of multiple independent feature sets, or views, e.g., audio and video, text and images, and multi-sensor data. In this setting, each view provides a potentially redundant sample of the class or event of interest. Techniques in multi-view learning exploit this redundancy to learn under weak supervision by maximizing the agreement, over the training data, of a set of classifiers defined one per view.
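
As a minimal sketch of one such agreement-based scheme, a two-view co-training loop is shown below; logistic regression, the confidence criterion, and the batch size are illustrative assumptions rather than the project's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xl_views, y, Xu_views, rounds=10, per_round=5):
    """Two-view co-training: each round, every view's classifier
    pseudo-labels the unlabeled examples it is most confident about,
    and those examples join the shared labeled pool."""
    Xl = [x.copy() for x in Xl_views]   # labeled features, one array per view
    Xu = [x.copy() for x in Xu_views]   # unlabeled features, one array per view
    y = y.copy()
    clfs = [LogisticRegression(max_iter=1000) for _ in range(2)]
    for _ in range(rounds):
        if len(Xu[0]) == 0:
            break
        for v in range(2):
            clfs[v].fit(Xl[v], y)
        picked, labels = [], []
        for v in range(2):
            proba = clfs[v].predict_proba(Xu[v])
            for i in np.argsort(-proba.max(axis=1))[:per_round]:
                if i not in picked:     # keep one pseudo-label per example
                    picked.append(i)
                    labels.append(clfs[v].classes_[proba[i].argmax()])
        idx = np.asarray(picked)
        for v in range(2):
            Xl[v] = np.vstack([Xl[v], Xu[v][idx]])
            Xu[v] = np.delete(Xu[v], idx, axis=0)
        y = np.concatenate([y, np.asarray(labels)])
    return clfs
```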

Visual Sense Disambiguation Using Multiple Modalities

Traditionally, object recognition requires manually labeled images of objects for training. However, additional sources of information often exist that can serve as weak labels, reducing the need for human supervision. In this project we use different modalities and information sources to help learn visual models of object categories. The first type of information we use is speech uttered by a user referring to an object. Such spoken utterances can occur in interaction with an assistant robot, when voice-tagging a photo, and so on.

Transparent Object Recognition

Despite the omnipresence of transparent objects in our daily environment, little research has been conducted on how to recognize and detect them. The difficulty of this task lies in the complex interactions between scene geometry and illuminants, which lead to changing refraction patterns. Since a complete physical model of these phenomena is currently out of reach, we instead pursue machine learning approaches to the problem.

Nonrigid Object Recognition and Tracking

In our everyday lives, we manipulate many nonrigid objects, such as clothes. In the context of personal robotics, it is therefore important to correctly recognize and track these objects so that a robot can interact with them. While tracking and recognition of rigid objects have received a lot of attention in the computer vision community, the same tasks for deformable objects remain largely unstudied. The main challenges arise from the much larger appearance variability of such objects.

Hashing Algorithms for Scalable Image Search

A common problem with large-scale data is quickly extracting the nearest neighbors to a query from a large database. In computer vision, this problem arises in content-based image retrieval, 3-D reconstruction, human body pose estimation, and object recognition, among other applications. This project focuses on developing hashing techniques for quickly and accurately performing large-scale image search.
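
As one illustrative scheme (an assumption, not necessarily the technique developed here), the sketch below implements random-hyperplane locality-sensitive hashing for cosine similarity; in practice one would keep several such tables and re-rank bucket candidates by exact distance.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Each of num_bits random hyperplanes contributes one sign bit, so
    descriptors with high cosine similarity tend to share a bucket."""

    def __init__(self, dim, num_bits=32, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_bits, dim))
        self.buckets = {}

    def _key(self, x):
        return ((self.planes @ x) > 0).tobytes()   # hashable bit pattern

    def index(self, vectors):
        for i, x in enumerate(vectors):
            self.buckets.setdefault(self._key(x), []).append(i)

    def query(self, x):
        """Candidate neighbors: database items in the query's bucket."""
        return self.buckets.get(self._key(x), [])
```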

Grounded Semantics

This project explores how to define the meaning of prepositions using visual data. One potential application is commanding a robot to arrange objects in a room. For example, to follow the command "Put the cup there, on the front of the table," a robot must identify the target location of the cup, which it can do only if it understands the meaning of each component of the command.

Facial Image Indexing Interfaces

During a disaster, a large number of children may become separated from their families. Many of these children, especially the younger ones, may be unable or unwilling to identify themselves, making the task of reuniting them with their families especially difficult. Without a system in place for hospitals to document unidentified children and to help parents search, families could be separated for months. After Hurricane Katrina, it took six months for the last child to be reunited with her family.

Bayesian Localized Multiple Kernel Learning

Multiple kernel learning (MKL) approaches form a set of classification techniques that can easily combine information from multiple data sources, e.g., by adding or multiplying kernels. Most methods, however, are limited by their assumption of a single global weight per view. For many problems, the set of features important for discriminating between examples varies locally, motivating a Bayesian treatment of localized, per-example kernel weights.
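
The contrast can be made concrete with a short sketch, assuming a simple per-example gating; this illustrates the localized idea, not the Bayesian model itself.

```python
import numpy as np

def global_mkl_kernel(kernels, weights):
    """Standard MKL: one weight per view, K = sum_m beta_m * K_m."""
    return sum(b * K for b, K in zip(weights, kernels))

def localized_mkl_kernel(kernels, gatings):
    """Localized MKL: a per-example gating eta_m(x_i) lets the view
    weighting vary across the input space:
    K[i, j] = sum_m eta_m(x_i) * K_m[i, j] * eta_m(x_j)."""
    return sum(np.outer(e, e) * K for e, K in zip(gatings, kernels))
```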

Active Learning with Multiple Views

For many real-world object recognition tasks, a common difficulty is the high cost of generating labels for a large pool of unlabeled images. In order to learn a concept with the help of a human expert, we aim to pick only the small subset of examples that is most "helpful" to the classifier. Active learning tackles this setting by enabling a classifier to pose specific queries, chosen from an unlabeled dataset.
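
A minimal sketch of one common query strategy, margin-based uncertainty sampling, is given below; it assumes a scikit-learn-style classifier exposing predict_proba, and the criterion is an illustration rather than this project's specific multi-view strategy.

```python
import numpy as np

def uncertainty_sample(clf, X_unlabeled, batch_size=10):
    """Return indices of the unlabeled examples the classifier is least
    sure about: smallest margin between its top two class probabilities.
    These are the queries posed to the human expert."""
    proba = clf.predict_proba(X_unlabeled)
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return np.argsort(margin)[:batch_size]
```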