New Method for Detecting Objects in Images

Wednesday, January 9, 2013

Using techniques from the field of robotics, Vision Group researchers and their colleagues have developed a method for detecting objects in images that intelligently selects which object detectors to use and which to ignore in order to complete a task within given time constraints. The paper was presented in December at the Neural Information Processing Systems Conference. Its authors are Vision Group researcher Sergey Karayev and group leader Trevor Darrell, along with Tobias Baumgartner of RWTH Aachen University and Mario Fritz of MPI for Informatics, who has worked at ICSI as a postdoc.

Figure: A sample trace of the method (see note 1).

The task of visual object recognition is to correctly localize and identify all "objects" in a photograph. In a commonly used computer vision dataset, the PASCAL Visual Object Classes (VOC) Challenge, objects are labeled by their general category, such as "aeroplane," "car," or "person." An Internet advertising company might face a similar task: identifying a specific model of car in all images uploaded to a Web site like Flickr or Instagram.

In this and other datasets, certain object classes tend to occur together: for example, buses with cars, or people with bicycles and horses. Current state-of-the-art approaches to detecting all objects of a specific type in an image take about a second to run, per object class. Since many types of objects need to be detected, processing a single image (by running all the object detectors on it) may take well over a minute.

The researchers looked at the case where there isn't enough time to run all the detectors: for example, an advertising company with a long queue of images may only have ten seconds to process an image. In this situation, a subset of object classes to detect needs to be selected so as to maximize the chance of finding the most valuable classes.

This selection of detectors is treated as a sequential decision process. Each detector gives some (imperfect) information about the presence of the corresponding object class in the image, and the method makes use of this information when selecting the next detector.
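The sequential selection can be illustrated with a small Python sketch. It is a toy, not the researchers' implementation: the detector names, costs, belief values, and the co-occurrence rule are all made up for illustration, and the "value" of a detector is simply the current belief that its class is present, whereas the actual method learns its values.

```python
def run_episode(detectors, belief, value_fn, update_belief, budget):
    """Greedily run detectors one at a time until the time budget is spent.

    detectors: dict mapping name -> (cost_in_seconds, run_fn), all hypothetical.
    belief:    dict mapping class name -> estimated probability of presence.
    """
    elapsed = 0.0
    remaining = dict(detectors)
    trace = []
    while remaining:
        # Only consider detectors that still fit in the budget.
        affordable = [n for n, (cost, _) in remaining.items()
                      if elapsed + cost <= budget]
        if not affordable:
            break
        # Pick the detector with the highest value under the current belief.
        name = max(affordable, key=lambda n: value_fn(belief, n))
        cost, run = remaining.pop(name)
        observation = run()
        # The observation changes the belief, and so the next choice.
        belief = update_belief(belief, name, observation)
        elapsed += cost
        trace.append(name)
    return trace, belief


# Illustrative assumption: finding a car makes a bus more likely.
COOCCURRENCE = {("car", True): {"bus": 0.6}}

def update_belief(belief, name, found):
    new_belief = dict(belief)
    new_belief[name] = 1.0 if found else 0.0
    new_belief.update(COOCCURRENCE.get((name, found), {}))
    return new_belief

def value_fn(belief, name):
    return belief[name]

# Simulated detectors that return fixed answers, each costing one second.
detectors = {
    "car": (1.0, lambda: True),
    "bus": (1.0, lambda: False),
    "horse": (1.0, lambda: False),
}
prior = {"car": 0.5, "bus": 0.2, "horse": 0.3}

trace, final_belief = run_episode(detectors, prior, value_fn,
                                  update_belief, budget=2.0)
print(trace)  # ['car', 'bus']
```

A static ordering based on the prior alone would run the car detector and then the horse detector. Here the car detection raises the belief in a bus, so the bus detector is run second instead.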

The decision process is trained using reinforcement (reward-based) learning, a robotics technique that is not often used in computer vision. The reward obtained after selecting a detector is defined as the area under the detection performance versus time curve that the detector contributes to the overall sequence. This results in a system for dynamic selection of detectors whose performance significantly surpasses a static selection baseline.
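One way to make the reward definition concrete is the sketch below. It assumes that performance rises linearly while a detector runs and then stays flat until the deadline, so a detector that raises performance by `gain` contributes `gain * (time_remaining - duration / 2)` to the area under the curve. The function names, the linear-rise assumption, and the numbers are illustrative and are not taken from the paper.

```python
def action_rewards(actions, deadline):
    """Area under the curve contributed by each action.

    actions: list of (duration, gain) pairs run back to back from t = 0.
    Assumes every action finishes before the deadline.
    """
    rewards = []
    start = 0.0
    for duration, gain in actions:
        remaining = deadline - start
        # Half the gain counts while the action runs (linear rise),
        # and the full gain counts from its end until the deadline.
        rewards.append(gain * (remaining - 0.5 * duration))
        start += duration
    return rewards


def area_under_curve(actions, deadline):
    """Trapezoidal area under the same curve, computed directly for comparison."""
    area, t, performance = 0.0, 0.0, 0.0
    for duration, gain in actions:
        area += (performance + 0.5 * gain) * duration
        t += duration
        performance += gain
    # Performance stays flat after the last action.
    area += performance * (deadline - t)
    return area


# Two hypothetical detectors: 2 s for a 0.2 gain, then 3 s for a 0.1 gain.
actions = [(2.0, 0.2), (3.0, 0.1)]
rewards = action_rewards(actions, deadline=10.0)
total_area = area_under_curve(actions, deadline=10.0)
```

Under these assumptions the per-action rewards add up to the total area under the curve, which is why maximizing the summed reward favors detectors that deliver gains early.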

The bottom line is this: if there are existing object detectors, but not enough time to run all of them on an image, the method can intelligently run the ones that fit within a time budget to maximize the overall multi-class detection performance.

Related Paper:

“Timely Object Recognition.” Sergey Karayev, Tobias Baumgartner, Mario Fritz, and Trevor Darrell. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), Lake Tahoe, Nevada, December 2012.


1. At each time step beginning at t=0, potential actions are considered according to their predicted value, and the maximizing action is picked. The selected action is performed and returns observations. Different actions return different observations: a detector returns a list of detections, while a scene context action simply returns its computed feature. The belief model of the system is updated with the observations, which influences the selection of the next action. The final evaluation of a detection episode is the area under the Performance vs. Time curve between given start and end times. The value of an action is the expected result of the final evaluation if the action is taken and the policy continues to be followed, which allows actions without an immediate benefit to be scheduled.