Fine-grained Recognition

Principal Investigator(s): 
Trevor Darrell

Recognizing objects in fine-grained domains can be extremely challenging because the differences between subcategories are subtle. Discriminative markings are not only subtle but often highly localized, and traditional object recognition approaches struggle to find them under the large pose variation often present in these domains. The ability to normalize pose based on super-category landmarks can significantly improve models of individual categories when training data is limited. Previous methods have considered the use of volumetric or morphable models for faces and for certain classes of articulated objects.

Vision researchers at ICSI developed representations for poselet-based pose normalization, using both explicit warping and implicit pooling as mechanisms. Their method defines a pose-normalized similarity or kernel function that is suitable for nearest-neighbor or kernel-based learning methods. This work has been presented at the CVPR and ECCV conferences. More recently, they considered how to extend this method to rely on faster, more robust pose detectors and to explicitly incorporate convolutional network models.
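
To make the idea of a pose-normalized similarity concrete, the following is a minimal sketch and not the researchers' implementation. It assumes each image has already been reduced to a dictionary that maps a part name to a feature vector taken from that part's region. The function names, the RBF form of the per-part comparison, and the `gamma` parameter are all illustrative assumptions.

```python
import numpy as np

def pose_normalized_similarity(parts_a, parts_b, gamma=1.0):
    """Average an RBF similarity over the parts detected in both images.

    parts_a, parts_b: dict mapping a part name to a 1-D feature vector
    (a hypothetical representation chosen for this sketch).
    """
    shared = sorted(set(parts_a) & set(parts_b))
    if not shared:
        return 0.0  # no common part, so nothing can be compared
    values = []
    for name in shared:
        diff = (np.asarray(parts_a[name], dtype=float)
                - np.asarray(parts_b[name], dtype=float))
        values.append(np.exp(-gamma * np.dot(diff, diff)))
    return float(np.mean(values))

def nearest_neighbor_label(query_parts, gallery):
    """Return the label of the most similar gallery item.

    gallery: list of (parts, label) pairs.
    """
    best = max(gallery,
               key=lambda item: pose_normalized_similarity(query_parts, item[0]))
    return best[1]
```

Because each part is compared only with the same part in the other image, a change in overall pose moves the regions being described rather than changing which regions are compared. Averaging over only the shared parts is a convenience of this sketch; it is not guaranteed to yield a valid (positive semidefinite) kernel when parts are missing.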

Pose-normalization seeks to align training exemplars, either by part or for the whole object, effectively factoring out differences in pose and in viewing angle. The researchers' effort factorizes the problem of pose-normalization into (i) localizing semantic parts and (ii) learning an optimal description. For localization, they proposed a part detector based on a strongly supervised variant of the state-of-the-art deformable part model. To describe the appearance of these parts, or semantic “pooling regions,” they utilized multiple kernel learning to select the best features for each subcategory. These methods were considered alongside previously proposed models for both the localization and the description stages and rigorously studied in a comprehensive evaluation across multiple fine-grained datasets.
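
The description stage can be illustrated by combining one kernel (Gram) matrix per part or feature type with non-negative weights. The sketch below is an assumption-laden stand-in: instead of a full multiple kernel learning solver, it weights each kernel by its kernel-target alignment with binary labels, a simple heuristic that is not claimed to be the method used in this work. All function names are hypothetical.

```python
import numpy as np

def kernel_alignment(K, y):
    """Alignment between Gram matrix K and the ideal kernel y y^T.

    y holds labels in {-1, +1}; K is assumed to be non-zero.
    """
    Y = np.outer(y, y)
    return float(np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y)))

def alignment_weights(grams, y):
    """Weight each kernel by its (clipped) alignment, normalized to sum to 1."""
    scores = np.array([max(kernel_alignment(K, y), 0.0) for K in grams])
    total = scores.sum()
    if total == 0.0:
        # No kernel is informative; fall back to uniform weights.
        return np.full(len(grams), 1.0 / len(grams))
    return scores / total

def combine_kernels(grams, weights):
    """Weighted sum of Gram matrices, usable by a kernel classifier."""
    return sum(w * K for w, K in zip(weights, grams))
```

Computing the weights separately for each subcategory (one-versus-rest labels) would give each subcategory its own emphasis over parts, which is the role the text assigns to multiple kernel learning.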