Featured Research: Multimedia



A team of researchers spanning the Speech, Vision, and Networking groups is collaborating on ways to extract meaning from the vast amounts of multimedia data freely available online — a dataset that includes hours of new videos and thousands of new photos uploaded to the Internet every minute. This unconstrained data — videos and images unregulated for quality, size, or content — presents challenges for techniques known to recognize sounds and images successfully in laboratory conditions. But by analyzing multiple modalities — by combining, for example, audio processing techniques with image analysis — the researchers hope to design methods for creating meaning out of the enormous amounts of data available.

"The only way to solve these problems is to be as open as the dataset," said Gerald Friedland, a senior researcher in the Speech Group who leads ICSI's multimedia efforts. "It's about taking everything you can into account."

The team works closely with UC Berkeley's Parallel Computing Laboratory, which provides new algorithmic ideas that help the researchers deal with the huge quantity of data. ParLab is funded in part by Microsoft and Intel.

ICSI researchers have been working on acoustic recognition since 1988, and on computer vision since 2008. But recent efforts are the first that aim to combine acoustic and visual recognition in order to extract meaning from data.


Using both modalities, the team is designing ways to identify automatically where consumer-produced videos were taken. The team has applied ICSI's speaker diarization system to audio tracks taken from a dataset of unconstrained videos, building audio profiles of the 18 cities where the videos were taken. The speaker diarization system is traditionally used on audio tracks containing human speech; it identifies who spoke when. But the team found that the system can identify where videos were shot by detecting subtle cues even in audio tracks that do not contain human speech.

"It's hard for humans to do this task," said Howard Lei of the Speech Group, "but our algorithms were able to pick things up" that a human might not. Future work, to be supported by the National Science Foundation, will look at how accurately humans are able to place the videos.

When the researchers excluded videos submitted by the same user, the system correctly identified the city for almost 68 percent of the videos. When videos from the same users were included, the system correctly identified the city for almost 80 percent of the videos.

The current work is supported by the National Geospatial-Intelligence Agency University Research Initiatives Program.

It's unclear exactly what sounds the system uses to identify cities. While the researchers noticed some audio trends specific to certain cities (for example, the sound of birds is prominent in videos from San Francisco, while the sound of trains is prominent in videos from Tokyo), they "still don't know which features of the audio this system picks up on," said Jaeyoung Choi of the Speech Group.

Choi has begun to add visual recognition techniques to the system to improve accuracy. "Visual [analysis] is more straightforward" than audio analysis, he said. The researchers will analyze visual elements such as which way lines are oriented, which might suggest sea lines or tall buildings. They will also analyze what textures appear in videos, which helps to identify pavement, grass, and other features, and how color is distributed.

Choi will use millions of photos pulled from the Internet in order to develop visual profiles for different cities around the world. The researchers will compare a frame from the video that they are trying to place against the cities' profiles to find which profile matches the frame most closely.

The researchers decided to pursue this technique — called nearest-neighbor matching — rather than trying to detect landmarks unique to particular cities. "In terms of research value, landmark recognition has already been done," said Choi, while identifying scenes based on visual elements "is more of an emerging field."
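As a rough illustration of nearest-neighbor matching, the Python sketch below compares one frame's color distribution against per-city profiles and picks the closest one. The color-histogram feature, the Euclidean distance, the function names, and the toy pixel data are all assumptions made for illustration. The article does not say how ICSI's system represents profiles or measures closeness, and real profiles would also draw on line orientation and texture.

```python
import math


def color_histogram(pixels, bins=4):
    """Build a normalized histogram over coarsely quantized RGB values.

    `pixels` is a list of (r, g, b) tuples with channels in 0..255.
    """
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [count / total for count in hist]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_city(frame_features, city_profiles):
    """Return the city whose profile vector lies closest to the frame."""
    return min(
        city_profiles,
        key=lambda city: euclidean(frame_features, city_profiles[city]),
    )


# Hypothetical toy profiles: one mostly green city, one mostly gray city.
green_pixels = [(30, 200, 40)] * 10
gray_pixels = [(128, 128, 128)] * 10
city_profiles = {
    "city_a": color_histogram(green_pixels),
    "city_b": color_histogram(gray_pixels),
}

# A frame that is 80 percent green and 20 percent gray.
frame_pixels = [(20, 210, 30)] * 8 + [(140, 140, 140)] * 2
best_match = nearest_city(color_histogram(frame_pixels), city_profiles)
print(best_match)
```

In this toy setup the mostly green frame lands nearest the green profile. With millions of photos, a practical system would need a far richer feature vector and an efficient search structure.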



Researchers are also working toward a system that can detect concepts in videos: one that can, for example, search large collections of videos for those that match statements like "feeding an animal." In the ALADDIN program, funded by IARPA, teams from institutions around the world are building a concept detection system. IARPA has provided the teams with tens of thousands of consumer-produced videos, some of which are labeled as belonging to one of 15 categories. Given the labeled examples, the challenge is to find videos that belong in any of the 15 categories from a set of about 50,000 unlabeled videos. ICSI researchers, working closely with SRI, Carnegie Mellon University, and other research institutions, are using acoustic analysis in two main efforts toward the goal of detecting concepts in these videos.

The first system involves what Robert Mertens calls "holistic analysis." Mertens is visiting ICSI from Germany to work full time on the project, which is led by Friedland. The researchers built a model for each category based on all acoustic features from all videos labeled as belonging to that category. "We try to find and learn configurations of audio features that indicate that a video belongs to a category," Mertens said. The team also tracked those features that occurred frequently in all videos, which aren't helpful in determining which category a video belongs to. While the process works, it provides little insight into how it works. Mertens said the method is like "looking into a black box": researchers aren't able to say what aspects of a video their system identifies as belonging to a certain category.

Since one goal of the project is to be able to explain how a system gets its results, Mertens and the team also used ICSI's speaker diarization system to group the videos' audio tracks into segments of similar sounds, developing a profile for each category based on the sounds that frequently occur in the videos belonging to that category. The speaker diarization system is traditionally used to identify who spoke when in audio tracks that contain speech. It usually takes into account only segments of sound that contain speech and that last for two seconds or longer. The team modified the system so that it analyzes sounds that are not speech as well as much shorter segments of sound, such as drum beats.

The modified system analyzed the videos labeled as belonging to a category and identified the sounds that best represented the category, based on how frequently the sounds occurred in each video. The videos without labels were then searched to see which matched the profiles. The sounds are essentially treated as words: if, for example, the word "thoracic" appears frequently in a book, it's plausible the book is medical in nature. Similarly, if a particular sound occurs frequently in a video's audio track, the video may belong to a category that contains other videos with the same sound. The team found 300 sounds that could be used to identify which category a video belongs to.
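The sounds-as-words idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's implementation: it assumes each audio segment has already been assigned a sound label (the names "applause", "bells", and so on are hypothetical), builds a relative-frequency profile per category, and scores an unlabeled video by how much its sound frequencies overlap with each profile.

```python
from collections import Counter


def sound_frequencies(sound_labels):
    """Map each sound label to its relative frequency in the list."""
    counts = Counter(sound_labels)
    total = sum(counts.values())
    return {sound: count / total for sound, count in counts.items()}


def build_profiles(labeled_videos):
    """Pool the sounds of all videos in a category into one profile.

    `labeled_videos` is a list of (category, [sound labels]) pairs.
    """
    pooled = {}
    for category, sounds in labeled_videos:
        pooled.setdefault(category, []).extend(sounds)
    return {cat: sound_frequencies(sounds) for cat, sounds in pooled.items()}


def score(video_sounds, profile):
    """Dot product of the video's and the profile's sound frequencies."""
    freqs = sound_frequencies(video_sounds)
    return sum(freq * profile.get(sound, 0.0) for sound, freq in freqs.items())


def classify(video_sounds, profiles):
    """Return the category whose profile overlaps most with the video."""
    return max(profiles, key=lambda cat: score(video_sounds, profiles[cat]))


# Hypothetical labeled examples.
labeled = [
    ("wedding", ["applause", "bells", "applause", "speech"]),
    ("concert", ["guitar", "drums", "applause", "guitar"]),
    ("wedding", ["bells", "speech"]),
]
profiles = build_profiles(labeled)

unlabeled_video = ["bells", "applause", "speech", "bells"]
predicted = classify(unlabeled_video, profiles)
print(predicted)
```

Because the decision rests on per-sound frequencies, the sounds that contributed most to a match can be listed for the user, which is the kind of explanation a black-box model cannot give.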

Eventually, says Mertens, the method might allow researchers to ignore certain sounds that only incidentally occur in a category, or to ignore certain combinations of sounds. For example, if a user is searching for a video depicting a wedding, it might be helpful to include sounds of applause, but not of guitar music typically associated with hard rock concerts. Mertens's system may allow users to exclude these combinations. The system is also "more explicable," said Mertens, "because you can tell the user the system's decision is based on the frequency of occurrence of these sounds."


The multimedia work has serious implications for people who upload videos and photos to Web sites like YouTube and Flickr: if it is possible to identify where videos were taken and what they depict using visual and acoustic recognition methods, it might also be possible to find out a great deal of information about the people who have created them.

Last year, Friedland and Robin Sommer, a senior researcher in the Networking Group, found that they could identify, for example, the home addresses of people who were on vacation by extracting the longitude and latitude embedded in photos posted to Web sites like Craigslist and Flickr. Many high-end smart phones and digital cameras embed geo-tags, precise coordinates showing where a photo or video was taken. By combining geo-tags from different videos — say, one labeled "home" and one labeled "vacation" — the researchers were able to find the home addresses of people currently on extended trips.

Researchers are now turning their attention to what they can learn from analysis of video and audio tracks. In work presented at ICASSP this year and supported by the National Science Foundation, researchers used speaker recognition methods on the audio tracks of Flickr videos to determine whether they were uploaded by the same user. Lei of the Speech Group said the technique "allows us to tie together the identifications of different profiles. We're trying to raise this concern."

Lei, Choi, Adam Janin, and Friedland, all of the Speech Group, trained ICSI's speaker recognition machine on videos posted to Flickr. In laboratory conditions, with high-quality audio tracks of controlled lengths and content, the machine is good at identifying who is speaking. Consumer-created videos like those uploaded to Flickr, on the other hand, are limited to 90 seconds; of the videos used by ICSI researchers in the ICASSP work, one-third were 20 seconds or shorter. The videos may also include any sounds — passing cars or a neighbor's music; until recently, research in the field of acoustic processing has not focused on such sounds. ICSI's multimodal researchers, however, "wanted to see how established approaches deal with the random data we're getting," said Lei, and this required the analysis of audio content that lies outside the realms of speech and music, on which much research has been done.

Despite the widely varying quality and short length of the videos, researchers were able to identify about 66 percent of users.

The work showed how simple it was to link different online profiles, using just the audio track. "There are so many cues that [Web sites] can't really protect privacy," Friedland said. For example, while some Web sites, such as dating services, claim to protect users' anonymity, videos posted there may be matched to videos posted to more public profiles, such as those on YouTube.

And with billions of videos publicly accessible on the Web, millions of users could potentially be identified across various online profiles.

In other work, Choi and Friedland modified the earlier geo-tagging study, which relied on geo-coordinates extracted from YouTube videos to find the home addresses of people on vacation, who are potential victims of burglary while out of town. They found they could achieve similar results even without geo-tags. They extracted text from the tags added to videos (such as "San Francisco") and ran it through several filters to account for ambiguities. These filters, all derived from programs freely available on the Internet, helped prevent inaccuracies arising from tags that could refer to multiple places (for example, Paris, France and Paris, Texas), that contained misspellings or incorrect spacing ("sanfrancisco" for "San Francisco"), and that included words that could refer either to the name of a location or to something else entirely (the word "video" in a tag, for example, is probably not referring to Video, Brazil). While the accuracy of this method is low, it does not depend on geo-tags, which are embedded in only a fraction of the videos uploaded to the Internet. This means the method can be applied to a far larger set of videos. The researchers were thus able to positively identify the same number of potential victims as the earlier work did.
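The three kinds of tag ambiguity described above can be illustrated with a small Python sketch. Everything in it is hypothetical: the tiny gazetteer, the common-word list, the context hints, and the function names are invented for the example, and the article does not say which freely available programs the researchers actually used.

```python
# Hypothetical gazetteer: normalized place name -> candidate (place, region).
GAZETTEER = {
    "san francisco": [("San Francisco", "USA")],
    "paris": [("Paris", "France"), ("Paris", "Texas")],
    "video": [("Video", "Brazil")],
}

# Words far more likely to be ordinary tags than place names.
COMMON_WORDS = {"video", "home", "vacation"}

# Hypothetical co-occurring tags that hint at a region.
CONTEXT_HINTS = {"eiffel": "France", "louvre": "France", "rodeo": "Texas"}


def normalize(tag):
    """Lowercase a tag and collapse runs of whitespace."""
    return " ".join(tag.lower().split())


def lookup(tag):
    """Return candidate places for a tag, tolerating missing spaces."""
    key = normalize(tag)
    if key in COMMON_WORDS:
        return []
    if key in GAZETTEER:
        return GAZETTEER[key]
    # Handle incorrect spacing by comparing names with spaces removed.
    squashed = key.replace(" ", "")
    for name, places in GAZETTEER.items():
        if name not in COMMON_WORDS and name.replace(" ", "") == squashed:
            return places
    return []


def resolve(tags):
    """Keep only the tags that resolve to exactly one place."""
    hinted = {
        CONTEXT_HINTS[normalize(t)] for t in tags if normalize(t) in CONTEXT_HINTS
    }
    resolved = []
    for tag in tags:
        candidates = lookup(tag)
        if len(candidates) > 1:
            # Use the other tags on the same video to disambiguate.
            candidates = [c for c in candidates if c[1] in hinted]
        if len(candidates) == 1:
            resolved.append(candidates[0])
    return resolved


print(resolve(["sanfrancisco", "video"]))
print(resolve(["Paris", "eiffel"]))
print(resolve(["Paris"]))
```

In this sketch a tag that stays ambiguous is dropped rather than guessed, which is one plausible way to trade coverage for fewer false matches.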


The multimodal work with unconstrained data has applications beyond analysis of consumer-created content found online. Researchers are also working with a UC Berkeley team to construct a robot that can both see and hear in order to improve its capacity to deal with real-world situations. Friedland is also contributing to the Robust Automatic Transcription of Speech project, funded by DARPA, which seeks to improve the accuracy of speech processing tasks, such as speaker identification, for sources of poor quality. Friedland is developing a system that can identify the parts of an audio track that contain speech and the parts that do not. He is currently working with consumer-produced videos to train the system.