Audio and Multimedia Projects

Multimodal Video Summarization

ICSI researchers have been working with DAC to identify and acquire datasets that are sufficient for training Automated Speech Recognition (ASR) models. They are researching and developing ASR models that are robust to noise, music, babble and reverberation. This may include, but is not limited to, the research and implementation of signal processing algorithms that remove segments of an audio stream that do not include speech.
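One common building block for removing non-speech segments is a voice activity detector. The following is a minimal illustrative sketch, not the project's actual method: it labels fixed-length frames as speech-like or not by thresholding log energy. The frame sizes and the threshold value are arbitrary assumptions for the example.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Label each frame as speech-like (True) or not (False) by
    comparing its log energy to a fixed threshold (a toy heuristic;
    real systems use far more robust features and models)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    labels = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2) + 1e-12  # small floor to avoid log(0)
        labels.append(10 * np.log10(energy) > threshold_db)
    return np.array(labels)

# Toy input: half a second of near-silence followed by a loud tone.
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
quiet = 0.001 * np.random.randn(sr // 2)
loud = 0.5 * np.sin(2 * np.pi * 440 * t)
labels = energy_vad(np.concatenate([quiet, loud]), sr)
```

In practice, a detector like this would be only a front-end pre-filter; robustness to music, babble, and reverberation requires learned models rather than a single energy threshold.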

Multimodal Feature Learning for Understanding Consumer Produced Multimedia Data

ICSI is working with LLNL on feature extraction and analytic techniques that map raw data from multiple input modalities (e.g., video, images, text) into a joint semantic space. This requires cutting-edge research in each of the modalities, as well as in the mathematical methods used to learn the semantic mappings.
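The idea of a joint semantic space can be sketched as follows: each modality gets its own projection into a shared vector space, where similarity can be measured directly. This is only an illustrative outline; the dimensions are made up, and the projection matrices here are random stand-ins for what would actually be learned from paired training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions for two modalities and the joint space.
D_IMAGE, D_TEXT, D_JOINT = 512, 300, 128

# Projection matrices: random placeholders here; in a real system these
# are trained so that paired image/text features land close together.
W_image = rng.standard_normal((D_IMAGE, D_JOINT)) / np.sqrt(D_IMAGE)
W_text = rng.standard_normal((D_TEXT, D_JOINT)) / np.sqrt(D_TEXT)

def to_joint(features, W):
    """Project raw modality features into the joint space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z)

def similarity(image_feat, text_feat):
    """Cosine similarity between items from two modalities in the joint space."""
    return float(to_joint(image_feat, W_image) @ to_joint(text_feat, W_text))

score = similarity(rng.standard_normal(D_IMAGE), rng.standard_normal(D_TEXT))
```

Once features from every modality live in the same space, cross-modal retrieval reduces to nearest-neighbor search in that space.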

Teaching Security

The Teaching Security project is providing classroom-ready materials to support high-school teachers in teaching about important cybersecurity principles, helping students understand the major vulnerabilities, why they occur, and what defensive strategies can be used. The materials focus on inquiry-based activities and hands-on interactive apps and demos that allow students to explore for themselves how cybersecurity works.

Shining Light on Non-Public Data Flows

This project looks into the usage and collection of data by programs that operate behind the scenes. The collected data and its use by a network of sellers, brokers, and marketers represents a direct privacy threat as it can be used for marketing, profiling, crime, or government surveillance, and yet consumers have little knowledge about it and no legal means to access the data. ICSI researchers are conducting surveys and experiments to determine the current status of this data and observe its effects.

Previous Work: Teaching Resources for Online Privacy Education (TROPE)

Researchers are developing classroom-ready teaching modules to educate young people about why and how to protect their privacy online, as well as a Teachers' Guide with background information, suggested lesson plans, and guidance on how to employ the modules in the classroom.

Knowledge-Aided Interface for Big Data Streams

In this collaborative project with Mod9 Technologies, researchers from ICSI's Audio and Multimedia group and ICSI's FrameNet project seek to demonstrate real-time monitoring of broadcast news streams to support a tactical operations center (TOC). A primary focus of this effort is to exploit multimedia (audiovisual data containing speech, images, and metadata such as geo-location and personal identification) and integrate it into an intuitive and informative visualization for a TOC's use.

SMASH - Scalable Multimedia content AnalysiS in a High-level language

This big data project develops tools to support researchers and developers in the task of prototyping multimedia content analysis algorithms on a large scale. Typically, scientists and engineers prefer to use high-level programming languages such as Python or MATLAB to conduct experiments, as they allow for a quick implementation of a novel idea.

Previous Work: Privacy Literacy with San Jose Public Library

ICSI researchers are collaborating with the San Jose Public Library and San Jose State University's Game Development club to develop an online tool that will help individuals understand privacy in the digital age and make informed decisions about their online activity. Beyond serving as a standard educational aid, this tool will be unbiased, acknowledging that people have many different definitions of privacy and may have different needs based on what kind of online persona they have created.

Multimodal Location Estimation

Location estimation is the task of estimating the geo-coordinates of the content recorded in digital media. The Berkeley Multimodal Location Estimation project aims to leverage the GPS-tagged media available on the web as a training set for an automatic location estimator. The idea is that visual and acoustic cues can narrow down the possible recording location for a given image, video, or audio track. We also investigate the human baseline for location estimation, i.e., how well humans perform in comparison to a computer.
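A minimal way to illustrate the approach, under strong simplifying assumptions, is nearest-neighbor matching against geo-tagged training media: a query's location is estimated as the geo-tag of its most similar training item, and accuracy is reported as great-circle error. The scalar "features" and the example coordinates below are toy placeholders, not the project's actual representation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two geo-coordinates."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def estimate_location(query_feat, tagged_media, distance):
    """1-nearest-neighbor: return the geo-tag of the most similar training item."""
    best = min(tagged_media, key=lambda item: distance(query_feat, item["feat"]))
    return best["lat"], best["lon"]

# Toy training set: one scalar "visual/acoustic cue" per geo-tagged clip.
tagged = [
    {"feat": 0.1, "lat": 37.87, "lon": -122.27},  # Berkeley
    {"feat": 0.9, "lat": 48.86, "lon": 2.35},     # Paris
]
lat, lon = estimate_location(0.15, tagged, lambda a, b: abs(a - b))
```

The haversine error between predicted and true coordinates is the usual way to score such an estimator, since raw coordinate differences are misleading near the poles and across the date line.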


Researchers are exposing the ways in which it is possible to aggregate public and seemingly innocuous information from different media and Web sites to attack the privacy of users. The project seeks to help users, particularly younger ones, understand the privacy implications of the information they share publicly on the Internet and to help them understand what control they can exercise over it.

Video Concept Detection

Massive numbers of video clips are generated daily on many types of consumer electronics and uploaded to the Internet. In contrast to videos produced for broadcast or from planned surveillance, the "unconstrained" video clips produced by anyone with a digital camera present a significant challenge for manual as well as automated analysis. Such clips can depict any scene or event and generally have limited quality control.