Speech Projects

Towards Modeling Human Speech Confusions in Noise

Researchers are studying how background noise and speaking rate affect the ability of humans to recognize speech. In this project, they evaluate components of a model of human speech perception. Researchers look at the effect of incorporating spectro-temporal filters, which operate in the human auditory cortex and are sensitive to particular modulation patterns in time and auditory frequency. The results from this project will improve our understanding of how humans perceive sound, and they could be used to improve artificial systems for speech processing, such as hearing aids.
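
As a rough illustration of this kind of experimental setup (not the project's actual code), the sketch below mixes a clean utterance with noise at a controlled signal-to-noise ratio, the sort of stimulus manipulation used to probe recognition in noise; the function name and parameters are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (in dB), then add it to `speech`.

    Both inputs are 1-D float arrays at the same sampling rate; the noise is
    tiled or truncated to match the length of the speech signal.
    """
    # Match lengths by tiling/truncating the noise.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Compute the gain needed to hit the target SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / (noise_power + 1e-12))

    return speech + gain * noise

# Example: degrade a clean utterance with babble noise at 0 dB SNR.
# noisy = mix_at_snr(clean_utterance, babble_noise, snr_db=0.0)
```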

SWORDFISH

Researchers are developing ways to find spoken phrases in audio from multiple languages. A working group, called SWORDFISH, includes scientists from ICSI, the University of Washington, Northwestern University, Ohio State University, and Columbia University. The acronym expands to a rough description of the effort: Spoken WOrdsearch with Rapid Development and Frugal Invariant Subword Hierarchies.

Project Ouch - Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition)

The central idea behind this project is that if we want to improve recognition performance through acoustic modeling, then we should first quantify how the current best model — the hidden Markov model (HMM) — fails to adequately model speech data and how these failures impact recognition accuracy. We are undertaking a diagnostic analysis that is an essential component of statistical modeling but, for various reasons, has been largely ignored in the field of speech recognition. In particular, we believe that previous attempts to improve upon the HMM have largely failed because this diagnostic information was not readily available. In our initial research, we are using simulation and a novel sampling process to generate pseudo test data that deviate from the HMM in a controlled fashion. These processes allow us to generate pseudo data that, at one extreme, agree with all of the model's assumptions and, at the other extreme, deviate from the model in exactly the way real data do. In between, we precisely control the degree of data/model mismatch. By measuring recognition performance on this pseudo test data, we are able to quantify the effect of this controlled data/model residual on recognition accuracy.
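
As a hedged sketch of the sampling idea described above, the toy example below draws observation sequences directly from a small HMM, so the resulting pseudo data satisfy all of the model's assumptions by construction; the topology, means, and variances are illustrative placeholders, not the project's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state HMM with 1-D Gaussian emissions (placeholders only).
start_probs = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.9, 0.1],
                  [0.0, 0.0, 1.0]])   # left-to-right topology
means = np.array([-1.0, 0.0, 1.0])
stds = np.array([0.5, 0.5, 0.5])

def sample_from_hmm(n_frames):
    """Draw a state sequence and observations that agree with the HMM's assumptions."""
    states, obs = [], []
    state = rng.choice(3, p=start_probs)
    for _ in range(n_frames):
        states.append(state)
        obs.append(rng.normal(means[state], stds[state]))  # conditionally independent frames
        state = rng.choice(3, p=trans[state])
    return np.array(states), np.array(obs)

states, pseudo_data = sample_from_hmm(100)
```

Replacing pieces of this sampler with statistics estimated from real speech (for example, frame-to-frame dependencies the HMM cannot express) is what lets the mismatch be dialed up in a controlled way.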

Speaker Diarization

Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions, so that, given an audio track of a meeting, the system automatically discriminates and labels the different speakers ("who spoke when?"). This entails speech/non-speech detection ("when is there speech?") as well as overlap detection and resolution ("who is overlapping with whom?"). ICSI has a long history of research in this area and has contributed repeatedly to the state of the art. Current research aims to improve the robustness and efficiency of existing approaches.
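
For illustration only, the sketch below shows the agglomerative, BIC-style clustering flavor that many diarization systems build on: segments are merged greedily while a Bayesian Information Criterion comparison favors modeling them with a single Gaussian. The models and penalty here are simplified placeholders, not the actual system.

```python
import numpy as np

def bic_merge_gain(x, y, penalty=1.0):
    """Delta-BIC style score for merging two segments of feature frames (rows = frames).

    Positive values favor a merge: a single pooled Gaussian explains both
    segments nearly as well as two separate Gaussians.
    """
    z = np.vstack([x, y])
    n_x, n_y, n_z = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Likelihood-ratio term of the classic BIC segment-merging criterion.
    lr = 0.5 * (n_z * logdet_cov(z) - n_x * logdet_cov(x) - n_y * logdet_cov(y))
    n_params = d + d * (d + 1) / 2          # one full-covariance Gaussian
    return -(lr - 0.5 * penalty * n_params * np.log(n_z))

def cluster(segments, penalty=1.0):
    """Greedy agglomeration: repeatedly merge the segment pair with the best positive gain."""
    segments = list(segments)
    while len(segments) > 1:
        gains = [(bic_merge_gain(segments[i], segments[j], penalty), i, j)
                 for i in range(len(segments)) for j in range(i + 1, len(segments))]
        best, i, j = max(gains)
        if best <= 0:
            break
        segments[i] = np.vstack([segments[i], segments.pop(j)])
    return segments
```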

Video Deduplication (Copyright Detection)

A duplicate video is a video that has the same content as another video, but the two files do not have identical binary encodings (due to editing and/or transcoding). From a social networking perspective, there is growing awareness that tools for finding others who have created mashups of, or made simple modifications to, the same multimedia data could be highly useful for connecting individuals or for identifying piracy. We therefore develop acoustic algorithms that detect video duplicates under various conditions and complement state-of-the-art visual approaches.
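
As one hypothetical example of an acoustic approach (not necessarily the algorithms developed in this project), the sketch below hashes coarse spectral-energy patterns so that re-encoded or lightly edited copies of the same soundtrack still share many hashes; frame sizes and the hashing scheme are illustrative.

```python
import numpy as np

def coarse_fingerprint(audio, sr, frame_len=4096, hop=2048, n_bands=16):
    """Return per-frame hashes built from coarse spectral band energies.

    Two files with the same underlying content but different encodings should
    share a large fraction of these hashes.
    """
    hashes = []
    for start in range(0, len(audio) - frame_len, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # Average the spectrum into a few broad bands, then keep only the
        # up/down pattern between adjacent bands (robust to volume and codec changes).
        bands = np.array_split(spectrum, n_bands)
        energies = np.array([b.mean() for b in bands])
        pattern = tuple((np.diff(energies) > 0).astype(int))
        hashes.append(hash(pattern))
    return hashes

def similarity(fp_a, fp_b):
    """Fraction of hashes shared between two fingerprints (order-insensitive)."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / max(1, min(len(a), len(b)))
```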

Multiple Stream Speech Recognition

This project has two components.

(1) Cortically-inspired speech recognition: Acoustic events such as speech exhibit distinctive spectro-temporal amplitude modulations. These types of modulations are not well captured by conventional feature extraction methods, which apply either spectral processing or temporal processing, but only one at a time.
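
As a minimal sketch of the cortically-inspired idea, assuming a log-mel spectrogram front end, the example below builds a 2-D Gabor kernel tuned to one joint spectro-temporal modulation (a ripple with a given temporal rate and spectral scale) and convolves it with the spectrogram; the kernel parameters and spectrogram shape are illustrative, not the project's actual filterbank.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(temporal_rate, spectral_scale, size=15):
    """2-D Gabor kernel tuned to one joint spectro-temporal modulation.

    `temporal_rate` and `spectral_scale` are in cycles per spectrogram sample
    along the time and frequency axes, respectively (illustrative units).
    """
    t = np.arange(size) - size // 2
    T, F = np.meshgrid(t, t)
    envelope = np.exp(-(T ** 2 + F ** 2) / (2 * (size / 4.0) ** 2))
    carrier = np.cos(2 * np.pi * (temporal_rate * T + spectral_scale * F))
    return envelope * carrier

# Placeholder log-mel spectrogram (bands x frames); in practice this comes from
# a mel filterbank applied to short-time FFT frames.
log_mel = np.random.randn(40, 200)

# Each (rate, scale) pair yields one feature map; a bank of such filters gives
# features sensitive to joint spectro-temporal modulations.
feature_map = convolve2d(log_mel, gabor_kernel(0.10, 0.05), mode="same")
```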

The "Poor Quality" Meetings Corpus

Go to any meeting or lecture with the younger generation of researchers, business people, or government employees, and there is a laptop or smart phone at every seat. Each laptop and smart phone is capable not only of recording and transmitting video and audio in real time, but also of advanced analytics on the data (e.g., speech recognition, speaker identification, face detection, etc.). Yet this rich resource goes largely unexploited, mostly because there are not enough good training data for machine learning algorithms.

Speaker Recognition

This project is concerned with the discovery of highly speaker-characteristic behaviors ("speaker performances") for use in speaker recognition and related speech technologies. The intention is to move beyond the low-level, short-term spectral features that dominate speaker recognition systems today, focusing instead on higher-level sources of speaker information, including idiosyncratic word usage and pronunciation, prosodic patterns, and vocal gestures.
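
As one toy example of a higher-level feature, assuming a pitch contour from any off-the-shelf pitch tracker, the sketch below summarizes a speaker's pitch behavior over an utterance rather than relying on short-term spectra; the statistics chosen are illustrative, not the project's feature set.

```python
import numpy as np

def prosodic_summary(f0, voiced):
    """Summarize an utterance's pitch contour into a few speaker-level statistics.

    `f0` is a per-frame fundamental-frequency estimate (Hz) from any pitch
    tracker; `voiced` is a boolean mask of frames where f0 is reliable.
    """
    f0_v = f0[voiced]
    log_f0 = np.log(f0_v)
    return {
        "f0_median_hz": float(np.median(f0_v)),
        "f0_range_semitones": float(12 * (np.log2(f0_v.max()) - np.log2(f0_v.min()))),
        "log_f0_std": float(np.std(log_f0)),      # pitch variability
        "voiced_fraction": float(voiced.mean()),  # rough proxy for speaking style
    }
```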

Speech Processing for Meetings

We seek to develop algorithms and systems for the recognition of speech from meetings, as well as methods for information retrieval and other applications that such recognition would make possible. Funding for this research is provided by the Swiss project IM2: Interactive Multimodal Information Management. For more information, see the IM2 Web site and ICSI's meeting recorder project page.

Robust Automatic Transcription of Speech

This DARPA-funded program seeks to significantly improve the accuracy of several speech processing tasks (speech activity detection, speaker identification, language identification, and keyword spotting) for degraded audio sources. As part of the SRI Speech Content Extraction from Noisy Information Channels (SCENIC) Team, we are working primarily on feature extraction (drawing on our experience with biologically motivated signal processing and machine learning) and speech activity detection (drawing on our experience with speech segmentation).
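
For orientation only, the sketch below shows the simplest possible form of speech activity detection, thresholding smoothed frame energies; the systems built in this program rely on far more robust, learned features for degraded channels, so this is just to fix the idea, and all parameters are illustrative.

```python
import numpy as np

def energy_sad(audio, sr, frame_ms=25, hop_ms=10, threshold_db=-35):
    """Mark frames as speech when their smoothed log energy exceeds a threshold
    relative to the loudest frame. Returns a boolean array, one entry per frame."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = []
    for start in range(0, len(audio) - frame, hop):
        e = np.mean(audio[start:start + frame] ** 2)
        energies.append(10 * np.log10(e + 1e-12))
    energies = np.array(energies)
    # Simple smoothing so brief dips inside words are not cut out.
    kernel = np.ones(5) / 5
    smoothed = np.convolve(energies, kernel, mode="same")
    return smoothed > (smoothed.max() + threshold_db)
```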