Speech Projects

Automatic Recognition of Camera Speech (ARCS)

In this ICSI project, researchers are working to improve speech recognition on noisy, often distorted audio recorded by the body cameras of working police officers during traffic stops. This is part of a larger project at Stanford to extract information from these data. The Stanford project focuses on analyzing the interactions between officers and the communities they serve, in the hope that the results can help transform the relationship between the police and communities, produce solid data on officer-community interaction, and inform officer training programs.

Word Bug

ICSI Speech researchers are working with Versame to develop methods for analyzing speech directed at infants and toddlers, in order to provide better measures of the lexical stimulation these children receive. The initial project focuses on counting speech units in unrestricted audio, where the likely speech units are syllables or words.
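As a rough illustration of what counting speech units from audio can involve, the sketch below counts bursts of short-time energy as a crude stand-in for syllable-like units. It is not the method used in the project, which the description above does not specify. The function names, the frame length, and the threshold ratio are all assumptions chosen for the example, and the input is a synthetic signal rather than real speech.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def count_energy_bursts(samples, frame_len, threshold_ratio=0.5):
    """Count runs of consecutive frames whose energy exceeds a threshold.

    The threshold is a fraction of the loudest frame (an assumption made
    for this sketch; real recordings would need something more robust).
    """
    energies = frame_energies(samples, frame_len)
    if not energies or max(energies) == 0:
        return 0
    threshold = threshold_ratio * max(energies)
    bursts = 0
    above = False
    for energy in energies:
        if energy > threshold and not above:
            bursts += 1  # a new run of loud frames starts here
        above = energy > threshold
    return bursts


def synthetic_bursts(n_bursts, burst_len=800, gap_len=800,
                     freq=200.0, sample_rate=8000):
    """Build a toy signal: tone bursts separated by silence."""
    samples = []
    for _ in range(n_bursts):
        samples.extend([0.0] * gap_len)
        samples.extend(
            math.sin(2 * math.pi * freq * t / sample_rate)
            for t in range(burst_len)
        )
    samples.extend([0.0] * gap_len)
    return samples


signal = synthetic_bursts(3)
burst_count = count_energy_bursts(signal, frame_len=80)
```

On clean synthetic input like this, the burst count matches the number of tone bursts. Unrestricted recordings contain background noise, overlapping talkers, and varying loudness, which is why a simple threshold of this kind would not be sufficient in practice.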

Deep and Wide Learning for Automatic Speech Recognition

In this project, speech researchers are looking at trade-offs between two approaches to automatic speech recognition (ASR): signal processing of multiple acoustic features vs. using simpler features and relying on machine learning algorithms to replace feature engineering. The goal is not only to improve accuracy for difficult examples, but also to learn about the computational consequences for high-performance computing.

How Does Deep Learning Improve Speech Recognition Accuracy?

The short-term goal of this project is to understand in a deep, quantitative way why the methodology used in nearly all speech recognizers is so brittle. The long-term goal is to leverage this understanding by developing less brittle methodology that will enable more accurate speech recognition with a wider scope of applicability.

COrtical Separation Models for Overlapping Speech (COSMOS)

In this collaborative project among ICSI, UCSF, and Columbia, researchers are measuring brain activity to understand in detail how human listeners are able to separate and understand individual speakers when more than one person is talking at the same time. This information can then be used to design automatic systems capable of the same feat.

Towards Modeling Human Speech Confusions in Noise

Researchers are studying how background noise and speaking rate affect the ability of humans to recognize speech. In this project, they evaluate components of a model of human speech perception. Researchers look at the effect of incorporating spectro-temporal filters, which operate in the human auditory cortex and are sensitive to particular modulations in auditory frequency. The results from this project will improve our understanding of how humans perceive sound, and they could be used to improve artificial systems for speech processing, such as hearing aids.
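To make the idea of a spectro-temporal filter concrete, the sketch below builds a small two-dimensional Gabor kernel over a time-frequency patch. A Gabor function is one common choice for such filters in the literature; the description above does not say which filter family the project uses, so this is an assumption. The function names, the kernel size, and the envelope width are hypothetical values chosen for illustration.

```python
import math


def gabor_kernel(n_time, n_freq, temporal_mod, spectral_mod):
    """Real 2D Gabor kernel, indexed as kernel[freq][time].

    temporal_mod and spectral_mod are modulation frequencies in cycles
    per time frame and cycles per frequency bin, respectively.
    """
    # Assumption: the Gaussian envelope spans about +/- 3 sigma.
    sigma_t = n_time / 6.0
    sigma_f = n_freq / 6.0
    center_t = (n_time - 1) / 2
    center_f = (n_freq - 1) / 2
    kernel = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            dt = t - center_t
            df = f - center_f
            envelope = math.exp(
                -0.5 * ((dt / sigma_t) ** 2 + (df / sigma_f) ** 2)
            )
            carrier = math.cos(
                2 * math.pi * (temporal_mod * dt + spectral_mod * df)
            )
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(patch, kernel):
    """Inner product of a spectrogram patch with the kernel."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )


# A kernel tuned to a purely temporal modulation of 0.25 cycles per frame.
kernel = gabor_kernel(9, 9, temporal_mod=0.25, spectral_mod=0.0)

# Toy patches: one modulated at the tuned rate, one with no modulation.
matched_patch = [
    [math.cos(2 * math.pi * 0.25 * (t - 4)) for t in range(9)]
    for _ in range(9)
]
flat_patch = [[1.0] * 9 for _ in range(9)]

matched = filter_response(matched_patch, kernel)
flat = filter_response(flat_patch, kernel)
```

The kernel responds more strongly to the patch whose modulation matches its tuning than to the unmodulated patch, which is the sense in which such a filter is sensitive to particular modulations.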


SWORDFISH

Researchers are developing ways to find spoken phrases in audio from multiple languages. A working group, called SWORDFISH, includes scientists from ICSI, the University of Washington, Northwestern University, Ohio State University, and Columbia University. The acronym expands to a rough description of the effort: Spoken WOrdsearch with Rapid Development and Frugal Invariant Subword Hierarchies.

Project OUCH - Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition)

Project OUCH has been completed, and the final report is available here.

The central idea behind this project is that if we want to improve recognition performance through acoustic modeling, then we should first quantify how the current best model, the hidden Markov model (HMM), fails to adequately model speech data and how these failures impact recognition accuracy. We are undertaking a diagnostic analysis that is an essential component of statistical modeling but, for various reasons, has been largely ignored in the field of speech recognition. In particular, we believe that previous attempts to improve upon the HMM have largely failed because this diagnostic information was not readily available. In our initial research, we are using simulation and a novel sampling process to generate pseudo test data that deviate from the HMM in a controlled fashion. These processes allow us to generate pseudo data that, at one extreme, agree with all of the model's assumptions and, at the other extreme, deviate from the model in exactly the way real data do. In between, we precisely control the degree of data/model mismatch. By measuring recognition performance on these pseudo test data, we are able to quantify the effect of this controlled data/model residual on recognition accuracy.
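The first extreme described above, pseudo data that agree with all of the model's assumptions, can be pictured as data drawn directly from the model. The sketch below samples a state sequence and observations from a toy HMM with one-dimensional Gaussian emissions. It is only an illustration of that one idea, not the project's simulation or sampling procedure, and the two-state model and all of its parameters are hypothetical.

```python
import random


def sample_hmm(n_frames, initial, transitions, means, stdevs, seed=0):
    """Draw a state path and observations from a toy Gaussian-emission HMM.

    Each observation depends only on the current state, and each state
    depends only on the previous state, so the output satisfies the HMM's
    independence assumptions by construction.
    """
    rng = random.Random(seed)
    states = list(range(len(initial)))
    state = rng.choices(states, weights=initial)[0]
    path = []
    frames = []
    for _ in range(n_frames):
        path.append(state)
        frames.append(rng.gauss(means[state], stdevs[state]))
        state = rng.choices(states, weights=transitions[state])[0]
    return path, frames


# Hypothetical left-to-right model: state 0 may move on to state 1,
# and state 1 never returns to state 0.
initial = [1.0, 0.0]
transitions = [
    [0.9, 0.1],
    [0.0, 1.0],
]
means = [0.0, 5.0]
stdevs = [1.0, 1.0]

path, frames = sample_hmm(50, initial, transitions, means, stdevs, seed=1)
```

Real speech frames are correlated in ways this generative process cannot reproduce, which is the kind of data/model mismatch the project sets out to measure.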

Speaker Recognition

This project is concerned with the discovery of highly speaker-characteristic behaviors ("speaker performances") for use in speaker recognition and related speech technologies. The intention is to move beyond the usual low-level, short-term spectral features that dominate speaker recognition systems today, focusing instead on higher-level sources of speaker information, including idiosyncratic word usage and pronunciation, prosodic patterns, and vocal gestures.

Robust Automatic Transcription of Speech

This DARPA-funded program seeks to significantly improve the accuracy of several speech processing tasks (speech activity detection, speaker identification, language identification, and keyword spotting) for degraded audio sources. As part of the SRI Speech Content Extraction from Noisy Information Channels (SCENIC) Team, we are working primarily on feature extraction (drawing on our experience with biologically motivated signal processing and machine learning) and speech activity detection (drawing on our experience with speech segmentation).

Funding provided by DARPA.