Projects

Speech

   
 

Global Autonomous Language Exploitation (GALE)

The goal of the DARPA GALE program is to develop and apply technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. Automatic processing "engines" will convert and distill the data, delivering pertinent, consolidated information in easy-to-understand forms to military personnel and monolingual English-speaking analysts in response to direct or implicit requests.

GALE consists of three major engines: Transcription, Translation and Distillation. The output of each engine is English text. The input to the transcription engine is speech and to the translation engine, text. Engines pass along pointers to relevant source language data that will be available to humans and downstream processes. The distillation engine integrates information of interest to its user from multiple sources and documents.

ICSI is currently participating in GALE as part of both the BBN team (contributing machine translation technology) and the IBM team (contributing novel feature extraction approaches to improve speech recognition).

Speech Processing for Meetings

ICSI researchers seek to develop algorithms and systems for the recognition of speech from meetings, as well as methods for information retrieval and other applications that such recognition would make possible. Funding for this research is provided by the Swiss project IM2: Interactive Multimodal Information Management. See also the IM2 website and ICSI's meeting recorder project page.

Speaker Recognition

This project is concerned with the discovery of highly speaker-characteristic behaviors ("speaker performances") for use in speaker recognition and related speech technologies. The intention is to move beyond the low-level, short-term spectral features that dominate speaker recognition systems today and to focus instead on higher-level sources of speaker information, including idiosyncratic word usage and pronunciation, prosodic patterns, and vocal gestures.
The project goal is two-fold: to conduct fundamental research to discover new speaker-distinctive features and encode them into richer, more informative speaker models; and to evaluate the utility of these feature sets and models for speaker recognition and other speech technology applications. The feature discovery efforts are necessarily exploratory and follow two tracks. The "knowledge-based" track builds on existing linguistic constructs and is guided by insights from psycholinguistics and human performance studies. The more speculative "data-driven" track seeks idiosyncratic "vocal performances": spectro-temporal patterns with high speaker-characterizing power, independent of linguistic constraints. See also the Speaker Recognition Project Page.

My Speech-to-Text (MySTT)

The MySTT ("My Speech-To-Text") project is a development effort to create a free speech recognition engine aimed at the automatic transcription of natural, large-vocabulary, human-to-human communication. It is built on GStreamer, a popular multimedia streaming framework, and on Appscio MPF, which extends GStreamer for multimedia analytics. The goal of MySTT is to be easy to extend and to interface with other products and research projects in the multimedia realm. All components, including the models, are released under open source licenses and are free to use for both research and commercial purposes.

Speech Technology for Developing Countries

ICSI researchers are developing speech recognition technologies for "emerging regions". As part of this effort, they have developed simple recognizers for Tamil, a language spoken by over 50 million people in Southeast India, where illiteracy rates hover around 50% for men and between 60% and 80% for women. Speech recognition, especially in combination with speech synthesis and compelling visual user interfaces, may be key to increasing access to technology for primarily oral communities. The researchers have designed and field tested prototypes for speech recognition applications, collectively called Open Sesame. These include a multi-modal system that accepts both voice and touch input to provide farmers and other rural community members with information on agricultural innovations and crop varieties, as recommended by local experts in Tamil Nadu. The system is one example of ICSI's capability to rapidly design and deploy low-cost speech prototypes using openly available technology.

Mutaphrase

Many natural language processing (NLP) applications implicitly or explicitly depend on content being expressed in a particular way. Thus, a process which is programmed or trained for the sequence "You weren't smart to eat fugu" will not necessarily handle the semantically equivalent paraphrase "Eating blowfish was dumb of you". The mutaphraser automatically generates variants of an input sentence using the semantics and syntax encoded in FrameNet and the lexical semantic information in WordNet. The utility of mutaphrasing is tested on various NLP applications including speech recognition, machine translation training, and machine translation evaluation.
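As a rough illustration of the lexical half of this idea only, the sketch below generates sentence variants by swapping words for synonyms. It is not the mutaphraser itself: the SYNONYMS table is a hypothetical, hand-written stand-in for WordNet lookups, the function name is invented for this example, and the FrameNet-driven syntactic rearrangement (which would be needed to reach a paraphrase such as "Eating blowfish was dumb of you") is not modeled at all.

```python
from itertools import product

# Hypothetical hand-written synonym table standing in for WordNet lookups.
SYNONYMS = {
    "eat": ["consume"],
    "fugu": ["blowfish", "pufferfish"],
}


def lexical_variants(sentence):
    """Return every variant obtained by swapping words for listed synonyms."""
    # For each word, the candidate list is the word itself plus its synonyms.
    options = [[word] + SYNONYMS.get(word.lower(), []) for word in sentence.split()]
    variants = [" ".join(choice) for choice in product(*options)]
    # product() yields the all-original combination first, so drop it.
    return variants[1:]


for variant in lexical_variants("You weren't smart to eat fugu"):
    print(variant)
```

A real system would also have to filter out variants that are ungrammatical or that change the meaning, which simple word substitution cannot guarantee.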

Multiple Stream Speech Recognition

This project has three components.

(1) Cortically-inspired speech recognition: Acoustic events such as speech exhibit distinctive spectro-temporal amplitude modulations. These modulations are not well captured by conventional feature extraction methods, which apply either spectral processing or temporal processing, but not both at once.

Recent measurements of receptive fields in the mammalian auditory cortex suggest that biological systems are highly tuned to spectro-temporal modulations. The spectro-temporal receptive fields (STRFs) of cortical cells are found to resemble 2-D spectro-temporal Gabor filters. In prior work, researchers have used 2-D Gabor filters to extract spectro-temporal features for speech recognition and speech discrimination. However, these studies have used either single streams of task-optimized features or very large multi-dimensional representations of spectro-temporal responses. There is therefore a need to explore multiple streams of spectro-temporal features for speech recognition, which may preserve the organizational map of STRFs and avoid the cumbersome computation that very large representations require.

This research aims to develop, evaluate, and incorporate multi-stream spectro-temporal features for robust speech recognition.
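To make the idea of multi-stream spectro-temporal features concrete, here is a minimal sketch of 2-D Gabor filtering of a spectrogram, with one feature stream per filter. It is an illustration under stated assumptions rather than the project's implementation: the function names, kernel sizes, modulation frequencies, and the toy spectrogram are all invented for this example, and only the real (cosine) part of the Gabor filter is used.

```python
import math


def gabor_kernel(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Real part of a 2-D Gabor filter: Gaussian envelope times cosine carrier.

    omega_f and omega_t are the spectral and temporal modulation frequencies
    (radians per frequency bin and per frame); sigma_f and sigma_t set the
    envelope widths. All values here are illustrative assumptions.
    """
    cf, ct = (n_freq - 1) / 2.0, (n_time - 1) / 2.0
    kernel = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            df, dt = f - cf, t - ct
            envelope = math.exp(-0.5 * ((df / sigma_f) ** 2 + (dt / sigma_t) ** 2))
            row.append(envelope * math.cos(omega_f * df + omega_t * dt))
        kernel.append(row)
    # Subtract the mean so the filter ignores constant energy offsets.
    mean = sum(map(sum, kernel)) / (n_freq * n_time)
    return [[value - mean for value in row] for row in kernel]


def filter_spectrogram(spec, kernel):
    """Valid-mode 2-D correlation of a (frequency x time) spectrogram."""
    kf, kt = len(kernel), len(kernel[0])
    out = []
    for f0 in range(len(spec) - kf + 1):
        row = []
        for t0 in range(len(spec[0]) - kt + 1):
            acc = 0.0
            for i in range(kf):
                for j in range(kt):
                    acc += spec[f0 + i][t0 + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Toy spectrogram (12 bins x 20 frames) containing one diagonal ripple.
spec = [[math.cos(0.6 * f - 0.4 * t) for t in range(20)] for f in range(12)]

# Two streams tuned to opposite ripple directions (hypothetical settings).
kernels = {
    "matched": gabor_kernel(7, 9, 0.6, -0.4, 2.0, 3.0),
    "opposite": gabor_kernel(7, 9, 0.6, 0.4, 2.0, 3.0),
}
streams = {name: filter_spectrogram(spec, k) for name, k in kernels.items()}
for name, stream in streams.items():
    peak = max(abs(v) for row in stream for v in row)
    print(name, round(peak, 2))
```

The stream whose filter matches the direction of the ripple responds much more strongly than the stream tuned to the opposite direction, which is the selectivity that makes separate streams informative.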

(2) Parallel processing for speech recognition: In noisy or reverberant environments, speech recognition requires more processing. A mobile device will often not be held close to the user's mouth, which degrades automatic speech recognition (ASR). For instance, in the most recent NIST evaluations, the best word error rate for multi-microphone speech recognition in a conference room was about 40%. That system used beamforming but did not yet include the techniques we propose below, which have the potential to reduce this error rate significantly at the expense of much more computational power.
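For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence; the function name and the example sentences are invented for illustration and do not come from the NIST evaluations.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("meet" -> "eat") and one deletion ("on") in five words.
print(word_error_rate("we will meet on friday", "we will eat friday"))  # 0.4
```

An error rate of 40% therefore means that, on average, two out of every five reference words are affected by a recognition error.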

A parallel processing approach that could help further is the multi-stream methodology, in which multiple signal representations are used to generate posterior probabilities of speech sound classes. These posteriors are then combined and further transformed (Gaussianized and orthogonalized) to generate input features for a statistical speech recognition engine. Multi-layer perceptrons generate the individual posterior probabilities. These methods have been used successfully with 2 to 15 streams, but we would ultimately like to work with much larger ensembles of feature generators. We will start our work using the Quicknet libraries that were developed at ICSI, parallelizing them for the approaches discussed in this proposal. We will then develop code that incorporates these libraries in a system that permits experimentation and ultimately exhibits much greater robustness for speech recognition in moderate noise and reverberation with microphones that are not head-mounted. This work is closely connected with the Berkeley ParLab.
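The combination step can be sketched for a single frame as follows. This is one plausible reading, not the project's code and not the Quicknet API: it assumes the streams are merged with a weighted geometric mean (an average of log posteriors) and that the Gaussianizing step is a logarithm. The function names and posterior values are invented, and the orthogonalizing step, which would need a decorrelating transform estimated on training data, is left out.

```python
import math

EPS = 1e-10  # floor to avoid log(0); an arbitrary illustrative choice


def combine_posteriors(stream_posteriors, weights=None):
    """Merge per-stream class posteriors with a weighted geometric mean."""
    n_streams = len(stream_posteriors)
    n_classes = len(stream_posteriors[0])
    if weights is None:
        weights = [1.0 / n_streams] * n_streams
    # Weighted average of log posteriors for each class.
    log_combined = [
        sum(w * math.log(max(p[c], EPS)) for w, p in zip(weights, stream_posteriors))
        for c in range(n_classes)
    ]
    # Renormalize with a log-sum-exp so the result sums to one.
    top = max(log_combined)
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_combined))
    return [math.exp(v - log_norm) for v in log_combined]


def to_log_features(posteriors):
    """Take logs so the skewed posterior values look more Gaussian."""
    return [math.log(max(p, EPS)) for p in posteriors]


# Hypothetical outputs of three MLP streams for one frame and three classes.
streams = [
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.60, 0.10, 0.30],
]
combined = combine_posteriors(streams)
features = to_log_features(combined)
print([round(p, 3) for p in combined])
```

Because each stream is processed independently until the merge, the per-stream computation is the part that lends itself to parallel execution as the number of streams grows.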

 


   
Copyright © 2007 International Computer Science Institute. All Rights Reserved.