24 Years of Speech Recognition Work at ICSI

Speech recognition has recently become a popular topic, with Apple’s Siri and other voice assistants making frequent appearances in the news. But the field’s recent surge in popularity isn’t the result of any major breakthrough. Rather, advances in speech recognition have been incremental, and, according to Speech Group researchers, there’s still plenty of work to do before technology can understand human speech as well as humans do.

“Machine intelligence fails at all sorts of things that humans don’t,” said Nelson Morgan, who has led the group since its formation in 1988.

ICSI has had its share of success in the 24 years that researchers here have been working on speech recognition. What is now the Speech Group began as the Realization Group, reflecting its focus on building, or realizing, machines powerful enough to process the algorithms used in research. But even in the early years, researchers were interested in problems related to speech processing.

In a speech recognition system, audio is segmented into sequential chunks of speech, and features are extracted from each chunk. Ideally, these features are chosen to distinguish well between different speech sounds (for instance, between “ba” and “ga”). At the same time, they should be insensitive to noise and other non-speech effects, which can be difficult to achieve.
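As a rough sketch of that front end (illustrative only, written with NumPy and made-up frame sizes rather than any particular ICSI toolkit), the audio can be cut into short overlapping frames and a log-magnitude spectrum computed for each one; real recognizers typically go on to compute mel filterbank or cepstral features from these spectra.

    import numpy as np

    def log_spectral_features(signal, sample_rate=16000,
                              frame_ms=25, hop_ms=10):
        """Cut a 1-D audio array into overlapping frames and return a
        log-magnitude spectrum per frame, a crude stand-in for the
        features (e.g., mel cepstra) used in real recognizers."""
        frame_len = int(sample_rate * frame_ms / 1000)   # ~25 ms chunks
        hop_len = int(sample_rate * hop_ms / 1000)       # ~10 ms step
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop_len):
            frame = signal[start:start + frame_len] * window
            spectrum = np.abs(np.fft.rfft(frame))
            frames.append(np.log(spectrum + 1e-10))      # log compresses dynamic range
        return np.array(frames)                          # shape: (num_frames, num_bins)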

An acoustic model, which in most systems is a Hidden Markov Model (HMM), then determines the probability that each audio chunk is a particular sound based on its associated features. Independently, a language model determines the probability of a sequence of words without considering the acoustics. For example, the phrase “I want ice cream” occurs more commonly than “I won Ty’s ream”; a language model would favor the first phrase, even if the acoustic model favored the second.
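A toy example of how the two scores combine (the probabilities below are invented purely for illustration): a decoder working in the log domain simply adds the acoustic and language model scores for each hypothesis, often with a weight on the language model, and keeps the best total.

    import math

    # Hypothetical scores for the two competing hypotheses from the text.
    hypotheses = {
        "I want ice cream": {"acoustic": math.log(0.20), "lm": math.log(1e-6)},
        "I won Ty's ream":  {"acoustic": math.log(0.25), "lm": math.log(1e-12)},
    }

    lm_weight = 1.0  # real systems tune this weight
    best = max(hypotheses,
               key=lambda h: hypotheses[h]["acoustic"] + lm_weight * hypotheses[h]["lm"])
    print(best)  # "I want ice cream": the language model outweighs the acoustics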

Morgan says speech research of the 1980s and 1990s may have suffered from too great an emphasis on refining these probabilistic models. “How you use features and what features you use are critically important to all sorts of tasks,” he said.

In 1990, ICSI hosted a workshop dedicated to front-end speech processing. At the workshop, Jordan Cohen (later to become a frequent ICSI collaborator) presented the “Problem of the Inverse E”: if you build a system to filter out the spectrum of the sound “E” from a speech data set, a human listener can still hear the “E’s”.

“We realized that the perception of each speech sound category from continuous speech couldn’t possibly be due to the gross spectrum of the sound in the small chunks that are used in speech recognition systems,” Morgan said. “This was a heretical view at the time.”

Morgan worked with Hynek Hermansky, then a researcher at US West, to develop a new way of analyzing audio, called Relative Spectral Processing, or RASTA. As opposed to standard approaches, this method was more sensitive to spectral changes over time and less sensitive to the spectrum itself. This kind of processing helps systems handle the sometimes drastic differences in spectrum between the data used to train the models and the data used to test them.

For example, most speech systems at the time had difficulty recognizing audio recorded on microphones different from those used to record their training data. With RASTA, what’s important is the change from one moment to the next, not the absolute audio spectrum at any given point. This means that differences between microphones become less important in the speech recognition process. Morgan said the early work on RASTA features, as well as much more recent successes, stressed the importance of front-end processing. Speaking of the problem of training/test spectral mismatch, he said, “We woke people up to the fact that this was a problem. We weren’t the first people to suggest that, but we were the first ones to talk about it so loudly.”
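The core trick can be sketched in a few lines (the coefficients below are illustrative, chosen to resemble the published RASTA band-pass filter rather than to reproduce it exactly): each frequency band’s log-energy trajectory is filtered over time, so a constant channel coloration, which shows up as an additive offset in the log domain, is largely removed while speech-rate modulations pass through.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_like_filter(log_spectra, pole=0.94):
        """Band-pass filter each frequency band's trajectory over time.
        log_spectra: array of shape (num_frames, num_bands), e.g. from the
        front-end sketch above. Coefficients are illustrative; the published
        RASTA filter uses a similar derivative-style numerator and one pole."""
        numer = np.array([2.0, 1.0, 0.0, -1.0, -2.0]) / 10.0  # emphasizes change over time
        denom = np.array([1.0, -pole])                         # smooths, removes very slow drift
        # Filtering along the time axis cancels constant offsets, so a fixed
        # microphone/channel coloration largely drops out.
        return lfilter(numer, denom, log_spectra, axis=0)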

RASTA was used in ICSI’s Berkeley Restaurant Project (BeRP), a spoken dialog system that gave restaurant recommendations. The system was unusual in that both the system and its users could initiate questions, and the system could continue a conversation even when users did not respond directly to its questions. BeRP incorporated a speech recognizer, a natural language backend that parsed words and produced database queries, a restaurant database, and accent detection and modeling algorithms that helped the system understand foreign accents and nonstandard pronunciations. The system was developed by Morgan, postdoctoral fellows Gary Tajchmann and Dan Jurafsky, and graduate students Chuck Wooters and Jonathan Segal.

Wooters, who was Morgan’s first graduate student and who recently returned to ICSI as a senior researcher, said the system had a tight integration between natural language understanding and speech recognition. “You didn’t think of the speech recognizer as static,” he said. “It was more of a living system.”

RASTA is an example of technology emulating human systems, a theme throughout much of the speech recognition work at ICSI. “It’s really important to pay attention to what mechanisms we can discover from biological systems,” Morgan said.

His doctoral thesis work was on digitally reproducing some of the effects of room acoustics on speech and music. His approach relied on aspects of human perception. When a sound is made inside a room, a listener first hears the sound that travels directly to his or her ears, and then hears the reverberations as the sound bounces off different surfaces. An analysis by Leo Beranek (cofounder of BBN) in the 1960s had shown that the early sounds were special: the concert halls preferred by conductors were similar in that the time between the direct sound and the first reflection was about the same. Morgan said he realized, “Maybe there’s something critically important about those first sounds.” He built hardware and software to reproduce the early sounds in detail, while using a coarser method to approximate the later reverberation. The system performed well: in listening tests, participants reliably matched the reproduced sounds to the correct room characteristics. The work incorporated ideas from psychoacoustics, the study of how audio stimuli affect perceptual processes.
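A toy version of that split, assuming a measured early impulse response and approximating the late reverberation with exponentially decaying noise (both stand-ins invented here, not Morgan’s actual implementation), might look like this:

    import numpy as np

    def reverberate(dry, early_ir, sample_rate=16000,
                    tail_seconds=1.0, decay_db=60.0):
        """Toy illustration of the detailed-early / coarse-late idea:
        convolve the dry signal with an early impulse response, then add a
        statistically generated late-reverberation tail."""
        early = np.convolve(dry, early_ir)                   # exact early reflections
        # Coarse late reverb: exponentially decaying white noise.
        n_tail = int(tail_seconds * sample_rate)
        t = np.arange(n_tail) / sample_rate
        decay = 10.0 ** (-decay_db * t / (20.0 * tail_seconds))
        tail_ir = np.random.randn(n_tail) * decay * 0.05
        late = np.convolve(dry, tail_ir)
        out = np.zeros(max(len(early), len(late)))
        out[:len(early)] += early
        out[:len(late)] += late
        return out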

The Speech Group is also interested in the effect that physiology has on the perception of audio. Research has shown that certain parts of the auditory system, of both humans and animals, are more attuned to certain aspects of audio. “The physiology gives a clue to some things that were harder to notice with just perception,” Morgan said. “It’s a potential source of inspiration for the kinds of things that we want to achieve.”

Early work on machine learning was greatly inspired by models of neurons, and artificial neural networks have been used in parts of some speech recognition systems for decades. Morgan has worked with neural networks since his days as a researcher at National Semiconductor, where he used them in a speech analysis system. Then, in the 1990s, Morgan collaborated with Hervé Bourlard, now the director of IDIAP, on a hybrid approach that combined neural networks with HMMs in a single statistical framework. HMMs, which are used in almost all current speech recognizers, give the probability that a piece of sound is a particular word (or part of a word or a sentence).

HMMs require a set of acoustic probabilities (that is, how likely it is that a chunk of sound corresponds to a particular speech sound like “ba”). In Bourlard and Morgan’s hybrid system, those probabilities are determined by an artificial neural network. Bourlard and Morgan’s paper on the approach, summarizing their joint work over the previous seven years, won an IEEE Signal Processing Magazine best paper award from the Signal Processing Society in 1996, and their work together inspired other research directions throughout the 1990s. The hybrid approach is now experiencing a comeback with the growing popularity of work on deep learning.
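In outline, the hybrid recipe looks something like the sketch below (a simplified rendering of the general approach, not the authors’ code): a network trained to classify feature frames into phone classes outputs posterior probabilities, which are divided by the class priors to give the scaled likelihoods an HMM decoder can use.

    import numpy as np

    def scaled_likelihoods(frame_features, mlp_forward, phone_priors):
        """Hybrid HMM/ANN acoustic scoring, sketched.
        mlp_forward: a trained network mapping one feature frame to
                     posterior probabilities over phone classes.
        phone_priors: phone class frequencies estimated from training data.
        Dividing posteriors by priors gives likelihoods up to a constant
        factor, which is what the HMM decoder needs."""
        posteriors = np.array([mlp_forward(f) for f in frame_features])
        return posteriors / phone_priors   # broadcast over frames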

By the mid 1990s, the Speech Group was looking for more difficult problems. Morgan said, “We were mostly looking at robustness in some sense – why are speech recognition systems breaking down? How do you make them less sensitive?”

A student suggested that Morgan, who was on his way to a meeting in Europe, keep track of times when speech recognition would have been useful for a handheld device, for instance to manage calendar functions (as can now be done with Siri). When he returned, Morgan realized that what he needed was not a personal electronic assistant but an easy way of recording and retrieving notes from the meeting.

“All of the sudden it struck me: that’s the application that would appeal to me. You want to be able to have access to information from some extended meeting or meetings by querying for it,” he said. Perhaps more importantly for Morgan, such an application would drive research in many areas.

From this idea emerged the ICSI Meeting Corpus, a collection of recorded audio from meetings held at the Institute, along with transcriptions to aid in training speech recognition systems. At the time, it was the largest corpus of transcribed meetings available.

It was important that these recordings were of spontaneous speech: the corpus included laughter, overlapping speech from multiple people talking at the same time, and vocalized pauses – “ums” and so forth. It also included speech recorded far from the speaker. These elements made for interesting problems in speech recognition, which the team set about solving.

The main focus of Morgan’s work has always been to solve fundamental problems rather than to push for incremental improvements in speech technology. For example, Speech Group researchers are currently developing ways to build speech recognizers for languages that don’t have much training data. Morgan said this will force the team to significantly alter speech recognition methods.

“There’s a lot to be gained in using lots of data,” he said. “But it can mask how dumb your model is.”

Another problem facing speech recognizers is how to handle speech with lots of background noise. Humans are generally pretty good at distinguishing speech from noise; machines are not. “The best-functioning systems try to get around this by having people speak directly into microphones,” Morgan said. But that technique, of course, doesn’t address the fundamental problems, and doesn’t cover many practical situations.

One problem is that speech recognition relies on algorithms that have been used since the 1960s. Speech Group researcher Steven Wegmann said that HMMs rely on assumptions that are “really strong and really wrong.”
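The assumptions Wegmann is referring to are presumably the standard ones: the hidden state sequence forms a first-order Markov chain, and each feature frame depends only on the state that generated it. Under those assumptions, the joint probability of an utterance factors into a simple sum of log terms, as in this sketch (the probability functions here are hypothetical placeholders):

    import math

    def hmm_log_joint(states, frames, trans_prob, emit_prob, init_prob):
        """Joint log probability of a state sequence and feature frames
        under the two standard HMM assumptions:
          1. states form a first-order Markov chain, and
          2. each frame depends only on its own state (frames are
             conditionally independent given the states).
        These are the 'strong' assumptions that real speech violates."""
        logp = math.log(init_prob(states[0])) + math.log(emit_prob(states[0], frames[0]))
        for t in range(1, len(frames)):
            logp += math.log(trans_prob(states[t - 1], states[t]))   # assumption 1
            logp += math.log(emit_prob(states[t], frames[t]))        # assumption 2
        return logp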

Morgan said in a recent interview on Marketplace Tech that improvements to speech technology come slowly because the algorithms used for speech recognition – HMMs – have not changed since the 1960s. “The major source of improvements has been the speed-up improvements in computers,” he said. “But it’s still the same basic fundamental algorithms.”

Now, Wegmann and his colleagues at ICSI are trying to figure out what’s wrong with HMMs. They will do this by simulating data and statistically analyzing errors. The project also includes a survey of experts in speech recognition to get their opinions on where the technology is failing, what has been tried that didn’t work, and what still looks promising.