Rapid Creation of Keyword Search Systems

Monday, July 22, 2013

Speech researchers at ICSI are hard at work on the Swordfish project, an ambitious effort to rapidly generate keyword search systems in a new language with modest resources. The ultimate goal is to develop a system for a new language in a week using only a small amount of labeled training data: as little as 10 hours, whereas most ASR systems are trained on hundreds or thousands of hours of labeled data.

In the first year, ICSI's team, which includes collaborators from Columbia, Northwestern, Ohio State, and the University of Washington, completed two systems, one HTK-based and one Kaldi-based, for five different languages.

The Swordfish team focused particularly on analysis methods that show what is being improved when systems are combined, and on upper bounds for improvements in estimating how probable each keyword is. Analysis methods like these can inform future research and ultimately lead to better systems.

The team has been working on enhancements for all parts of the process, including a novel approach to pitch tracking. This approach has turned out to be useful for the ICSI systems, even for non-tonal languages. Other areas of study include morphological analysis and novel language modeling approaches. A key contribution of the effort is the System for Running Systems (SRS), a set of scripts and libraries that greatly ease the running of complex software systems like the ones developed in this project.

Swordfish is funded through the Babel Program of the Intelligence Advanced Research Projects Activity (IARPA), a research arm of the Office of the Director of National Intelligence, which invests in high-risk/high-payoff research programs.