"Broadcast News segmentation using Metadata extraction and Speech-to-text
information to improve Speech Recognition"
My talk presents the work I have been doing here at ICSI over the last four months. This work was carried out within the framework of the EARS Rich Transcription program, using both the Metadata and Speech-To-Text capabilities.
Perfect segmentation of the audio stream (based on known speaker identities, sentence boundaries, etc.) lowers the Word Error Rate (WER) of an Automatic Speech Recognition (ASR) system relative to the WER the same system obtains when fed automatically generated segments. Therefore, by generating automatic segments that are as close as possible to these so-called perfect segments, we can reduce the WER of the system.
In this work, this approach is applied to the Broadcast News (BN) speech corpus. The only resources used to achieve this goal are Metadata Extraction (MDE) and Speech-To-Text (STT) information.
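
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in this work; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, unlike real evaluation tools.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, comparing the WER obtained with automatic segments against the WER obtained with reference segments gives the gap that better segmentation aims to close.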