Fast Speaker Diarization Using a High-Level Scripting Language

Title: Fast Speaker Diarization Using a High-Level Scripting Language
Publication Type: Conference Paper
Year of Publication: 2011
Authors: Gonina, E., Friedland, G., Cook, H., & Keutzer, K.
Other Numbers: 3200

Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50-250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy.
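
To give a rough sense of what the speedups quoted in the abstract mean for large datasets, the short sketch below converts a real-time factor into wall-clock processing time. The 100-hour workload and the helper name `processing_hours` are assumptions chosen for illustration; they do not come from the paper, and the sketch ignores overheads such as data loading.

```python
def processing_hours(audio_hours: float, speedup: float) -> float:
    """Wall-clock hours needed to process audio at `speedup` times real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours / speedup


# Assumed workload for illustration only: 100 hours of recorded audio.
audio_hours = 100.0

# 1x is roughly real-time processing; 50x and 250x are the bounds
# of the speedup range quoted in the abstract.
for speedup in (1.0, 50.0, 250.0):
    hours = processing_hours(audio_hours, speedup)
    print(f"{speedup:>5.0f}x real time: {hours:.2f} h")
```

Under these assumptions, a workload that takes about 100 hours at roughly real-time speed would take between 0.4 and 2 hours at the reported 50-250× range.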


This work was partially supported by funding provided to ICSI by the U.S. Defense Advanced Research Projects Agency (DARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of DARPA or of the U.S. Government. This work was also partially supported by funding provided by CISCO, Microsoft, Intel, and U.C. Discovery.

Bibliographic Notes

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2011), Big Island, Hawaii

Abbreviated Authors

E. Gonina, G. Friedland, H. Cook, and K. Keutzer

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Article in conference proceedings