Publication Details
Title: Taming the Wild: Acoustic Segmentation in Consumer‐Produced Videos
Author: B. Elizalde and G. Friedland
Group: ICSI Technical Reports
Date: January 2013
PDF: http://www.icsi.berkeley.edu/pubs/techreports/ICSI_TR-12-016.pdf
Overview:
Audio segmentation is the process of partitioning data and identifying boundaries between different sounds. This task is commonly an early stage in speech processing tasks such as Automatic Speech Recognition (ASR) or Speaker Identification (SID). While traditional speech/non-speech segmentation systems have been designed for specific data conditions such as broadcast news or meetings, the growth of web videos brings new challenges for segmenting consumer-produced, or "wild," audio. This type of audio is an unstructured domain with little control over recording conditions. Despite the growth of "wild" audio, little research has been done on this domain or on domain-independent audio segmentation systems. This paper attempts to close that gap by creating and testing a semi-supervised approach that performs Codebook-Histogram Features (CHF) segmentation with Support Vector Machines (SVM) for speech detection in consumer-produced videos. Using the TRECVID MED 2011 web-video dataset and a well-known meetings corpus for speech detection, training/testing data combinations were designed to evaluate and better understand the performance of this new approach in contrast to a state-of-the-art Gaussian Mixture Model (GMM) system. The results revealed that the CHF approach outperformed the GMM method by 50% when detecting speech in meetings, but underperformed it by 44% on wild data. Furthermore, the CHF system was four times faster at processing audio files at the testing stage.
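The report itself does not include code; the following is a minimal sketch of how a codebook-histogram feature pipeline of this kind could look, assuming MFCC-like frame features, scikit-learn's KMeans for the codebook, and an RBF SVM as the segment classifier. All function names, parameter values, and the random placeholder data are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a codebook-histogram feature (CHF) pipeline for speech/non-speech
# classification of audio segments. Assumes each segment is already represented
# as a matrix of low-level frame features (e.g. MFCCs); names are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(frame_features, codebook_size=64, seed=0):
    """Cluster pooled frames into a codebook of acoustic 'words'."""
    return KMeans(n_clusters=codebook_size, random_state=seed).fit(frame_features)

def segment_histogram(codebook, segment_frames):
    """Represent one segment as a normalized histogram of codeword assignments."""
    words = codebook.predict(segment_frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Placeholder training data: 40 segments of 200 frames x 13 MFCC coefficients,
# with alternating speech (1) / non-speech (0) labels purely for illustration.
rng = np.random.default_rng(0)
train_segments = [rng.normal(size=(200, 13)) for _ in range(40)]
train_labels = np.array([i % 2 for i in range(40)])

# Training: pooled frames -> codebook -> per-segment histograms -> SVM.
codebook = build_codebook(np.vstack(train_segments))
X_train = np.array([segment_histogram(codebook, seg) for seg in train_segments])
clf = SVC(kernel="rbf").fit(X_train, train_labels)

# Testing: an unseen segment is reduced to a single histogram and classified,
# which is why the testing stage is cheap compared with frame-level scoring.
test_segment = rng.normal(size=(200, 13))
print(clf.predict([segment_histogram(codebook, test_segment)]))
```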
Bibliographic Information:
ICSI Technical Report TR-12-016
Bibliographic Reference:
B. Elizalde and G. Friedland. Taming the Wild: Acoustic Segmentation in Consumer‐Produced Videos. ICSI Technical Report TR-12-016, January 2013
