Selected Work
 
 

The task of speaker diarization is to segment and cluster a speech recording into speaker-homogeneous regions. Given an audio track of a meeting, for example, the system has to discriminate and label the different speakers automatically ("who spoke when?"). This includes subtasks such as “when is there speech?” (speech/non-speech detection) and “who is overlapping with whom?” (overlap detection and resolution).


Currently, we are trying to improve the efficiency of existing approaches and to create online algorithms ("who is speaking now?"). In addition, we are exploring applications built on top of diarization, such as inferring behavioral categories of a person from speaking length and/or interruptions, or semantic navigation in TV shows (see above).
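To make the "who spoke when?" output and the speaking-length and interruption cues more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a diarization result is available as a list of (start, end, speaker) segments in seconds. The `Segment` type and both helper functions are hypothetical names and are not part of our system.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """One speaker-homogeneous region (hypothetical output format)."""
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label assigned by the diarization system


def speaking_time(segments):
    """Return the total speaking time in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def overlaps(segments):
    """Return (start, end, first_speaker, second_speaker) for every pair of
    segments from different speakers that intersect in time."""
    ordered = sorted(segments)  # sorted by start time first
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # later segments start even later, so none can intersect a
            if a.speaker != b.speaker:
                found.append((b.start, min(a.end, b.end), a.speaker, b.speaker))
    return found


example = [
    Segment(0.0, 5.0, "A"),
    Segment(4.0, 9.0, "B"),   # B starts while A is still talking
    Segment(9.0, 12.0, "A"),
]
print(speaking_time(example))
print(overlaps(example))
```

In this toy example, the overlap between 4.0 s and 5.0 s is the kind of region that overlap detection would flag, and the per-speaker totals are one simple input for analyzing speaking length.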

Multimodal Speaker Diarization and Localization

Recent Publications:

  1. D. Imseng, G. Friedland: Tuning-Robust Initialization Methods for Speaker Diarization, IEEE Transactions on Audio, Speech and Language Processing, to appear 2010. (download preliminary PDF)

  2. G. Friedland, C. Yeo, H. Hung: Visual Speaker Localization Aided by Acoustic Models, Proceedings of ACM Multimedia, pp. 195-202, Beijing, China, October 2009. (download PDF)

  3. K. Boakye, O. Vinyals, G. Friedland: Two's a Crowd: Improving Speaker Diarization by Automatically Identifying and Excluding Overlapped Speech, Proceedings of Interspeech 2008, pp. 32-35, Brisbane, Australia, September 2008. (download PDF)

Main Publications:

  1. G. Friedland, K. Jantz, T. Lenz, F. Wiesel, R. Rojas: Object Cut and Paste in Images and Videos, International Journal of Semantic Computing Vol 1, No 2, pp. 221-247, World Scientific, USA, June 2007. (download PDF)

  2. G. Friedland, K. Jantz, R. Rojas: SIOX: Simple Interactive Object Extraction in Still Images, Proceedings of the 7th IEEE Symposium on Multimedia (ISM2005), pp. 253-259, Irvine, California, December 2005. (download PDF)

Simple Interactive Object Extraction

Input:

Output:

Input:

Output:

SIOX is a method for extracting foreground objects with sub-pixel accuracy from images and videos with little user interaction. It has been integrated into several open-source image manipulation programs such as GIMP, Inkscape, Blender, and NIH’s ImageJ library. Improvements to SIOX in GIMP were sponsored by Google through the Summer of Code 2009.

Website: http://www.siox.org

Main Publications:

  1. G. Friedland, K. Pauls: Architecting Multimedia Environments for Teaching, IEEE Computer, vol. 38, no. 6, pp. 57-64, June 2005. (download PDF)

  2. G. Friedland, K. Pauls: Towards a Demand Driven, Autonomous Processing and Streaming Architecture, Proceedings of Workshop on Engineering of Autonomic Systems 2005 (EASe'05) at the 12th Annual IEEE International Conference on the Engineering of Computer Based Systems (ECBS 2005), pp. 473, Greenbelt, Maryland, April 2005. (download PDF)

E-Chalk: An Update of the Traditional Chalkboard

The E-Chalk project aims at updating the traditional chalkboard. While keeping the pedagogical advantages and the handling of this established teaching tool, we want to introduce further capabilities such as automatic handwriting recognition and the possibility to create web lectures without extra effort. Both hardware and software have been developed. Website: http://www.echalk.de

Main Publications:

  1. G. Friedland: Adaptive Audio and Video Processing for Electronic Chalkboard lectures, Ph.D. Thesis, Department of Mathematics and Computer Science, Freie Universität Berlin, October 2006, ISBN: 978-1-4303-0388-6. (download PDF)

  2. G. Friedland, L. Knipping, E. Tapia, R. Rojas: Teaching With an Intelligent Electronic Chalkboard, Proceedings of the Workshop on Effective Telepresence, ACM Multimedia, New York, October 2004. (download PDF)

SOPA: Self-Organizing Processing and Streaming Architecture

SOPA is the architecture of a multimedia component framework that is an essential part of the lecture recording system E-Chalk. The goal is to provide an easy to use framework where dynamically organized processing graphs are built out of components from various distributed sources. Website: http://www.sopa.inf.fu-berlin.de

This piece of software is a result of the work on multimodal speaker diarization and acoustic event detection. The idea is to enable semantic navigation in a TV show, allowing the viewer to jump directly to particular punchlines or to action sequences of a certain actor. Our system won first prize in the ACM Multimedia Grand Challenge. For a demo click here.

Semantic Navigation in Broadcast TV

Recent Publications:

  1. G. Friedland, L. Gottlieb, A. Janin: Narrative-theme navigation for sitcoms supported by fan-generated scripts, accepted at AIEMPro Workshop at ACM Multimedia, Florence, Italy, October 2010. (download PDF)

  2. G. Friedland, L. Gottlieb, A. Janin: Joke-o-mat: Browsing Sitcoms Punchline by Punchline, Proceedings of ACM Multimedia, pp. 1115-1116, Beijing, China, October 2009. (download PDF)

This project aims to leverage all the GPS-tagged media available on the web as a training set for an automatic location detector. The idea is that visual landmarks and acoustic environmental specifics might narrow down the possible recording location for a given image, video, or audio track. After initial results, we are now also investigating the human accuracy baseline.
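As an illustration of how an estimated recording location could be compared against a GPS tag, the following Python sketch computes the great-circle distance between two coordinates with the haversine formula. Using this distance as an error measure is an assumption made here for illustration, not a description of the project's evaluation protocol. The function name and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the Earth is treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude)
    points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical example: GPS tag of a video vs. a location estimate.
tagged = (37.8716, -122.2727)
estimated = (37.7749, -122.4194)
print(f"estimation error: {haversine_km(*tagged, *estimated):.1f} km")
```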

Multimodal Location Estimation

Earlier Work

This page features a selection of current and recent research and engineering work. Earlier work can be found on my old project page.

Recent Publications:

  1. J. Choi and G. Friedland: Data-Driven vs Semantic-Technology-Driven Tag-Based Video Location Estimation, to appear in IEEE International Conference on Semantic Computing (ICSC 2011), Palo Alto, CA, September 2011.

  2. G. Friedland, O. Vinyals, T. Darrell: Multimodal Location Estimation, accepted as full paper at ACM Multimedia, Florence, Italy, October 2010. (download PDF)

Global Inference and Privacy

This project aims at qualifying and quantifying the privacy implications of people sharing their lives with other people on the Internet. It is a result of the feedback received on the Cybercasing article (see below), which was featured in the press (e.g. ABC News, New York Times). I am mostly concentrating on the implications of the analysis of multimedia data.



Recent Publications:

  1. H. Lei, J. Choi, A. Janin, and G. Friedland: Persona Linking: Matching Uploaders of Videos Across Accounts, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, May 2011. (download PDF)

  2. G. Friedland, R. Sommer: Cybercasing the Joint: On the Privacy Implications of Geotagging, accepted for Usenix HotSec 2010 at the Usenix Security Conference, Washington DC, August 2010. (download PDF)

Grounded Multimodal Language Acquisition

This project aims at training a robot to behave like a two-year-old with regard to language acquisition for objects and actions perceived via visual, audio, and tactile sensors. This project is a collaboration between ICSI, UC Berkeley, and UPenn. More info soon.



Past Projects

Ongoing Projects

Duplicate Video Detection using Acoustic Methods

This project aims at detecting duplicate parts of videos using only acoustic methods. Here, a duplicate is a part of one video that is identical in content, but not bit-identical, to a part of another video. More info soon.



Video Event Recounting using Acoustic and Multimodal Methods

This project aims at describing the content of a video based on a set of example videos. The computer learns concepts from example videos and then recounts the concepts seen in the query videos. The task is performed on a large set (100k) of “wild” Internet videos.



Recent Publications:

  1. Robert Mertens, Howard Lei, Luke Gottlieb, Gerald Friedland, Ajay Divakaran: Acoustic Super Models for Large Scale Video Event Detection, Joint ACM Workshop on Modelling and Representing Events, ACM Multimedia 2011, Scottsdale, AZ, December 2011.

  2. Po-Sen Huang, Robert Mertens, Ajay Divakaran, Gerald Friedland, Mark Hasegawa-Johnson: How to Put it into Words -- Using Random Forests to Extract Symbol Level Descriptions from Audio Content for Concept Detection, accepted for IEEE ICASSP, Kyoto, Japan, March 2012.