Big Data Projects

Theory and Practice of Randomized Algorithms for Ultra-Large-Scale Signal Processing

The dramatic increase in our ability to observe massive amounts of measurements coming from distributed and disparate high-resolution sensors has been instrumental in enhancing our understanding of many physical phenomena. Signal processing (SP) has been the primary driving force in extracting this knowledge of the unseen from observed measurements. However, in the last decade, the exponential increase in observations has outpaced our computing abilities to process, understand, and organize this massive but useful information.

Combining stochastics and numerics for more meaningful matrix computations

The amount of data in our world has exploded, with data being at the heart of modern economic activity, innovation, and growth. In many cases, data are modeled as matrices, since an m x n matrix A provides a natural structure to encode information about m objects, each of which is described by n features. As a result, linear algebraic algorithms, and in particular matrix decompositions, have proven extremely successful in the analysis of datasets in the form of matrices.
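One widely used family of randomized matrix decompositions follows the "sketch, then solve" pattern: compress the m x n matrix A with a random projection, then compute an exact decomposition of the small sketch. The sketch below is an illustrative randomized truncated SVD in this style (the function name, oversampling parameter, and defaults are our own choices, not a method prescribed by the project):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Sketch-based truncated SVD: sample the range of A with a random
    test matrix, orthonormalize, then take an exact SVD of the small sketch."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Gaussian test matrix with a few extra columns for accuracy.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega                       # m x (k+p): sample of A's range
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for that sample
    B = Q.T @ A                         # small (k+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The expensive steps touch A only through two matrix products, which is what makes this approach attractive for large datasets.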

Robust, Efficient, and Local Machine Learning Primitives

The large-scale data being generated in many application domains promise to revolutionize scientific discovery, engineering and technological development, social science understanding, and our ability to monitor masses and influence behavior in subtle ways. In most applications, however, this promise has yet to be fulfilled. One major reason for this is the difficulty of using, in a low-friction manner, cutting-edge algorithmic and statistical tools to explore the data and develop domain-informed models of the processes generating the data.

Streaming Algorithms for Fundamental Computations in Numerical Linear Algebra

Streaming algorithms that use every input datum once (single-pass) or scan the input a small number of times (multiple passes) are gaining importance due to the increasing volumes of data that are available for business, scientific, and security applications. Performing large-scale data analysis and machine learning often requires addressing numerical linear algebra primitives, such as L2 regression, singular value decompositions, L1 regression, and canonical correlations.
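To make the single-pass setting concrete, here is an illustrative sketch of streaming L2 regression: each row (a_i, b_i) is seen once, folded into a small random sketch of the system, and discarded; the least-squares problem is then solved on the sketch alone. The function name and sketch size are our own illustrative choices:

```python
import numpy as np

def streaming_sketch_regression(rows, targets, d, sketch_rows=200, seed=0):
    """Single-pass sketched least squares: accumulate S @ A and S @ b one
    row at a time, then solve the small sketched problem."""
    rng = np.random.default_rng(seed)
    SA = np.zeros((sketch_rows, d))
    Sb = np.zeros(sketch_rows)
    for a, b in zip(rows, targets):
        # Fresh random sketching column for this row; the row itself is
        # never stored, so memory is independent of the stream length.
        s = rng.standard_normal(sketch_rows) / np.sqrt(sketch_rows)
        SA += np.outer(s, a)
        Sb += s * b
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x
```

Memory usage is O(sketch_rows * d) regardless of how many rows stream past, which is the defining property of this class of algorithms.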

Machine Learning Methods and Large Informatics Graphs

In this project, researchers are tackling several problems with machine learning methods and large informatics graphs. First, they are looking at local algorithms and locally-biased algorithms, specifically extending local algorithms to other objective functions and characterizing their statistical properties. Second, they are scaling the algorithms up to larger networks, focusing on scaling up strongly-local and locally-biased methods and implementations on graphs that do not fit into RAM.
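A canonical example of a strongly-local algorithm is the push procedure for approximate personalized PageRank, whose running time depends only on the neighborhood it explores, not on the total graph size. The following is a minimal illustrative sketch (a non-lazy push variant; the function name and parameters are our own):

```python
def approx_ppr(graph, seed_node, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank via local pushes.

    graph: dict mapping node -> list of neighbors (all nodes are keys).
    Maintains estimates p and residuals r; a node is pushed only when its
    residual exceeds eps times its degree, so only nodes near the seed
    are ever touched.
    """
    p, r = {}, {seed_node: 1.0}
    queue = [seed_node]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r.get(u, 0.0) < eps * deg:
            continue                      # already below threshold; skip
        ru = r[u]
        p[u] = p.get(u, 0.0) + alpha * ru # keep an alpha fraction at u
        r[u] = 0.0
        share = (1 - alpha) * ru / deg    # spread the rest to neighbors
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
    return p
```

Each push removes at least an alpha fraction of some residual, so the total work is bounded in terms of 1/(alpha * eps) rather than the number of nodes or edges.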

Characterizing and Exploiting Tree-Like Structure in Large Social and Information Networks

In this project, researchers are developing methods to characterize and exploit "tree-like" structure in realistic social and information networks. In particular, they are focused on two related but complementary notions of tree-likeness, as well as related heuristic variants, for graphs. These notions will be used to develop tools to characterize the manner in which realistic complex networks are coarsely tree-like, and this characterization will in turn be used to develop tools for improved analytics on realistic networks.

Randomized Numerical Linear Algebra (RandNLA) for Multi-Linear and Non-Linear Data

This project investigates two important, non-linear, structural settings in order to start making progress toward using RandNLA (Randomized Numerical Linear Algebra) approaches to big data analysis in situations where the underlying data exhibit non-linear structure. First, researchers investigate how to design the next generation of RandNLA algorithms that can handle data that exhibit multi-linear structures captured by tensors.
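A basic building block for extending RandNLA ideas to tensors is to matricize (unfold) the tensor along one mode and apply a randomized range finder to that unfolding, as in randomized variants of Tucker/HOSVD decompositions. The sketch below illustrates this single step; the function names and parameters are our own illustrative choices, not the project's specific algorithms:

```python
import numpy as np

def mode_unfold(T, mode):
    """Matricize tensor T along `mode`: rows of the result index that mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def sketched_mode_basis(T, mode, k, oversample=5, seed=0):
    """Estimate a rank-k orthonormal basis for one mode of T by randomly
    projecting the mode unfolding -- one building block of a randomized
    Tucker decomposition."""
    rng = np.random.default_rng(seed)
    M = mode_unfold(T, mode)
    Omega = rng.standard_normal((M.shape[1], k + oversample))
    Q, _ = np.linalg.qr(M @ Omega)      # basis for the sampled range
    return Q[:, :k]
```

Repeating this for each mode and projecting T onto the resulting bases yields a small core tensor, the multi-linear analogue of a truncated SVD.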

The Berkeley Data Analysis System

In this project, researchers at ICSI are extending and applying recent work on randomized algorithms for matrix-based machine learning problems to the computational infrastructure recently developed at the AMPLab, UC Berkeley. One of the challenges in large-scale machine learning is that MapReduce/Hadoop does not perform well for iterative algorithms that are common in matrix-based machine learning. Examples of such iterative algorithms include common algorithms for least-squares approximation, least absolute deviations approximation, low-rank matrix approximation, etc.

Leverage Subsampling for Regression and Dimension Reduction

In this collaborative project between UC Berkeley, the University of Illinois at Urbana-Champaign, and ICSI, scientists are working toward an integrated treatment of statistical and computational issues. The first research thrust focuses on the statistical properties of subsampling estimators based on statistical leverage scores in linear regression. The second research thrust generalizes the theory and methods to nonlinear regression and dimension reduction models.
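The core idea of leverage-based subsampling can be sketched compactly: compute the statistical leverage scores (the diagonal of the hat matrix), sample rows with probability proportional to them, rescale for unbiasedness, and solve the small weighted least-squares problem. This is a minimal illustration in that spirit; the function names and sample-size choices are our own:

```python
import numpy as np

def leverage_scores(X):
    """Statistical leverage scores: diagonal of the hat matrix
    H = X (X^T X)^{-1} X^T, computed stably via a thin QR."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def leverage_subsample_ols(X, y, m, seed=0):
    """Least squares on m rows sampled with probability proportional to
    leverage, with importance-sampling rescaling of the kept rows."""
    rng = np.random.default_rng(seed)
    lev = leverage_scores(X)
    prob = lev / lev.sum()              # leverage scores sum to d
    idx = rng.choice(len(y), size=m, replace=True, p=prob)
    w = 1.0 / np.sqrt(m * prob[idx])    # rescale so expectations match
    beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return beta
```

Rows with high leverage are exactly those that most influence the fit, which is why sampling by leverage tends to outperform uniform subsampling on matrices with nonuniform row importance.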

Scalable Statistics and Machine Learning for Data-Centric Science

Researchers from Lawrence Berkeley National Laboratory, UC Berkeley, and ICSI are developing and applying new statistics and machine learning algorithms that can operate on real-world datasets produced by a diverse range of experimental and observational facilities. This is a critical capability in facilitating big data analysis, which will be essential for scientific progress in the foreseeable future.