# Big Data Projects

## Real-Time Data Reduction Codesign at the Extreme Edge for Science

This project focuses on intelligent ML-based data reduction and processing as close as possible to the data source. Per-sensor compression and efficient aggregation of information, done while preserving scientific fidelity, can substantially reduce data rates further downstream and change the way that experiments are designed and operated. The research team is concentrating on powerful, specialized compute hardware at the extreme edge (such as FPGAs, ASICs, and systems-on-chip), which typically forms the initial processing layer of many experiments.

## Scalable linear algebra and neural network theory

While deep learning methods have without doubt transformed certain applications of machine learning (ML), such as Computer Vision (CV) and Natural Language Processing (NLP), their promised impact on many other areas has yet to be seen. The reason for this is the flip side of why deep learning has been successful where it has.

## Scalable Second-order Methods for Training, Designing, and Deploying Machine Learning Models

Scalable algorithms that can handle the large-scale nature of modern datasets are an integral part of many applications of machine learning (ML). Among these, efficient optimization algorithms, as the bread and butter of many ML methods, hold a special place. Optimization methods that use only first derivative information, i.e., first-order methods, are the most common tools used in training ML models. This is despite the fact that many of these methods come with inherent disadvantages such as slow convergence, poor communication, and the need for laborious hyper-parameter tuning.
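The slow convergence of first-order methods can be seen on a toy problem. The sketch below is only an illustration, not the project's method: it assumes a hypothetical ill-conditioned quadratic `f(x) = 0.5 * (x[0]**2 + 100 * x[1]**2)` and compares plain gradient descent with a Newton step that uses the (here diagonal) Hessian. The function names, step size, and iteration counts are all assumptions chosen for the example.

```python
# Toy objective (an assumption for illustration):
#   f(x) = 0.5 * (x[0]**2 + 100 * x[1]**2)
# Its Hessian is diagonal with entries 1 and 100, so the condition number is 100.
HESSIAN_DIAG = [1.0, 100.0]


def grad(x):
    """Gradient of the toy quadratic."""
    return [x[0], 100.0 * x[1]]


def gradient_descent(x0, step, iters):
    """First-order method: uses only gradient information."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def newton(x0, iters):
    """Second-order method: rescales the gradient by the inverse Hessian."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - gi / hi for xi, gi, hi in zip(x, g, HESSIAN_DIAG)]
    return x


# Step size 1/100 is the largest "safe" step for the steep direction,
# which makes progress along the flat direction slow.
x_gd = gradient_descent([1.0, 1.0], step=0.01, iters=100)
x_newton = newton([1.0, 1.0], iters=1)
print("gradient descent after 100 steps:", x_gd)
print("Newton after 1 step:", x_newton)
```

On this quadratic, one Newton step lands on the minimizer, while gradient descent is still far from it in the flat direction after 100 steps. Real models do not have cheap diagonal Hessians, which is why scalable approximations to second-order information are a research topic in their own right.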

## Backdoor Detection via Eigenvalues, Hessians, Internal Behaviors, and Robust Statistics

Although Deep Neural Networks (DNNs) have achieved impressive performance in many applications, they exhibit several by-now well-known sensitivities. Perhaps the most prominent of these is sensitivity in various types of adversarial environments. As an example, recall that it is common in practice to outsource the training of a model (which is known as Machine Learning as a Service, MLaaS) or to use third-party pre-trained networks (and then perform fine-tuning or transfer learning).

## Previous Work: Implement and Evaluate Matrix Algorithms in Spark on High Performance Computing Platforms for Science Applications

The overall goal of this project is to enable the Berkeley Data Analytics Stack (BDAS) to run efficiently on the Cray XC30 and Cray XC40 supercomputer platforms. BDAS has a rich set of capabilities and is of interest as a computational environment for very large-scale machine learning and data analysis applications. To extend the capabilities of BDAS, ICSI researchers will consider the performance of deterministic and randomized matrix algorithms for problems such as least-squares approximation and low-rank matrix approximation that underlie many common machine-learning algorithms.

## Previous Work: Local Algorithms for Large Informatics Graphs

A serious problem with many existing machine learning and data analysis tools in the complex networks area is that they are often very brittle and/or do not scale well to larger networks. As a consequence, analysts often develop intuition on small networks, with 10^2 or 10^3 nodes, and then try to apply these methods to larger networks, with 10^5 or 10^7 or more nodes. Larger networks, however, often have very different static and dynamic properties from smaller networks.

## Theory and Practice of Randomized Algorithms for Ultra-Large-Scale Signal Processing

The dramatic increase in our ability to observe massive numbers of measurements from distributed and disparate high-resolution sensors has been instrumental in enhancing our understanding of many physical phenomena. Signal processing (SP) has been the primary driving force in extracting this knowledge of the unseen from observed measurements. However, in the last decade, the exponential increase in observations has outpaced our computing abilities to process, understand, and organize this massive but useful information.

## Combining stochastics and numerics for more meaningful matrix computations

The amount of data in our world has exploded, with data being at the heart of modern economic activity, innovation, and growth. In many cases, data are modeled as matrices, since an m × n matrix A provides a natural structure to encode information about m objects, each of which is described by n features. As a result, linear algebraic algorithms, and in particular matrix decompositions, have proven extremely successful in the analysis of datasets in the form of matrices.
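As a small illustration of the objects-by-features view, the sketch below builds a synthetic data matrix and compresses it with a truncated singular value decomposition (SVD), one of the standard matrix decompositions. The sizes `m`, `n`, `k` and the synthetic data are assumptions made up for the example; it is not taken from the project itself.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 50, 3  # m objects, n features, target rank k (all hypothetical)

# Synthetic data: approximately rank-k structure plus a little noise.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A += 0.01 * rng.standard_normal((m, n))

# Thin SVD: A = U @ diag(s) @ Vt, singular values in s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular triplets to get the best rank-k approximation
# in the Frobenius norm (Eckart-Young theorem).
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print("relative Frobenius error of rank-%d approximation: %.4f" % (k, rel_err))
```

Storing `U[:, :k]`, `s[:k]`, and `Vt[:k]` takes about `k * (m + n + 1)` numbers instead of `m * n`, which is the basic reason low-rank decompositions are useful for large data matrices.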

## Robust, Efficient, and Local Machine Learning Primitives

The large-scale data being generated in many application domains promise to revolutionize scientific discovery, engineering and technological development, social science understanding, and our ability to monitor masses and influence behavior in subtle ways. In most applications, however, this promise has yet to be fulfilled. One major reason for this is the difficulty of using, in a low-friction manner, cutting-edge algorithmic and statistical tools to explore the data and develop domain-informed models of the processes generating the data.

## Previous Work: Streaming Algorithms for Fundamental Computations in Numerical Linear Algebra

Streaming algorithms that use every input datum once (single-pass) or scan the input a small number of times (multiple passes) are gaining importance due to the increasing volumes of data that are available for business, scientific, and security applications. Performing large-scale data analysis and machine learning often requires addressing numerical linear algebra primitives, such as L2 regression, singular value decompositions, L1 regression, and canonical correlations.
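To make the single-pass idea concrete, here is a minimal sketch for L2 regression. It assumes the number of features `d` is small enough that a `d × d` summary fits in memory, and it uses the textbook normal-equations approach (accumulating `A^T A` and `A^T b` row by row) rather than any specific algorithm from this project. The helper names and the synthetic stream are hypothetical.

```python
def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def streaming_least_squares(rows, d):
    """Single-pass L2 regression: each (features, target) pair is read once."""
    gram = [[0.0] * d for _ in range(d)]  # running A^T A, size d x d
    rhs = [0.0] * d                       # running A^T b, length d
    for features, target in rows:
        for i in range(d):
            rhs[i] += features[i] * target
            for j in range(d):
                gram[i][j] += features[i] * features[j]
    return solve(gram, rhs)


def synthetic_stream(n):
    """Hypothetical data stream: target = 2 * x + 1, features = [x, 1]."""
    for k in range(n):
        x = k / 10.0
        yield [x, 1.0], 2.0 * x + 1.0


coef = streaming_least_squares(synthetic_stream(1000), d=2)
print("estimated slope and intercept:", coef)
```

The memory used is `O(d^2)` regardless of how many rows arrive. One known drawback of this simple approach is that forming `A^T A` squares the condition number of the problem, which is part of the motivation for more careful streaming and sketching algorithms.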

## Previous Work: Machine Learning Methods and Large Informatics Graphs

In this project, researchers are tackling several problems with machine learning methods and large informatics graphs. First, they are looking at local algorithms and locally-biased algorithms, specifically extending local algorithms to other objective functions and the characterization of statistical properties of local algorithms. Second, they are scaling the algorithms up to larger networks, focusing on scaling up strongly-local and locally-biased methods and implementations on graphs that do not fit into RAM.

## Previous Work: Characterizing and Exploiting Tree-Like Structure in Large Social and Information Networks

In this project, researchers are developing methods to characterize and exploit "tree-like" structure in realistic social and information networks. In particular, they are focused on two related but complementary notions of tree-like-ness, as well as related heuristic variants, for graphs. These notions will be used to develop tools to characterize the manner in which realistic complex networks are coarsely tree-like, and this characterization will be used to develop tools for improved analytics on realistic networks.

## Randomized Numerical Linear Algebra (RandNLA) for Multi-Linear and Non-Linear Data

This project investigates two important, non-linear, structural settings in order to start making progress toward using RandNLA (Randomized Numerical Linear Algebra) approaches to big data analysis in situations where the underlying data exhibit non-linear structure. First, researchers investigate how to design the next generation of RandNLA algorithms that can handle data that exhibit multi-linear structures captured by tensors.

## Previous Work: The Berkeley Data Analysis System

In this project, researchers at ICSI are extending and applying recent work on randomized algorithms for matrix-based machine learning problems to the computational infrastructure recently developed at the AMPLab, UC Berkeley. One of the challenges in large-scale machine learning is that MapReduce/Hadoop does not perform well for iterative algorithms that are common in matrix-based machine learning. Examples of such iterative algorithms include common algorithms for least-squares approximation, least absolute deviations approximation, low-rank matrix approximation, etc.

## Previous Work: Leverage Subsampling for Regression and Dimension Reduction

In this collaborative project between UC Berkeley, University of Illinois, Urbana-Champaign, and ICSI, scientists are working toward an integrated treatment of statistical and computational issues. The first research thrust focuses on studying the statistical properties of subsampling estimators that use statistical leverage scores in linear regression. The second research thrust generalizes the theory and methods to nonlinear regression and dimension reduction models.
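For readers unfamiliar with leverage-based subsampling, the following sketch shows the basic idea for linear regression under some assumptions: leverage scores are computed exactly from a thin QR factorization, rows are sampled with replacement with probability proportional to their leverage, and sampled rows are rescaled before solving the smaller problem. The function names, problem sizes, and synthetic data are hypothetical, and the estimators studied in the project may differ.

```python
import numpy as np


def leverage_scores(A):
    """Leverage of row i is the squared norm of row i of Q in a thin QR of A."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q ** 2, axis=1)


def leverage_subsample_lstsq(A, b, s, rng):
    """Solve least squares on s rows sampled proportionally to leverage."""
    lev = leverage_scores(A)
    p = lev / lev.sum()                       # sampling probabilities
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[idx])             # rescaling weights for sampled rows
    x, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)
    return x


# Hypothetical synthetic regression problem.
rng = np.random.default_rng(0)
n, d = 5000, 5
A = rng.standard_normal((n, d))
x_true = np.arange(1.0, d + 1.0)
b = A @ x_true + 0.01 * rng.standard_normal(n)

x_sub = leverage_subsample_lstsq(A, b, s=500, rng=rng)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
print("max difference between subsampled and full solutions:",
      np.max(np.abs(x_sub - x_full)))
```

Computing exact leverage scores costs about as much as solving the full problem, so this version is useful mainly for studying statistical behavior; practical pipelines usually rely on fast approximations of the scores.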

## Scalable Statistics and Machine Learning for Data-Centric Science

Researchers from Lawrence Berkeley Laboratory, UC Berkeley, and ICSI are developing and applying new statistics and machine learning algorithms that can operate on real-world datasets produced by a diverse range of experimental and observational facilities. This is a critical capability in facilitating big data analysis, which will be essential for scientific progress in the foreseeable future.