Scalable Statistics and Machine Learning for Data-Centric Science

Principal Investigator(s): 
Michael Mahoney

Researchers from Lawrence Berkeley Laboratory (LBL), UC Berkeley, and ICSI are developing and applying new statistical and machine learning algorithms that can operate on real-world datasets produced by a diverse range of experimental and observational facilities. This capability is critical for big data analysis, which is becoming essential to scientific progress.

Scientists are working on randomized and approximation algorithms for dimensionality reduction and clustering. They are developing stochastic optimization techniques for large-scale inference and extending deep learning algorithms to work on scientific spatio-temporal datasets. In addition, they are developing scalable graph algorithms that work directly on the input dataset without resorting to expensive computation of all-pairs similarities. These methods are being applied to a diverse range of analysis problems in cosmology, climate, bio-imaging, genomics, neuroscience, particle physics, and other domain sciences. The project aims to implement these algorithms in scalable codes capable of processing terabyte-scale datasets on petascale platforms. The interdisciplinary team of experts from LBL, ICSI, and UC Berkeley in statistics, machine learning, graph analytics, high-performance computing, data management, and domain sciences is well positioned to tackle these challenges.
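
As a rough illustration of what a randomized algorithm for dimensionality reduction can look like, the sketch below applies a Gaussian random projection, a standard textbook technique that approximately preserves pairwise distances. It is not taken from the project's codes; the function names, parameters, and data sizes are hypothetical and chosen only for illustration.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Map each row of `points` from d dimensions down to `target_dim`
    using a random Gaussian matrix (illustrative sketch, not project code)."""
    rng = random.Random(seed)
    d = len(points[0])
    # Scaling by 1/sqrt(target_dim) keeps squared distances unbiased in expectation.
    scale = 1.0 / math.sqrt(target_dim)
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
            for _ in range(d)]
    return [[sum(x[i] * proj[i][j] for i in range(d))
             for j in range(target_dim)]
            for x in points]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


if __name__ == "__main__":
    # Hypothetical toy data: 10 points in 500 dimensions, reduced to 100.
    data_rng = random.Random(1)
    data = [[data_rng.uniform(-1.0, 1.0) for _ in range(500)] for _ in range(10)]
    reduced = random_projection(data, target_dim=100)
    ratio = euclidean(reduced[0], reduced[1]) / euclidean(data[0], data[1])
    print("distance ratio after projection:", ratio)
```

The projection never inspects the data, so it can be applied to rows independently, which is one reason techniques of this kind are attractive at large scale.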

Funding provided by Lawrence Berkeley Laboratory.