Towards Programming Datacenters

Principal Investigator(s): 
Scott Shenker

Datacenters have redefined the nature of high-end computing, but harnessing their computing power remains a challenging task. Initially, programming frameworks such as MapReduce, Hadoop, Spark, TensorFlow, and Flink provided a way to run large-scale computations. These frameworks took care of the difficult issues of scaling, fault-tolerance, and consistency, freeing developers to focus on the logic of their particular application. However, each of these frameworks was aimed at a specific computational task (e.g., machine learning or data analytics) and is not fully general. In addition, they require the person running the computation to explicitly manage computing resources by choosing how many (and what type of) VMs to use.

More recently, many cloud providers have begun offering serverless computing, which allows customers to write applications composed of lightweight, short-lived functions triggered by events. Serverless offers a general programming model, and resource allocation is handled by the cloud provider, mitigating two of the shortcomings of the earlier frameworks. However, the serverless paradigm has limitations of its own: it does not handle fault-tolerance or consistency, and it performs poorly in certain scenarios, such as workloads where functions must repeatedly ship state through remote storage.
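To make the serverless model concrete, below is a minimal sketch of such a function, written in the style of an AWS Lambda handler in Python; the event payload shape and the word-count task are illustrative assumptions, not part of the project.

```python
import json

def handler(event, context):
    """A lightweight, short-lived function invoked in response to an event.

    The cloud provider decides where and when this runs and how many
    copies execute concurrently; the developer supplies only the logic.
    """
    # 'event' carries the trigger payload -- here, assumed to hold a record
    # of text to process (an illustrative choice, not a fixed format).
    record = event.get("record", {})
    result = {"word_count": len(record.get("text", "").split())}

    # Any state that must outlive this invocation has to be written to
    # external storage; the function itself keeps nothing between calls.
    return {"statusCode": 200, "body": json.dumps(result)}
```

Note that the function is stateless by design: this is what lets the provider handle resource allocation transparently, and also why fault-tolerance and consistency for shared data fall to the application.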

This project aims to rectify these shortcomings through a system called Savanna, which acts as a file system for serverless computations and provides fault-tolerance, consistency, and caching for better performance. A more ambitious goal is programming the datacenter itself: taking programs written for a single machine and transforming them to run in the serverless paradigm. Doing so requires techniques for (i) automatically finding the inherent parallelism in a program and (ii) efficiently checkpointing and recovering its state.
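As a rough illustration of the role such a layer plays, here is a hypothetical sketch of a serverless function reading and writing through a Savanna-like file system; the `savanna` module, its `open` call, and the path names are invented for illustration and do not reflect the system's actual API.

```python
# Hypothetical sketch: the 'savanna' module and its API are invented here
# for illustration; the real Savanna interface may differ.
import savanna

def handler(event, context):
    # Reads go through the Savanna layer rather than raw object storage,
    # so repeated accesses can be served from a local cache.
    with savanna.open(f"inputs/shard-{event['shard']}", "r") as f:
        data = f.read()

    partial = sum(len(line.split()) for line in data.splitlines())

    # Writes are tracked by the layer, which is what would allow it to give
    # concurrent functions a consistent view of the data and to re-execute
    # a failed function without leaving partial output behind.
    with savanna.open(f"outputs/shard-{event['shard']}", "w") as f:
        f.write(str(partial))
```

The design point, under these assumptions, is that fault-tolerance, consistency, and caching live in the storage interface, so individual functions stay as simple as in the plain serverless model.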

Funding provided by NSF