Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

TitleResilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Publication TypeConference Paper
Year of Publication2012
AuthorsZaharia, M., Chowdhury M., Das T., Dave A., Ma J., McCauley M., Franklin M. J., Shenker S., & Stoica I.
Other Numbers3354

We present Resilient Distributed Datasets (RDDs), a distributedmemory abstraction that lets programmers performin-memory computations on large clusters in afault-tolerant manner. RDDs are motivated by two typesof applications that current computing frameworks handleinefficiently: iterative algorithms and interactive datamining tools. In both cases, keeping data in memorycan improve performance by an order of magnitude.To achieve fault tolerance efficiently, RDDs provide arestricted form of shared memory, based on coarsegrainedtransformations rather than fine-grained updatesto shared state. However, we show that RDDs are expressiveenough to capture a wide class of computations, includingrecent specialized programming models for iterativejobs, such as Pregel, and new applications that thesemodels do not capture.We have implemented RDDs in asystem called Spark, which we evaluate through a varietyof user applications and benchmarks.


This research was supported in part by Berkeley AMP Lab sponsors Google, SAP, Amazon Web Services, Cloudera, Huawei, IBM, Intel, Microsoft, NEC, NetApp and VMWare, by DARPA (contract #FA8650-11-C-7136), by a Google PhD Fellowship, and by the Natural Sciences and Engineering Research Council of Canada.

Bibliographic Notes

Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2012), pp. 1-14, San Jose, CA

Abbreviated Authors

M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I. Stoica

ICSI Research Group

Networking and Security

ICSI Publication Type

Article in conference proceedings