Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Many important “big data” applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional programming API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. We have prototyped D-Streams in an extension to the Spark cluster computing framework called Spark Streaming, which lets users seamlessly intermix streaming, batch and interactive queries.
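To make the idea of a discretized stream concrete, the following is a minimal single-process sketch in plain Python. It is not Spark Streaming code and not the prototype's implementation. It assumes that the stream is cut into fixed time intervals, that each interval is processed by a deterministic function, and that lost state can therefore be rebuilt by recomputing from retained input. The names `discretize`, `count_words`, and `running_counts` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def discretize(records, interval):
    """Group (timestamp, word) records into batches keyed by interval index."""
    batches = defaultdict(list)
    for timestamp, word in records:
        batches[int(timestamp // interval)].append(word)
    return dict(batches)


def count_words(batch):
    """Deterministic per-batch transformation: map each word to its count."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


def running_counts(batches, upto):
    """Fold the per-interval counts for intervals 0..upto into one total."""
    total = {}
    for i in range(upto + 1):
        for word, n in count_words(batches.get(i, [])).items():
            total[word] = total.get(word, 0) + n
    return total


# Hypothetical input: (arrival time in seconds, word) pairs.
records = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (2.5, "b")]
batches = discretize(records, interval=1.0)

state = running_counts(batches, upto=2)

# Simulate losing the state: because every step is deterministic,
# recomputing from the retained input batches yields the same result.
recovered = running_counts(batches, upto=2)
```

In this sketch each call to `count_words` depends only on its own batch, so the per-interval work needed during recovery could be spread across several workers. That independence is what the sketch is meant to illustrate about parallel recovery; the scheduling itself is not modeled here.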


