Shark: SQL and Rich Analytics at Scale

Title: Shark: SQL and Rich Analytics at Scale
Publication Type: Technical Report
Year of Publication: 2012
Authors: Xin, R., Rosen, J., Zaharia, M., Franklin, M. J., Shenker, S., & Stoica, I.
Other Numbers: 3422

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100x faster than Apache Hive, and machine learning programs up to 100x faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.


We thank Cliff Engle, Harvey Feng, Shivaram Venkataraman, Ram Sriharsha, Denny Britz, Antonio Lupher, Patrick Wendell, and Paul Ruan for their work on Shark. This research is supported in part by NSF CISE Expeditions award CCF-1139158, gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Splunk, VMware, and by DARPA (contract #FA8650-11-C-7136).

Bibliographic Notes

Technical Report, UCB/EECS-2012-214, University of California at Berkeley, Department of Electrical Engineering and Computer Science, arXiv:1211.6176 [cs.DB]

Abbreviated Authors

R. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica

ICSI Research Group

Networking and Security

ICSI Publication Type

Technical Report