Argonne National Laboratory

Shock: Active Storage for Multicloud Streaming Data Analysis

Title: Shock: Active Storage for Multicloud Streaming Data Analysis
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Bischof, J, Wilke, A, Gerlach, W, Harrison, T, Paczian, T, Tang, W, Wilkening, J, Desai, N, Meyer, F
Conference Name: 2nd International Symposium on Big Data Computing (BDC)
Date Published: 12/2015
Conference Location: Limassol
Other Numbers: ANL/MCS-P5406-0915
Abstract: Access to data plays a major role in designing and performing efficient data computation and analyses in a distributed environment. Existing approaches access data via a variety of methods and offer various benefits and drawbacks based on the use case. Our original use case was the computational analysis of environmental sequence data, or metagenomics. Unlike other workflows that often reduce the dataset size dramatically within the first few processing steps, metagenomic workflows add to the size of the dataset along the processing pipeline. Thus, wide-area, high-throughput access to the data is essential. To address this problem, we developed Shock, a data store for files, their associated metadata, and indexes that allow Shock to provide different views into the data. Shock comprises three major components: a web service that provides a RESTful API, backend data storage for files, and storage for object metadata. Shock has proven to be a stable data store for MG-RAST, an application that served over 40,000 users in 2014 on a server that houses more than 3 million data objects. Moreover, Shock provides both subselection and high-performance file transfer capabilities that serve most usages.