|Title||Shock: Active Storage for Multicloud Streaming Data Analysis |
|Publication Type||Conference Paper |
|Year of Publication||2015 |
|Authors||Bischof, J, Wilke, A, Gerlach, W, Harrison, T, Paczian, T, Tang, W, Wilkening, J, Desai, N, Meyer, F |
|Conference Name||2nd International Symposium on Big Data Computing (BDC) |
|Date Published||12/2015 |
|Conference Location||Limassol |
|Other Numbers||ANL/MCS-P5406-0915 |
|Abstract||Access to data plays a major role in designing and performing efficient data computation and analyses in a distributed environment. Existing approaches access data via a variety of methods and offer various benefits and drawbacks based on the use case. Our original use case was the computational analysis of environmental sequence data, or metagenomics. Unlike other workflows that often reduce the dataset size dramatically within the first few processing steps, metagenomic workflows add to the size of the dataset along the processing pipeline. Thus, wide-area, high-throughput access to the data is essential.
To address this problem, we developed Shock, a data store for files, their associated metadata, and indexes that allow Shock to provide different views into the data. Shock comprises three major components: a web service that provides a RESTful API, backend data storage for files, and storage for object metadata. Shock has proven to be a stable data store for MG-RAST, an application that served over 40,000 users in 2014 on a server that houses more than 3 million data objects. Moreover, Shock provides both subselection and high-performance file transfer capabilities that serve most use cases.