The Quest for Scalable Support of Data-Intensive Workloads in Distributed Systems

Title: The Quest for Scalable Support of Data-Intensive Workloads in Distributed Systems
Publication Type: Conference Paper
Year of Publication: 2009
Authors: Raicu, I, Foster, IT, Zhao, Y, Little, P, Moretti, CM, Choudhary, A, Thain, D
Conference Name: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing
Date Published: 03/2009
Conference Location: Garching, Germany
Other Numbers: ANL/MCS-P1596-0409

Data-intensive applications involving the analysis of large datasets often require large amounts of compute and storage resources, for which data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. Depending on workload and resource characteristics, this approach can provide the benefits of dedicated hardware without the associated high costs. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce new scheduling policies with heuristics to optimize real-world performance, and develop a competitive online cache eviction policy. We also report many empirical experiments that explore the benefits of dynamically expanding and contracting resources based on load, improving system responsiveness while keeping wasted resources small. We show performance improvements of one to two orders of magnitude across three diverse workloads when compared to the performance of parallel file systems, with throughputs approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best model for active storage, contrasting the pull model found in data diffusion with the push model found in active storage.
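
The idea of scheduling computations close to cached data can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Node` class, the `schedule` function, the fixed per-node cache capacity, and the round-robin fallback are all hypothetical, and plain LRU eviction stands in for the competitive online eviction policy the abstract mentions. Dynamic acquisition and release of nodes is left out.

```python
from collections import OrderedDict


class Node:
    """A hypothetical compute node with a small LRU file cache."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # max number of cached files (assumed)
        self.cache = OrderedDict()    # file name -> True, least recent first

    def access(self, file):
        """Return True on a cache hit; on a miss, cache the file and evict LRU."""
        if file in self.cache:
            self.cache.move_to_end(file)     # mark as most recently used
            return True
        self.cache[file] = True              # simulate pulling the file in
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # drop the least recently used file
        return False


def schedule(tasks, nodes):
    """Dispatch each task (named by the file it reads) and count cache hits.

    A task goes to a node that already caches its file if one exists;
    otherwise nodes are picked round-robin (an assumed fallback).
    """
    hits = 0
    next_node = 0
    for file in tasks:
        holders = [n for n in nodes if file in n.cache]
        if holders:
            node = holders[0]
        else:
            node = nodes[next_node % len(nodes)]
            next_node += 1
        if node.access(file):
            hits += 1
    return hits
```

For example, with two nodes of capacity 2 and the task sequence `f1, f2, f1, f3, f2, f1`, the repeated requests for `f1` and `f2` are routed to the nodes that already hold them, so three of the six tasks run against cached data.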