Parallel Data Layout Optimization of Scientific Data through Access-driven Replication
|Title||Parallel Data Layout Optimization of Scientific Data through Access-driven Replication|
|Year of Publication||2014|
|Authors||Jenkins, J, Zou, X, Tang, H, Kimpe, D, Ross, RB, Samatova, NF|
Efficient I/O on large-scale spatio-temporal scientific data requires scrutiny of both the logical layout of the data (e.g., row-major vs. column-major) and the physical layout (e.g., distribution on parallel filesystems). For increasingly complex datasets, hand optimization is a difficult matter prone to error and not scalable to the increasing heterogeneity of analysis workloads. Given these factors, we present a partial data replication system called RADAR. We capture datatype- and collective-aware I/O access patterns (indicating logical access) via MPI-IO tracing and use a combination of coarse-grained and fine-grained performance modeling to evaluate and select optimized physical data distributions for the task at hand. Compared with existing methods, we store all replica data and metadata, along with the original, untouched data, under a single file container using the object abstraction in parallel filesystems. Our system can produce up to many-fold improvements in commonly used subvolume decomposition access patterns, while the modeling approach is capable of determining whether such optimizations should be undertaken in the first place.
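To illustrate why the logical layout matters for subvolume access, the following minimal sketch counts how many contiguous extents a subvolume read touches under row-major versus column-major storage. It is not RADAR's performance model: the function name `contiguous_runs` and the example dimensions are assumptions made here for illustration only.

```python
from math import prod


def contiguous_runs(shape, size, order="C"):
    """Return (number_of_runs, elements_per_run) for reading a subvolume.

    shape: full array extent per dimension
    size:  subvolume extent per dimension
    order: "C" for row-major (last dimension varies fastest),
           "F" for column-major (first dimension varies fastest)
    Hypothetical helper, not part of any library.
    """
    if len(shape) != len(size):
        raise ValueError("shape and size must have the same rank")
    if any(s < 1 or s > n for s, n in zip(size, shape)):
        raise ValueError("subvolume must fit inside the array")
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")

    dims = range(len(shape))
    fastest_first = reversed(dims) if order == "C" else dims

    # Walk from the fastest-varying dimension outward. A run keeps growing
    # only while the subvolume spans the full extent of each dimension.
    run_len = 1
    for d in fastest_first:
        run_len *= size[d]
        if size[d] != shape[d]:
            break
    return prod(size) // run_len, run_len


# Assumed example: a single slab taken from a 512^3 volume.
shape = (512, 512, 512)
slab = (512, 512, 1)
print(contiguous_runs(shape, slab, "C"))  # (262144, 1)
print(contiguous_runs(shape, slab, "F"))  # (1, 262144)
```

Under these assumptions the same slab is one contiguous extent in one layout and 262144 single-element extents in the other, which is the kind of mismatch that a replica stored in an access-matched layout is meant to remove.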