Argonne National Laboratory

Providing Efficient I/O Redundancy in MPI Environments

Title: Providing Efficient I/O Redundancy in MPI Environments
Publication Type: Report
Year of Publication: 2004
Authors: Gropp, WD, Ross, RB, Miller, N
Date Published: 06/2004
Other Numbers: ANL/MCS-P1178-0604

Highly parallel applications often make use of either highly parallel file systems or large numbers of independent disks. Either of these approaches can provide the high data rates necessary for parallel applications. However, the failure of a single disk or server can render the data useless. Conventional techniques, such as those based on applying erasure-correcting codes to each file write, are prohibitively expensive for massively parallel scientific applications because of the granularity of access at which the codes are applied. In this paper we demonstrate a scalable method for recovering from single disk failures that is optimized for typical scientific data sets. This approach exploits coarser-grain (but precise) semantics to reduce the overhead of constructing recovery data and makes use of parallel computation (proportional to the data size and independent of the number of processors) to construct this data. Experiments showing the efficiency of this approach on a cluster with independent disks are presented, and a technique for hiding the creation of redundant data within the MPI-IO implementation is described.
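
The abstract does not say which code is used to build the recovery data. As a minimal sketch of single-failure recovery in general, the following Python example assumes simple XOR parity over equal-sized blocks, one block per disk. The function names (`xor_blocks`, `make_parity`, `recover`) and the sample data are hypothetical and are not taken from the report; the report's own method, its coarse-grain semantics, and its MPI-IO integration are not reproduced here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Build one parity block covering all data blocks (assumed XOR parity)."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])


# Hypothetical data: four independent "disks", each holding one 8-byte block.
disks = [b"rank0dat", b"rank1dat", b"rank2dat", b"rank3dat"]
parity = make_parity(disks)

# Simulate the loss of one disk and rebuild its block.
lost = 2
survivors = [blk for i, blk in enumerate(disks) if i != lost]
rebuilt = recover(survivors, parity)
```

In this sketch the parity work grows with the number of bytes stored, and the parity block costs the same space as one data block. In an MPI program the same byte-wise XOR could plausibly be expressed as a reduction with the `MPI_BXOR` operation, though whether the report does so is not stated in the abstract.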