S. Narayan, J. A. Chandy, S. Lang, P. Carns, and R. Ross, "Uncovering Errors: The Cost of Detecting Silent Data Corruption," Preprint ANL/MCS-P1689-1009, October 2009.
Data integrity is pivotal to the usefulness of any storage system. It ensures that stored data remains free from unintended modification throughout its existence on the storage medium. Hash functions such as cyclic redundancy checks (CRCs) or checksums are frequently used to detect data corruption while data is in transit to permanent storage or at rest there. Without these checks, such data errors usually go undetected and unreported to the system and hence are not communicated to the application. They are referred to as “silent data corruption.” When an application reads corrupted or malformed data, the result is incorrect output or a failed system. Storage arrays in leadership computing facilities comprise several thousand drives, which increases the likelihood of such failures. These environments mandate a file system capable of detecting data corruption. Parallel file systems have traditionally omitted integrity checks because of their high computational cost, particularly in dealing with unaligned data requests from scientific applications. In this paper, we provide a brief assessment of the cost of providing data integrity on a parallel file system. We present an approach that provides this capability with as little as 20% overhead for aligned data requests and a negligible additional cost for unaligned requests.
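As an illustration of why unaligned requests are the costly case, the following minimal sketch keeps one CRC per fixed-size block and verifies every block that a read touches. It is not the approach evaluated in this paper: the block size, the function names block_checksums and verified_read, and the in-memory byte string standing in for a storage device are all assumptions made for the example. A request that starts or ends in the middle of a block still forces the whole enclosing block to be read and checked.

```python
import zlib
from typing import List

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    """Compute one CRC-32 per fixed-size block of data."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def verified_read(data: bytes, checksums: List[int], offset: int,
                  length: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[offset:offset+length] after verifying every block touched.

    An unaligned request covers partial blocks at its edges, but each
    partial block must still be checked in full against its stored CRC.
    """
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    for b in range(first, last + 1):
        block = data[b * block_size:(b + 1) * block_size]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"silent data corruption detected in block {b}")
    return data[offset:offset + length]


if __name__ == "__main__":
    stored = b"abcdefghijkl"
    sums = block_checksums(stored)

    # Unaligned read: bytes 3..6 span blocks 0 and 1, so both are verified.
    print(verified_read(stored, sums, 3, 4))

    # Flip a single bit in block 1 to simulate corruption at rest.
    damaged = bytearray(stored)
    damaged[5] ^= 0x01
    try:
        verified_read(bytes(damaged), sums, 3, 4)
    except IOError as err:
        print(err)
```

In this sketch an aligned request verifies exactly the blocks it returns, whereas the unaligned request above returns four bytes but has to check eight.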