The following 7 papers were accepted for presentation at the 2013 IASDS Workshop:
- Yusuke Tanimura, Rosa Filgueira, Isao Kojima and Malcolm Atkinson. Two-Phase Collective I/O based on Advanced Reservations to Obtain Performance Guarantees from Shared Storage Systems
- Saeideh Alinezhad and Seyed Ghassem Miremadi. Reliability Enhancement of SSD-based Storage Systems
- Navid Golpayegani, Milton Halem, Edward J. Masuoka, Neal K. Devine and Gang Ye. LVFS: A Scalable Big Data Scientific Storage System
- Gideon Nimako, Ekow Otoo and Daniel Ohene-Kwofie. PEXTA: A Parallel Chunked Extendible Dense Array I/O for Global Array (GA)
- Seung Woo Son. Dynamic File Striping and Data Layout Transformation on Parallel System with Fluctuating I/O Workload

  Abstract: As the number of compute cores on modern parallel machines grows beyond hundreds of thousands, scalable and consistent I/O performance is becoming hard to obtain due to fluctuating file system performance. This fluctuation is often caused by RAID rebuilds after hardware failures or by concurrent jobs competing for I/O. We present a mechanism that stripes data across a dynamically selected subset of I/O servers with the lightest workload to achieve the best I/O bandwidth available from the system. We implement this mechanism in an I/O software layer that enables memory-to-file data layout transformation and allows transparent file partitioning. File partitioning is a technique that divides data among a set of files and manages file access, making the data appear as a single file to users. Experimental results on NERSC's Hopper indicate that our approach effectively isolates I/O variation on shared systems and significantly improves overall I/O performance.
- Babak Behzad, Huong Luu and Marianne Winslett. A Multi-level Approach For Understanding HPC Applications’ I/O Activities
- Cengiz Karakoyunlu, Dries Kimpe, Philip Carns, Kevin Harms, Robert Ross and Lee Ward. Towards a Unified Object Storage Foundation for Scalable Storage Systems
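The dynamic striping mechanism summarized in Seung Woo Son's abstract above boils down to two steps: pick the least-loaded subset of I/O servers, then assign file blocks to them round-robin. The following is a minimal sketch of that idea; the server names, load values, and function names are hypothetical, and the paper's actual implementation lives inside an MPI-IO software layer.

```python
# Load-aware striping sketch: choose the k least-loaded I/O servers,
# then map file blocks onto them round-robin.

def select_stripe_targets(server_loads, k):
    """Return the k servers with the lightest current workload."""
    return sorted(server_loads, key=server_loads.get)[:k]

def stripe_blocks(num_blocks, targets):
    """Assign each block index to a server round-robin across targets."""
    return {b: targets[b % len(targets)] for b in range(num_blocks)}

# Hypothetical per-server load metrics (e.g. normalized queue depth).
loads = {"ost0": 0.9, "ost1": 0.2, "ost2": 0.7, "ost3": 0.1}
targets = select_stripe_targets(loads, k=2)   # ["ost3", "ost1"]
layout = stripe_blocks(4, targets)            # {0: "ost3", 1: "ost1", 2: "ost3", 3: "ost1"}
```

Striping over only the lightest-loaded servers trades aggregate parallelism for predictability: slow or rebuilding servers are simply left out of the stripe.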
Abstract: As more data-intensive computing applications are executed on high-performance computing clusters, resource contention on the shared storage system attached to the clusters becomes significant. This contention can degrade I/O performance and negate the gains of coordinated parallel I/O, such as the optimizations provided by MPI-IO implementations. To address this problem, an advanced-reservation approach has been proposed in which storage resources are managed through reservations that satisfy I/O performance requirements. In this paper, we apply the concept of reserved data access to MPI-IO, in particular to Two-Phase collective I/O, which is primarily used to aggregate non-contiguous accesses by MPI applications. We developed a prototype using Dynamic-CoMPI, which further improves Two-Phase I/O with a locality-aware strategy, and Papio, a parallel storage system that provides performance-reservation functionality. After describing the design and implementation of our prototype, we demonstrate the benefit of the concept by comparing our implementation with existing MPI-IO implementations backed by OrangeFS and Lustre. The evaluation confirms that our approach preserves the optimization benefit of Two-Phase I/O under resource contention.
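The Two-Phase collective I/O scheme discussed above depends on coalescing the non-contiguous requests of many processes into large contiguous accesses that aggregator processes perform. A minimal sketch of that coalescing step follows; the offsets are illustrative and not taken from the paper.

```python
# Coalesce interleaved (offset, length) requests into contiguous file
# ranges, as in the aggregation phase of Two-Phase collective I/O.

def merge_ranges(requests):
    """Merge sorted byte ranges that touch or overlap."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# Interleaved requests from two processes collapse into one large range,
# which a single aggregator can service with one contiguous operation.
reqs = [(0, 4), (8, 4), (4, 4), (12, 4)]
merge_ranges(reqs)  # [(0, 16)]
```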
Abstract: Erasure codes increase the reliability of storage systems by protecting them against disk failures. Recovery from a disk failure, however, consumes Program/Erase (P/E) cycles, which can accelerate Solid State Drive (SSD) wear-out. The wear-out problem (endurance limit) is one of the main reliability concerns for SSDs: as the number of P/E cycles grows, SSDs reach their endurance limit, become unreliable for storing data, and grow more vulnerable to soft errors. Consequently, SSD reliability decreases as P/E cycles accumulate. Designing erasure codes that reduce P/E cycles therefore enhances the reliability of SSD-based storage systems. This paper introduces an erasure code called Endurance-Aware EVENODD (EA-EO), which modifies the EVENODD code to reduce P/E cycles. This reduction postpones wear-out, decreases the probability of soft errors, improves SSD endurance, and ultimately enhances reliability. The basic idea of EA-EO is to minimize the dependency between data and parities in the coding pattern so as to decrease the number of P/E cycles. Simulation results indicate a 13% improvement in endurance and a 50% reduction in XOR operations for EA-EO compared with EVENODD.
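The data-parity dependency that EA-EO minimizes is easiest to see with plain single-parity XOR, the building block of EVENODD. The sketch below is generic XOR parity, not the EA-EO construction itself; it shows both recovery from a lost block and why a small write to one data block dirties the parity block too, costing an extra P/E cycle.

```python
# Bytewise XOR parity: encode, recover a lost block, and observe the
# write amplification that endurance-aware codes try to reduce.

def xor_parity(blocks):
    """Return the bytewise XOR of equal-length data blocks."""
    parity = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            parity[i] ^= byte
    return bytes(parity)

data = [b"\x01\x02", b"\x04\x08"]
parity = xor_parity(data)                 # b"\x05\x0a"

# Recover a lost block from the surviving block plus parity.
recovered = xor_parity([data[1], parity])
assert recovered == data[0]

# A small write to data[0] forces a parity rewrite as well: two block
# writes (two P/E cycles) for one logical update. EA-EO reorganizes the
# coding pattern so fewer parity blocks depend on each data block.
data[0] = b"\x03\x02"
parity = xor_parity(data)                 # parity must be rewritten too
```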
Note: This paper has been retracted due to travel restrictions.
Abstract: LVFS is a virtual, scalable file storage system developed in response to a class of scientific data systems that continue to collect petabytes of data over time, eventually degrading the response time of user request services. The system has been operational in a real use case, the NASA MODIS Adaptive Processing System (MODAPS), where it has been shown to double data throughput thanks to better performance and easier load balancing. The operational life of MODAPS now extends over a decade, and the system contains over four petabytes of data in billions of files on over 500 different disks attached to multiple storage nodes. MODAPS is the processing system for delivering calibrated Level 1 data from the MODIS instruments on two NASA satellites, each carrying a 36-channel multi-spectral visible and infrared instrument launched over a decade ago. These systems' life-cycle operations are typical of many scientific instruments and experiments that continue to generate useful archival data well beyond their original expected lifetimes to meet current scientific user needs. The Level 1 Atmosphere Archive and Distribution System (LAADS) is responsible for distributing the products produced by MODAPS. The LAADS Virtual File System (LVFS) has now replaced parts of LAADS and is responsible for the read-only distribution of all LAADS data to the public. In this paper, we describe the unique design of LVFS and our ongoing work to incorporate a distributed-hash-based architecture into the LVFS design, transforming LVFS into a full scientific storage architecture scalable to exabyte sizes.
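A distributed-hash-based architecture of the kind the LVFS authors propose typically maps each file to a storage node without consulting a central index. The sketch below uses consistent hashing as one plausible realization; the node names, virtual-point count, and class are illustrative assumptions, not LVFS's actual placement scheme.

```python
import bisect
import hashlib

# Consistent-hash ring: each node is hashed to many points on a ring,
# and a file lands on the first node point at or after its path's hash.
# Adding or removing a node remaps only a small fraction of files.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{n}:{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, path):
        """Locate the first ring point at or after the path's hash."""
        i = bisect.bisect(self.keys, self._h(path)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
node = ring.node_for("/modis/MOD021KM.A2013001.hdf")  # deterministic placement
```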
Abstract: Over the past decade, I/O has been a limiting factor for extreme-scale parallel computing, even as the amount of data produced by parallel scientific applications has grown substantially. Datasets usually grow incrementally to massive sizes on the order of terabytes and petabytes. The storage of such datasets, typically modelled as multidimensional arrays, therefore requires efficient dynamic storage schemes in which the array may arbitrarily extend the bounds of its dimensions. This paper introduces PEXTA, a new parallel I/O model for the Global Array Toolkit. PEXTA not only provides the APIs needed for explicit transfer between the memory-resident global array and its secondary-storage counterpart, but also allows the persistent array to be extended along any dimension without compromising the access time of an element or sub-array. Such a feature currently exists in the Hierarchical Data Format version 5 (HDF5) and parallel HDF5; however, extending the bound of a dimension in an HDF5 array file can be unusually expensive in time. In our storage scheme for parallel dense array files, extensions can be performed while elements of the array are still accessed much faster than with parallel HDF5. We illustrate the PEXTA APIs with three applications: an out-of-core matrix-matrix multiplication, a lattice Boltzmann simulation, and molecular dynamics of a Lennard-Jones system.
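The key property of chunked extendible storage, as described above, is that growing the array appends new chunks without rewriting data already stored. A toy one-dimensional sketch of that property follows; the chunk size and the dict-backed store are illustrative and not PEXTA's actual on-disk layout.

```python
# Chunked extendible array: appending touches only the last chunk, so
# extending the array never rewrites previously stored chunks.

CHUNK = 4  # illustrative chunk size (elements per chunk)

class ExtendibleArray:
    def __init__(self):
        self.store = {}      # chunk index -> list of element values
        self.length = 0

    def append(self, value):
        """Extend the array by one element, touching only the last chunk."""
        c, off = divmod(self.length, CHUNK)
        self.store.setdefault(c, [None] * CHUNK)[off] = value
        self.length += 1

    def __getitem__(self, i):
        c, off = divmod(i, CHUNK)
        return self.store[c][off]

arr = ExtendibleArray()
for v in range(10):
    arr.append(v)
# 10 elements occupy two full chunks plus one partial chunk.
```

The same idea generalizes to multiple dimensions by chunking each axis, which is how extension along any dimension can avoid the wholesale rewrites that make some array-file extensions expensive.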
Abstract: I/O has become one of the determining factors for HPC application performance. Understanding an application's I/O activities requires a specialized tool that provides a multi-level view of the I/O function flow. We have developed a tracing framework called Recorder, which captures I/O function calls at all layers of the parallel I/O stack. It is completely transparent to the user and requires no alteration of the source code. In this paper, we show its effectiveness in understanding and diagnosing the I/O performance problems of two I/O benchmarks running on a leading-edge HPC platform. Recorder reveals the differences between runs using unoptimized and optimized settings, and between I/O modes such as independent and collective. We believe that a multi-level I/O tracing framework will be beneficial to both end users and library developers in understanding the I/O activities of applications and I/O library implementations.
Abstract: Distributed object-based storage models are an increasingly popular alternative to traditional block-based or file-based storage abstractions in large-scale storage systems. Object-based storage models store and access data in discrete, byte-addressable containers to simplify data management and cleanly decouple storage systems from underlying hardware resources. Although many large-scale storage systems share common goals of performance, scalability and fault tolerance, their underlying object storage models are typically tailored to specific use cases and semantics, making it difficult to reuse them in other environments and leading to unnecessary fragmentation of data center storage facilities. In this paper, we investigate a number of popular data models used in cloud storage, big data and High-Performance Computing (HPC) storage and describe the unique features that distinguish them. We then describe three representative use cases: a POSIX file system name space, a column-oriented key/value database and an HPC application checkpoint, and investigate the storage functionality they require. Next, we describe our proposed data model and show how our approach provides a unified solution for the previously described use cases.
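The byte-addressable container abstraction described above can be sketched as a store addressed by (object id, offset, length), with no file-system hierarchy involved. The in-memory dict stands in for real storage hardware, and the class and method names are illustrative, not the paper's proposed API.

```python
# Minimal object-store sketch: flat objects addressed by id, with
# sparse byte-range writes and reads, as in an HPC checkpoint use case.

class ObjectStore:
    def __init__(self):
        self.objects = {}    # object id -> bytes

    def write(self, oid, offset, data):
        """Write a byte range, zero-extending the object if needed."""
        buf = bytearray(self.objects.get(oid, b""))
        if len(buf) < offset + len(data):
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        self.objects[oid] = bytes(buf)

    def read(self, oid, offset, length):
        return self.objects[oid][offset:offset + length]

store = ObjectStore()
store.write("ckpt-0001", 0, b"rank0-state")
store.write("ckpt-0001", 64, b"rank1-state")   # non-contiguous write
store.read("ckpt-0001", 64, 11)                # b"rank1-state"
```

On top of this one primitive, the three use cases in the paper map naturally: a name space stores directory entries in objects, a key/value database stores columns, and a checkpoint writes each rank's state at its own offset.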