[1] Rob Latham, Chris Daley, Wei-keng Liao, Kui Gao, Rob Ross, Anshu Dubey, and Alok Choudhary. A case study for scientific I/O: improving the FLASH astrophysics code. Computational Science & Discovery, 5(1):015001, 2012. [ bib | http ]
The FLASH code is a computational science tool for simulating and studying thermonuclear reactions. The program periodically outputs large checkpoint files (to resume a calculation from a particular point in time) and smaller plot files (for visualization and analysis). Initial experiments on BlueGene/P spent excessive time in input/output (I/O), making it difficult to do actual science. Our investigation of time spent in I/O revealed several locations in the I/O software stack where we could make improvements. Fixing data corruption in the MPI-IO library allowed us to use collective I/O, yielding an order of magnitude improvement. Restructuring the data layout provided a more efficient I/O access pattern and yielded another doubling of performance, but broke format assumptions made by other tools in the application workflow. Using new nonblocking APIs in the Parallel-NetCDF library allowed us to keep high performance and maintain backward compatibility. The I/O research community has studied a host of optimizations and strategies. Sometimes the challenge for applications is knowing how to apply these new techniques to production codes. In this case study, we offer a demonstration of how computational scientists, with a detailed understanding of their application, and the I/O community, with a wide array of approaches from which to choose, can magnify each other's efforts and achieve tremendous application productivity gains.

[2] Wesley Kendall, Jian Huang, Tom Peterka, Robert Latham, and Robert Ross. Toward a general I/O layer for parallel-visualization applications. Computer Graphics and Applications, IEEE, 31(6):6-10, November-December 2011. [ bib | DOI ]
For large-scale visualization applications, the visualization community urgently needs general solutions for efficient parallel I/O. These parallel visualization solutions should center around design patterns and the related data-partitioning strategies, not file formats. From this respect, it's feasible to greatly alleviate I/O burdens without reinventing the wheel. For example, BIL (Block I/O Layer), which implements such a pattern, has greatly accelerated I/O performance for large-scale parallel particle tracing, a pervasive but challenging use case.

[3] Sriram Lakshminarasimhan, Neil Shah, Stephane Ethier, Scott Klasky, Rob Latham, Rob Ross, and Nagiza Samatova. Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data. In Emmanuel Jeannot, Raymond Namyst, and Jean Roman, editors, Euro-Par 2011 Parallel Processing, volume 6852 of Lecture Notes in Computer Science, pages 366-379. Springer Berlin / Heidelberg, September 2011. 10.1007/978-3-642-23400-2_34. [ bib | http ]
Modern large-scale scientific simulations running on HPC systems generate data in the order of terabytes during a single run. To lessen the I/O load during a simulation run, scientists are forced to capture data infrequently, thereby making data collection an inherently lossy process. Yet, lossless compression techniques are hardly suitable for scientific data due to its inherently random nature; for the applications used here, they offer less than 10% compression rate. They also impose significant overhead during decompression, making them unsuitable for data analysis and visualization that require repeated data access. To address this problem, we propose an effective method for In-situ Sort-And-B-spline Error-bounded Lossy Abatement (ISABELA) of scientific data that is widely regarded as effectively incompressible. With ISABELA, we apply a preconditioner to seemingly random and noisy data along spatial resolution to achieve an accurate fitting model that guarantees a ≥ 0.99 correlation with the original data. We further take advantage of temporal patterns in scientific data to compress data by ≈ 85%, while introducing only a negligible overhead on simulations in terms of runtime. ISABELA significantly outperforms existing lossy compression methods, such as Wavelet compression. Moreover, besides being a communication-free and scalable compression technique, ISABELA is an inherently local decompression method, namely it does not decode the entire data, making it attractive for random access.
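The sort-then-fit idea in this abstract can be illustrated with a minimal sketch. It is not the ISABELA implementation: piecewise-linear interpolation between evenly spaced knots stands in for the B-spline fit, the error-bounding step is omitted, and the names `compress_window`, `decompress_window`, and `pearson` are hypothetical. The window size, knot count, and Gaussian test data are likewise assumptions chosen for illustration.

```python
import random


def compress_window(values, num_knots):
    """Sort a window; keep the sort permutation plus a few knot values."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)  # permutation index
    sorted_vals = [values[i] for i in order]          # monotone curve
    # Evenly spaced sample positions along the sorted curve.
    knot_pos = [round(k * (n - 1) / (num_knots - 1)) for k in range(num_knots)]
    knots = [sorted_vals[p] for p in knot_pos]
    return order, knot_pos, knots


def decompress_window(order, knot_pos, knots):
    """Rebuild an approximation of the window from knots and permutation."""
    n = len(order)
    approx_sorted = [0.0] * n
    # Linear interpolation between neighbouring knots (B-spline stand-in).
    for k in range(len(knot_pos) - 1):
        p0, p1 = knot_pos[k], knot_pos[k + 1]
        v0, v1 = knots[k], knots[k + 1]
        for p in range(p0, p1 + 1):
            t = (p - p0) / (p1 - p0)
            approx_sorted[p] = v0 + t * (v1 - v0)
    # Undo the sort so values return to their original positions.
    restored = [0.0] * n
    for rank, original_index in enumerate(order):
        restored[original_index] = approx_sorted[rank]
    return restored


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)                      # fixed seed for repeatability
window = [rng.gauss(0.0, 1.0) for _ in range(1024)]
order, knot_pos, knots = compress_window(window, num_knots=33)
restored = decompress_window(order, knot_pos, knots)
corr = pearson(window, restored)
print(f"correlation with original: {corr:.4f}")
```

Sorting turns noisy data into a smooth monotone curve that a low-order fit can follow closely; the price is that the permutation must be stored alongside the knots.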

[4] Florin Isaila, Javier Garcia Blas, Jesus Carretero, Robert Latham, and Robert Ross. Design and evaluation of multiple-level data staging for blue gene systems. Parallel and Distributed Systems, IEEE Transactions on, 22(6):946-959, June 2011. [ bib | DOI | .pdf ]
Parallel applications currently suffer from a significant imbalance between computational power and available I/O bandwidth. Additionally, the hierarchical organization of current Petascale systems contributes to an increase of the I/O subsystem latency. In these hierarchies, file access involves pipelining data through several networks with incremental latencies and higher probability of congestion. Future Exascale systems are likely to share this trait. This paper presents a scalable parallel I/O software system designed to transparently hide the latency of file system accesses to applications on these platforms. Our solution takes advantage of the hierarchy of networks involved in file accesses, to maximize the degree of overlap between computation, file I/O-related communication, and file system access. We describe and evaluate a two-level hierarchy for Blue Gene systems consisting of client-side and I/O node-side caching. Our file cache management modules coordinate the data staging between application and storage through the Blue Gene networks. The experimental results demonstrate that our architecture achieves significant performance improvements through a high degree of overlap between computation, communication, and file I/O.

[5] Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, and Robert Ross. Understanding and improving computational science storage access through continuous characterization. Mass Storage Systems and Technologies, IEEE / NASA Goddard Conference on, 0:1-14, May 2011. [ bib | DOI ]
[6] Seung Woo Son, Samuel Lang, Robert Latham, Robert Ross, and Rajeev Thakur. Reliable MPI-IO through Layout-Aware Replication. In Proc. of the 7th IEEE International Workshop on Storage Network Architecture and Parallel I/O (SNAPI 2011), May 2011. [ bib ]
[7] Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, and Robert Ross. Understanding and Improving Computational Science Storage Access through Continuous Characterization. Trans. Storage, 7(3), 2011. [ bib ]
[8] Samuel Lang, Philip Carns, Robert Latham, Robert Ross, Kevin Harms, and William Allcock. I/O performance challenges at leadership scale. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 40:1-40:12, New York, NY, USA, November 2009. ACM. [ bib | DOI | http ]
[9] Wes Kendall, M. Glatter, J. Huang, Tom Peterka, Rob Latham, and Robert B. Ross. Terascale data organization for discovering multivariate climatic trends. In Proceedings of SC2009: High Performance Networking and Computing, November 2009. [ bib ]
[10] Tom Peterka, Hongfeng Yu, Robert Ross, Kwan-Liu Ma, and Rob Latham. End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P. In International Conference on Parallel Processing (ICPP 09), Vienna, Austria, September 2009. [ bib | .pdf ]
In addition to their crucial role as simulation engines, modern supercomputers can be harnessed for scientific visualization. Their massive parallelism, high-performance storage, and low-latency high-bandwidth interconnects can mitigate the expanding size and complexity of scientific datasets and prepare for in situ visualization of these data. In prior research, we tested parallel volume rendering on the IBM Blue Gene/P (BG/P) at Argonne National Laboratory, work that this paper extends to the largest-scale visualization system to date. We measure performance of disk I/O, rendering, and compositing on larger data and images and evaluate bottleneck locations with respect to the volume rendering algorithm, BG/P-specific architecture, and parallel file system. The results, with core counts to 32K, data sizes to 4480^3 elements, and image sizes to 4096^2 pixels, affirm that a distributed-memory high-performance computing architecture such as BG/P is a scalable platform for large visualization problems. To allay compositing bottlenecks at large system scale, we limit the number of compositing cores when many small messages are exchanged. This approach extends the performance and scalability of direct-send compositing. After improving compositing, I/O is the main bottleneck, so we study I/O performance in detail, including collective reading of multivariate netCDF files directly through the visualization. To put the algorithm's bottlenecks into context, we compare the I/O and compositing performance to benchmarks of similar access patterns.

[11] Robert Ross, Robert Latham, William Gropp, Ewing Lusk, and Rajeev Thakur. Processing MPI datatypes outside MPI. Lecture Notes in Computer Science, pages 42-53, September 2009. [ bib | DOI | http | .pdf ]
The MPI datatype functionality provides a powerful tool for describing structured memory and file regions in parallel applications, enabling noncontiguous data to be operated on by MPI communication and I/O routines. However, no facilities are provided by the MPI standard to allow users to efficiently manipulate MPI datatypes in their own codes.

We present MPITypes, an open source, portable library that enables the construction of efficient MPI datatype processing routines outside the MPI implementation. MPITypes enables programmers who are not MPI implementors to create efficient datatype processing routines. We show the use of MPITypes in three examples: copying data between user buffers and a “pack” buffer, encoding of data in a portable format, and transpacking. Our experimental evaluation shows that the implementation achieves rates comparable to existing MPI implementations.
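As an illustration of the "pack" use case, the sketch below flattens a strided layout, analogous to what `MPI_Type_vector` describes, into (offset, length) segments and copies them into a contiguous buffer. It is plain Python and uses neither MPI nor MPITypes; `vector_segments`, `pack`, and `unpack` are hypothetical names, and offsets are counted in bytes rather than in units of an element type.

```python
def vector_segments(count, blocklength, stride):
    """Yield (offset, length) for `count` blocks of `blocklength` bytes
    whose starting offsets are `stride` bytes apart."""
    for i in range(count):
        yield i * stride, blocklength


def pack(user_buf, segments):
    """Copy noncontiguous segments of user_buf into one contiguous buffer."""
    packed = bytearray()
    for offset, length in segments:
        packed += user_buf[offset:offset + length]
    return bytes(packed)


def unpack(packed, segments, user_buf):
    """Scatter a contiguous buffer back into noncontiguous segments."""
    pos = 0
    for offset, length in segments:
        user_buf[offset:offset + length] = packed[pos:pos + length]
        pos += length


# A 4x4 row-major matrix of bytes 0..15. Column 1 is four 1-byte blocks,
# 4 bytes apart, starting at byte offset 1.
matrix = bytearray(range(16))
column = [(offset + 1, length) for offset, length in vector_segments(4, 1, 4)]

packed = pack(matrix, column)
restored = bytearray(16)
unpack(packed, column, restored)
print(list(packed))
```

Separating the layout description (the segment list) from the copy loops is what lets the same description drive packing, unpacking, or any other per-segment processing.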

[12] Javier García Blas, Florin Isaila, Jesús Carretero, Robert Latham, and Robert Ross. Multiple-level MPI file write-back and prefetching for Blue Gene systems. In Proc. of the 16th European PVM/MPI User's Group Meeting (Euro PVM/MPI 2009), September 2009. [ bib ]
[13] Kui Gao, Wei-keng Liao, Arifa Nisar, Alok Choudhary, Robert Ross, and Robert Latham. Using Subfiling to Improve Programming Flexibility and Performance of Parallel Shared-file I/O. In Proceedings of the International Conference on Parallel Processing, Vienna, Austria, September 2009. [ bib ]
[14] Kui Gao, Wei-keng Liao, Alok Choudhary, Robert Ross, and Robert Latham. Combining I/O Operations for Multiple Array Variables in Parallel NetCDF. In Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, held in conjunction with the IEEE Cluster Conference, New Orleans, Louisiana, September 2009. [ bib ]
[15] Philip Carns, Robert Latham, Robert Ross, Kamil Iskra, Samuel Lang, and Katherine Riley. 24/7 Characterization of Petascale I/O Workloads. In Proceedings of 2009 Workshop on Interfaces and Architectures for Scientific Data Storage, September 2009. [ bib ]
Developing and tuning computational science applications to run on extreme scale systems are increasingly complicated processes. Challenges such as managing memory access and tuning message-passing behavior are made easier by tools designed specifically to aid in these processes. Tools that can help users better understand the behavior of their application with respect to I/O have not yet reached the level of utility necessary to play a central role in application development and tuning. This deficiency in the tool set means that we have a poor understanding of how specific applications interact with storage. Worse, the community has little knowledge of what sorts of access patterns are common in today's applications, leading to confusion in the storage research community as to the pressing needs of the computational science community. This paper describes the Darshan I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with the minimum possible overhead. This characterization can shed important light on the I/O behavior of applications at extreme scale. Darshan also can enable researchers to gain greater insight into the overall patterns of access exhibited by such applications, helping the storage community to understand how to best serve current computational science applications and better predict the needs of future applications. In this work we demonstrate Darshan's ability to characterize the I/O behavior of four scientific applications and show that it induces negligible overhead for I/O intensive jobs with as many as 65,536 processes.
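To make "patterns of access within files" concrete, here is a minimal sketch of the kind of per-file counters a characterization tool might keep. It is an illustration only, not Darshan's code or log format. The class `AccessCounters` and its fields are hypothetical, and the definitions used (sequential: starts at or beyond the end of the previous access; consecutive: starts exactly at the end of the previous access) are assumptions made for this example.

```python
class AccessCounters:
    """Constant-size counters updated on every read or write to one file."""

    def __init__(self):
        self.total = 0          # number of accesses seen
        self.sequential = 0     # accesses starting at or past the previous end
        self.consecutive = 0    # accesses starting exactly at the previous end
        self.bytes = 0          # total bytes moved
        self._last_end = None   # end offset of the previous access

    def record(self, offset, length):
        if self._last_end is not None:
            if offset >= self._last_end:
                self.sequential += 1
            if offset == self._last_end:
                self.consecutive += 1
        self.total += 1
        self.bytes += length
        self._last_end = offset + length


counters = AccessCounters()
# (offset, length) pairs: one adjacent access, one forward jump, one rewind.
for offset, length in [(0, 100), (100, 100), (400, 50), (0, 10)]:
    counters.record(offset, length)

print(counters.total, counters.sequential, counters.consecutive, counters.bytes)
```

Keeping only a handful of integers per file, instead of a full trace, is one way to bound the memory and time cost of instrumentation.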

[16] Nawab Ali, Philip Carns, Kamil Iskra, Dries Kimpe, Samuel Lang, Robert Latham, Robert Ross, Lee Ward, and P. Sadayappan. Scalable I/O Forwarding Framework for High-Performance Computing Systems. In Proceedings of IEEE Conference on Cluster Computing, New Orleans, LA, September 2009. [ bib ]
[17] Sam Lang, Robert Latham, Dries Kimpe, and Robert Ross. Interfaces for Coordinated Access in the File System. In Proceedings of 2009 Workshop on Interfaces and Architectures for Scientific Data Storage, September 2009. [ bib ]
Distributed applications routinely use the file system for coordination of access and often rely on POSIX consistency semantics or file system lock support for coordination. In this paper we discuss the types of coordination many distributed applications perform and the coordination model they are restricted to using with locks. We introduce an alternative coordination model in the file system that uses extended attribute support in the file system to provide atomic operations on serialization variables. We demonstrate the usefulness of this approach for a number of coordination patterns common to distributed applications.

[18] Alok Choudhary, Wei-keng Liao, Kui Gao, Arifa Nisar, Robert Ross, Rajeev Thakur, and Robert Latham. Scalable I/O and Analytics. Journal of Physics: Conference Series, 180(012048), August 2009. Proceedings of SciDAC conference, 14-18 June 2009, San Diego, California, USA. [ bib ]
[19] Florin Isaila, Javier Garcia Blas, Jesus Carretero, Robert Latham, Samuel Lang, and Robert Ross. Latency Hiding File I/O for Blue Gene Systems. In CCGRID '09: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 212-219, Washington, DC, USA, 2009. IEEE Computer Society. [ bib | DOI ]
[20] Justin M. Wozniak, Bryan Jacobs, Robert Latham, Sam Lang, Seung Woo Son, and Robert Ross. Implementing reliable data structures for MPI services in high component count systems. In Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 321-322, Berlin, Heidelberg, 2009. Springer-Verlag. [ bib | DOI | http ]
[21] Robert Latham, William Gropp, Robert Ross, and Rajeev Thakur. Extending the MPI-2 Generalized Request Interface. Lecture Notes in Computer Science, pages 223-232, October 2007. (EuroPVM/MPI 2007). [ bib | DOI | http | .pdf ]
The MPI-2 standard added a new feature to MPI called generalized requests. Generalized requests allow users to add new nonblocking operations to MPI while still using many pieces of MPI infrastructure such as request objects and the progress notification routines (MPI_Test, MPI_Wait). The generalized request design as it stands, however, has deficiencies regarding typical use cases. These deficiencies are particularly evident in environments that do not support threads or signals, such as the leading petascale systems (IBM Blue Gene/L, Cray XT3 and XT4). This paper examines these shortcomings, proposes extensions to the interface to overcome them, and presents implementation results.

[22] Robert Latham, Robert Ross, and Rajeev Thakur. Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided Communication. International Journal of High Performance Computing Applications, 21(2):132-143, 2007. [ bib | DOI | arXiv | http | .pdf ]
The ROMIO implementation of the MPI-IO standard provides a portable infrastructure for use on top of a variety of underlying storage targets. These targets vary widely in their capabilities, and in some cases additional effort is needed within ROMIO to support all MPI-IO semantics. Two aspects of the interface that can be problematic to implement are MPI-IO atomic mode and the shared file pointer access routines. Atomic mode requires enforcing strict consistency semantics, and shared file pointer routines require communication and coordination in order to atomically update a shared resource. For some file systems, native locks may be used to implement these features, but not all file systems have lock support. In this work, we describe algorithms for implementing efficient mutex locks using MPI-1 and the one-sided capabilities from MPI-2. We then show how these algorithms may be used to implement both MPI-IO atomic mode and shared file pointer methods for ROMIO without requiring any features from the underlying file system. We show that these algorithms can outperform traditional file system lock approaches. Because of the portable nature of these algorithms, they are likely useful in a variety of situations where distributed locking or coordination is needed in the MPI-2 environment.

[23] Robert Latham, Robert Ross, and Rajeev Thakur. Can MPI be used for persistent parallel services? Lecture Notes in Computer Science, 4192:275-284, September 2006. [ bib | http | .pdf ]
[24] Hao Yu, R. K. Sahoo, C. Howson, George Almasi, J. G. Castanos, M. Gupta, Jose E. Moreira, J. J. Parker, T. E. Engelsiepen, Robert Ross, Rajeev Thakur, Robert Latham, and W. D. Gropp. High performance file I/O for the Blue Gene/L supercomputer. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA-12), pages 187-196, February 2006. [ bib | DOI | .pdf ]
Keywords: Blue Gene/L supercomputer; General Parallel File System; MPI; data-intensive application; functional partitioning design; hierarchical partitioning; high performance file I/O; parallel HDF5; parallel I/O benchmark; parallel NetCDF; parallel file I/O architecture; application program interfaces; benchmark testing; file organisation; message passing; parallel architectures; parallel machines
[25] Rajeev Thakur, Robert Ross, and Robert Latham. Implementing byte-range locks using MPI one-sided communication. Lecture Notes in Computer Science, September 2005. [ bib | .pdf ]
[26] Robert Latham, Robert Ross, and Rajeev Thakur. Implementing MPI-IO shared file pointers without file system support. Lecture Notes in Computer Science, September 2005. Selected as one of five Best Papers. Superseded by IJHPCA paper. [ bib | http | .pdf ]
The ROMIO implementation of the MPI-IO standard provides a portable infrastructure for use on top of any number of different underlying storage targets. These targets vary widely in their capabilities, and in some cases additional effort is needed within ROMIO to support all MPI-IO semantics. The MPI-2 standard defines a class of file access routines that use a shared file pointer. These routines require communication internal to the MPI-IO implementation in order to allow processes to atomically update this shared value. We discuss a technique that leverages MPI-2 one-sided operations and can be used to implement this concept without requiring any features from the underlying file system. We then demonstrate through a simulation that our algorithm adds reasonable overhead for independent accesses and very small overhead for collective accesses.
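The core requirement described here, atomically advancing a shared offset, amounts to a fetch-and-add. The sketch below simulates it with Python threads and a lock standing in for MPI processes and one-sided operations. It does not reproduce the paper's algorithm; `SharedFilePointer`, `fetch_and_add`, and `writer` are hypothetical names.

```python
import threading


class SharedFilePointer:
    """A shared offset that callers advance atomically."""

    def __init__(self):
        self._offset = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, nbytes):
        """Return the current offset, then advance it by nbytes."""
        with self._lock:
            start = self._offset
            self._offset += nbytes
            return start

    @property
    def offset(self):
        with self._lock:
            return self._offset


def writer(rank, pointer, sizes, log):
    """Claim one file region per write; record (start, length, rank)."""
    for nbytes in sizes:
        start = pointer.fetch_and_add(nbytes)
        log.append((start, nbytes, rank))  # list.append is atomic in CPython


pointer = SharedFilePointer()
log = []
threads = [
    threading.Thread(target=writer, args=(rank, pointer, [8, 16, 24], log))
    for rank in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

regions = sorted(log)  # order claimed regions by starting offset
print(f"{len(regions)} regions claimed, final offset {pointer.offset}")
```

Because each caller learns its starting offset and advances the pointer in a single atomic step, the claimed regions never overlap and leave no gaps, whatever the interleaving of the writers.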

[27] Robert Ross, Robert Latham, William Gropp, Rajeev Thakur, and Brian Toonen. Implementing MPI-IO atomic mode without file system support. In Proceedings of CCGrid 2005, May 2005. Superseded by IJHPCA paper. [ bib | .pdf ]
[28] Rob Latham, Rob Ross, and Rajeev Thakur. The impact of file systems on MPI-IO scalability. Lecture Notes in Computer Science, 3241:87-96, September 2004. [ bib | http | .pdf ]
As the number of nodes in cluster systems continues to grow, leveraging scalable algorithms in all aspects of such systems becomes key to maintaining performance. While scalable algorithms have been applied successfully in some areas of parallel I/O, many operations are still performed in an uncoordinated manner. In this work we consider, in three file system scenarios, the possibilities for applying scalable algorithms to the many operations that make up the MPI-IO interface. From this evaluation we extract a set of file system characteristics that aid in developing scalable MPI-IO implementations.

Keywords: scalability analysis, MPI-IO, pario-bib
[29] Jianwei Li, Wei-keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Rob Latham, Andrew Siegel, Brad Gallagher, and Michael Zingale. Parallel netCDF: A high-performance scientific I/O interface. In Proceedings of SC2003: High Performance Networking and Computing, SC '03, pages 39-, Phoenix, AZ, November 2003. IEEE Computer Society Press. [ bib | DOI | .pdf ]
Dataset storage, exchange, and access play a critical role in scientific applications. For such purposes netCDF serves as a portable, efficient file format and programming interface, which is popular in numerous scientific application domains. However, the original interface does not provide an efficient mechanism for parallel data storage and access.

In this work, we present a new parallel interface for writing and reading netCDF datasets. This interface is derived with minimal changes from the serial netCDF interface but defines semantics for parallel access and is tailored for high performance. The underlying parallel I/O is achieved through MPI-IO, allowing for substantial performance gains through the use of collective I/O optimizations. We compare the implementation strategies and performance with HDF5. Our tests indicate programming convenience and significant I/O performance improvement with this parallel netCDF (PnetCDF) interface.

Keywords: parallel I/O interface, netCDF, MPI-IO, pario-bib
