Argonne National Laboratory

Semantics-based Distributed I/O for mpiBLAST

Title: Semantics-based Distributed I/O for mpiBLAST
Publication Type: Conference Paper
Year of Publication: 2008
Authors: Balaji, P; Feng, W; Archuleta, J; Lin, H; Kettimuthu, R; Thakur, R; Ma, X
Conference Name: High Performance Distributed Computing 2008 (HPDC 2008)
Conference Location: Boston, MA
Other Numbers: ANL/MCS-P1482-0108
Abstract: BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search, in parallel, unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance on clusters with fast local filesystems, its I/O processing remains a scalability concern, especially on systems with limited I/O capabilities, such as distributed filesystems spread across a wide-area network. We therefore present ParaMEDIC, a novel environment that uses application-specific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute workers and I/O workers. Compute workers, instead of writing the output directly to the filesystem, process the output using semantic knowledge about the application to generate metadata, and write that metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, accelerating mpiBLAST by as much as 25-fold.
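
The compute-worker / I/O-worker split described in the abstract can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: the abstract does not say what the metadata contains, so the sketch assumes it is simply the list of identifiers of matching database entries, and it assumes the I/O worker can read those entries itself. The function names (`search`, `compute_worker`, `io_worker`) and the substring-matching rule are hypothetical stand-ins, not mpiBLAST or ParaMEDIC APIs.

```python
# Toy illustration only: the names and the matching rule below are
# hypothetical and are not part of mpiBLAST or ParaMEDIC.

def search(query, database):
    """Stand-in for a sequence search: report every entry containing the query."""
    lines = []
    for seq_id in sorted(database):
        sequence = database[seq_id]
        pos = sequence.find(query)
        if pos != -1:
            lines.append(f"{seq_id}\tpos={pos}\t{sequence}")
    return "\n".join(lines)


def compute_worker(query, database):
    """Search the full database, but emit only compact metadata.

    Assumption: the metadata is just the sorted list of matching entry IDs.
    """
    return [seq_id for seq_id in sorted(database) if query in database[seq_id]]


def io_worker(query, database, metadata):
    """Near the storage: re-run the search on only the entries the metadata names."""
    subset = {seq_id: database[seq_id] for seq_id in metadata}
    return search(query, subset)


database = {
    "seq1": "ACGTACGTGA",
    "seq2": "TTTTGGGGCC",
    "seq3": "GGACGTTTAA",
    "seq4": "CCCCCCCCCC",
}
query = "ACGT"

full_output = search(query, database)           # what a worker would normally write
metadata = compute_worker(query, database)      # what crosses the slow link instead
recreated = io_worker(query, database, metadata)
```

In this toy, `recreated` is identical to `full_output`, while only the short ID list has to cross the slow link. The trade is extra computation on the I/O side in exchange for less data movement, which only pays off when the link to storage is the bottleneck.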