Argonne National Laboratory

Bloomfish: A Highly Scalable Distributed K-mer Counting Framework

Title: Bloomfish: A Highly Scalable Distributed K-mer Counting Framework
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Gao, T, Guo, Y, Wei, Y, Wang, B, Lu, Y, Cicotti, P, Balaji, P, Taufer, M
Conference Name: ICPADS IEEE International Conference on Parallel and Distributed Systems
Conference Location: Shenzhen, China
Abstract: K-mer counting is a fundamental operation in DNA research and genome analytics; its application includes estimating genome assembly, understanding similarities in genomic samples, and merging a newly processed genome with a reference genome. As the genome dataset becomes larger and larger, designing a highly optimized distributed-memory implementation becomes more and more important. Current distributed-memory solutions have two limitations: they have a high memory footprint, and they do not provide advanced optimizations for loading enormous genome datasets into memory. Based on these observations, we present Bloomfish, a distributed, memory-efficient, scalable solution to the limits of current work. To keep a low memory footprint, Bloomfish leverages the compact hash array design of the single-node Jellyfish system and the optimized workflow of the high-performance MapReduce framework Mimir. We have also codesigned Mimir’s I/O to efficiently load enormous datasets. We ran Bloomfish on the Tianhe-2 supercomputer with large sequence datasets (up to 24 TB). Our results show that Bloomfish achieves unprecedented scalability in genome analytics.
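
For readers unfamiliar with the operation named in the abstract, the following is a minimal Python sketch of k-mer counting, with a toy hash-partitioning step of the kind a MapReduce-style design might use. It is not Bloomfish's, Jellyfish's, or Mimir's code: the function names (count_kmers, partition_of, partitioned_count, merge), the use of CRC32 as the partitioning hash, and the in-memory Counter tables are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def count_kmers(sequence, k):
    """Count every length-k substring (k-mer) of a single sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def partition_of(kmer, num_partitions):
    """Assign a k-mer to a partition with a stable hash (CRC32, an assumption)."""
    return zlib.crc32(kmer.encode("ascii")) % num_partitions


def partitioned_count(reads, k, num_partitions):
    """Map each read to k-mer counts, then route each k-mer to one partition.

    Each partition stands in for the local table of one process; because a
    given k-mer always hashes to the same partition, the tables are disjoint.
    """
    partitions = [Counter() for _ in range(num_partitions)]
    for read in reads:
        for kmer, n in count_kmers(read, k).items():
            partitions[partition_of(kmer, num_partitions)][kmer] += n
    return partitions


def merge(partitions):
    """Combine the disjoint per-partition tables into one global count."""
    total = Counter()
    for table in partitions:
        total.update(table)
    return total


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTA"]
    tables = partitioned_count(reads, k=4, num_partitions=3)
    print(merge(tables))
```

The sketch keeps whole k-mer strings in ordinary dictionaries, which is exactly the kind of memory cost the abstract says a compact hash array is meant to avoid; it is meant only to show what is being counted and why partitioning by k-mer lets separate tables be built independently.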