Using Masks to Improve Compression of Big Data in Scientific Applications

September 23, 2013

Big data is transforming science and engineering. But it is also raising major problems: how to store, access, and move that data efficiently. Data compression seems a logical answer to these problems; unfortunately, little work has been done on compression techniques for high-precision scientific data.

To address this problem, researchers in Argonne’s Mathematics and Computer Science Division have developed a masking technique that leads to a better compression ratio and higher throughput for scientific applications.

Most scientific applications use IEEE floating-point representation, and floating-point data usually exhibits high irregularity (entropy); see Fig. 1. Entropy is the enemy of compression. Thus, the question arises: How can one increase the regularity of scientific datasets?
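To make the entropy point concrete, the following minimal Python sketch measures the Shannon entropy of each byte position across a set of IEEE double-precision values. The sine-based "temperature-like" signal and the helper name byte_entropy are illustrative assumptions, not data or code from the Argonne study.

```python
import math
import struct
from collections import Counter


def byte_entropy(values, byte_index):
    """Shannon entropy, in bits, of one byte position across big-endian float64 values."""
    column = [struct.pack(">d", v)[byte_index] for v in values]
    counts = Counter(column)
    total = len(column)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical smooth signal standing in for a scientific variable (an assumption).
values = [300.0 + 5.0 * math.sin(i / 50.0) for i in range(4096)]

# Byte 0 holds the sign and the upper exponent bits; byte 7 holds the
# least significant mantissa bits.
entropies = [byte_entropy(values, k) for k in range(8)]
for k, h in enumerate(entropies):
    print(f"byte {k}: {h:.2f} bits of entropy")
```

On data of this kind, the leading bytes (sign and exponent) tend to be nearly constant, while the trailing mantissa bytes look close to random, which is what makes them hard for a general-purpose compressor.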

The solution that the Argonne team has devised involves applying binary masks to the data, so that each value is transformed into a highly regular bit sequence. To avoid what the researchers call the “silly case” (requiring storage of one mask for each value), they applied the same mask to small blocks of contiguous data, which usually contain values that are close to one another.
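The block-wise idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the masks are chosen or combined with the data, so the sketch assumes an XOR mask equal to the first value of each block, and the block size of 8 and all function names are hypothetical.

```python
import struct


def to_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def from_bits(b):
    """Reinterpret a 64-bit unsigned integer as a float64."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]


def mask_blocks(values, block_size=8):
    """XOR every value with one mask per block (assumed: the block's first value)."""
    masks, masked = [], []
    for start in range(0, len(values), block_size):
        block = [to_bits(v) for v in values[start:start + block_size]]
        mask = block[0]  # assumption: mask choice is not specified in the article
        masks.append(mask)
        masked.extend(b ^ mask for b in block)
    return masks, masked


def unmask_blocks(masks, masked, block_size=8):
    """Invert mask_blocks; XOR with the same mask restores the original bits."""
    return [from_bits(m ^ masks[i // block_size]) for i, m in enumerate(masked)]


# Hypothetical contiguous data with close values.
values = [300.0 + 0.001 * i for i in range(32)]
masks, masked = mask_blocks(values)
restored = unmask_blocks(masks, masked)

print(f"{to_bits(values[1]):064b}")  # original bit pattern
print(f"{masked[1]:064b}")           # masked: long run of leading zeros
```

Because neighboring values share their sign, exponent, and leading mantissa bits, the masked words begin with long runs of zeros, and only one mask per block has to be stored alongside them.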

Still, not all went well. “Our initial experiments showed that the masks increased the regularity in some parts of the floating-point data, but the less significant bits were kept highly irregular,” said Franck Cappello, senior computer scientist and one of the PIs on the investigation. “Storing masks when they provide no benefit is a waste of storage. Instead, we decided to mask and compress only a part of the floating-point values.”
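The partial approach Cappello describes can be illustrated with a short Python sketch that separates each double into a high-order part, which is compressed, and a low-order part, which is stored as is. The split point of 4 bytes, the use of zlib as a stand-in compressor, the synthetic signal, and the function names are all assumptions made for illustration; the masking step itself is omitted to keep the example short.

```python
import math
import struct
import zlib


def split_bytes(values, high_bytes=4):
    """Split big-endian float64 values into high-order and low-order byte streams."""
    packed = [struct.pack(">d", v) for v in values]
    high = b"".join(p[:high_bytes] for p in packed)
    low = b"".join(p[high_bytes:] for p in packed)
    return high, low


def join_bytes(high, low, high_bytes=4):
    """Rebuild the float64 values from the two byte streams."""
    low_bytes = 8 - high_bytes
    count = len(high) // high_bytes
    return [
        struct.unpack(
            ">d",
            high[i * high_bytes:(i + 1) * high_bytes]
            + low[i * low_bytes:(i + 1) * low_bytes],
        )[0]
        for i in range(count)
    ]


# Hypothetical smooth signal standing in for a scientific variable (an assumption).
values = [300.0 + 5.0 * math.sin(i / 50.0) for i in range(4096)]

high, low = split_bytes(values)
compressed_high = zlib.compress(high)  # only the regular part is compressed
stored_size = len(compressed_high) + len(low)  # the irregular part is kept raw

restored = join_bytes(zlib.decompress(compressed_high), low)
print(f"raw: {len(values) * 8} bytes, stored: {stored_size} bytes")
```

Skipping the low-order bytes means the compressor never spends time on data it cannot shrink, which is one plausible reading of why the approach improves throughput as well as compression ratio.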

That approach worked. The compression ratio increased by 15% compared with compressing the plain dataset. Moreover, even though the masking process takes significant time, the combined masking and compression speed improved substantially compared with direct compression, because the approach avoids the high-entropy parts of the data.

To demonstrate that the high compression ratios were not due to a high degree of homogeneity in the initial conditions of the experiments, the researchers also ran an experiment in which they compressed the temperature variable of a long hurricane simulation.

“The results show that although the compressed size increases slightly during the first hours of the simulation, the irregularity of the datasets stabilizes, and our approach still guarantees a high compression ratio,” said Leonardo Bautista Gomez, a postdoctoral appointee working with Cappello on the project.

The two researchers presented their work, “Improving Floating-Point Compression through Binary Masks,” at the IEEE International Conference on Big Data 2013, held in Santa Clara, CA, in early October.