Software



DIY

Tess

Decaf

LowFive

Wilkins

MFA



DIY: Do-it-Yourself Analysis


An open-source package of scalable building blocks for data movement tailored to the needs of large-scale parallel analysis workloads

Installation (Linux, Mac, supercomputers, computing clusters):
Download DIY with the following command:

git clone https://github.com/diatomic/diy


and follow the instructions in the README.
Documentation can be found here.

Description:
Scalable, parallel analysis of data-intensive computational science relies on the decomposition of the analysis problem into a large number of data-parallel subproblems, the efficient data exchange among them, and data transport between them and the memory/storage hierarchy. The abstraction enabling these capabilities is block-based parallelism; blocks and their message queues are mapped onto processing elements (MPI processes or threads) and are migrated between memory and storage by the DIY runtime. Configurable data partitioning, scalable data exchange, and efficient parallel I/O are the main components of DIY. The current version of DIY has been completely rewritten to support distributed- and shared-memory parallel algorithms that can run both in- and out-of-core with the same code. The same program can be executed with one or more threads per MPI process and with one or more data blocks resident in main memory. Computational scientists, data analysis researchers, and visualization tool builders can all benefit from these tools.
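A minimal sketch of what a block-parallel DIY program looks like follows. The class and helper names (diy::mpi::environment, diy::Master, diy::ContiguousAssigner, diy::RegularDecomposer) follow the DIY2 examples, but exact signatures vary between versions, so treat this as an assumption to be checked against the documentation rather than a drop-in program.

// Sketch of a block-parallel DIY program: decompose a 3-d domain into blocks,
// run a per-block operation, and exchange messages between neighboring blocks.
// Names follow the DIY2 examples; exact signatures may differ between versions.
#include <vector>

#include <diy/mpi.hpp>
#include <diy/master.hpp>
#include <diy/assigner.hpp>
#include <diy/decomposition.hpp>
#include <diy/link.hpp>

struct Block
{
    std::vector<double> values;                      // per-block local data
};

int main(int argc, char** argv)
{
    diy::mpi::environment  env(argc, argv);          // RAII wrapper for MPI_Init/Finalize
    diy::mpi::communicator world;                    // wraps MPI_COMM_WORLD

    typedef diy::ContinuousBounds      Bounds;
    typedef diy::RegularContinuousLink Link;

    int nblocks = 4 * world.size();                  // more blocks than processes is fine
    diy::Master             master(world, 1, -1);    // 1 thread/process, all blocks in core
    diy::ContiguousAssigner assigner(world.size(), nblocks);

    Bounds domain(3);                                // unit cube [0,1]^3
    for (int i = 0; i < 3; ++i) { domain.min[i] = 0.0; domain.max[i] = 1.0; }

    // regular decomposition: create the local blocks and their neighbor links
    diy::RegularDecomposer<Bounds> decomposer(3, domain, nblocks);
    decomposer.decompose(world.rank(), assigner,
        [&](int gid, const Bounds&, const Bounds&, const Bounds&, const Link& link)
        {
            master.add(gid, new Block, new Link(link));
        });

    // run a local operation on every resident block: enqueue data to each neighbor
    master.foreach([](Block* b, const diy::Master::ProxyWithLink& cp)
        {
            for (int i = 0; i < cp.link()->size(); ++i)
                cp.enqueue(cp.link()->target(i), b->values);
        });

    master.exchange();                               // move the queued messages between blocks
    return 0;
}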


DIY structure


Above: The DIY software structure. The master manages block placement in the memory/storage hierarchy and executes block operations with one or more threads. The decomposer creates blocks with regularly defined decomposition patterns, and the assigner maps one or more blocks to an MPI process. Communication occurs between blocks using regular patterns. The I/O module provides collective and independent block I/O. Common tasks are provided by a set of ready-to-use algorithms.


DIY performance


DIY has demonstrated strong scaling out to 512K processes on existing machines in a diverse array of science and analysis codes, including cosmology, molecular dynamics, nuclear engineering, co-design, astrophysics, combustion, and synchrotron light source imaging. The figure above shows a benchmark of strong and weak scaling of parallel Delaunay tessellations, one of the libraries built on top of DIY. [SC14 paper]

More information and citing DIY:

Please refer to our LDAV'11 paper,
pdf bibtex
and our LDAV'16 paper,
pdf bibtex

Authors:
DIY is a collaboration between Tom Peterka of Argonne National Laboratory and Dmitriy Morozov of Lawrence Berkeley National Laboratory.



Tess: Parallel Delaunay and Voronoi Tessellation


An open-source package for parallelizing Delaunay and Voronoi tessellation on distributed-memory HPC architectures

Installation (Linux, Mac, supercomputers, computing clusters):
Download Tess with the following command:

git clone https://github.com/diatomic/tess2


and follow the instructions in the README.

Description:
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization, but as the scale of simulations and observations surpasses billions of particles, existing serial and shared-memory algorithms no longer suffice, and a scalable distributed-memory parallel algorithm is the only feasible approach. Tess is a parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Tessellating at scale performs poorly when the input data are unbalanced, which is why Tess uses a k-d tree to distribute points evenly among processes. Running times are up to 100 times faster with the k-d tree decomposition than with a regular grid decomposition, and for unbalanced datasets, decomposing the domain into a k-d tree is up to five times faster than decomposing it into a regular grid.
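The k-d tree idea can be illustrated with a small serial sketch. This shows only the concept of balancing point counts by median splits; Tess builds the decomposition in parallel over MPI with DIY, which this sketch does not attempt to reproduce.

// Illustrative serial sketch of k-d tree load balancing (concept only, not Tess's
// distributed implementation): recursively split at the median along alternating
// axes so that every leaf (block) holds roughly the same number of points.
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::array<double, 3>;
using Range = std::pair<std::size_t, std::size_t>;    // [begin, end) indices into the point array

void kdsplit(std::vector<Point>& pts, std::size_t begin, std::size_t end,
             int axis, int depth, std::vector<Range>& leaves)
{
    if (depth == 0)
    {
        leaves.emplace_back(begin, end);              // one balanced leaf = one block
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;      // median position along this axis
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    kdsplit(pts, begin, mid, (axis + 1) % 3, depth - 1, leaves);
    kdsplit(pts, mid,   end, (axis + 1) % 3, depth - 1, leaves);
}

// Example: split pts into 2^6 = 64 equally populated blocks.
// std::vector<Range> leaves; kdsplit(pts, 0, pts.size(), 0, 6, leaves);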


Tess performance


Above: Cosmology simulations are a prime example of severe load imbalance, as dark matter particles cluster into halos and voids whose densities can vary by six orders of magnitude. The strong scaling of parallel Delaunay tessellation improves by approximately 100X when blocks are decomposed with a DIY k-d tree rather than a regular grid.

More information and citing Tess:
Please refer to our SC14 paper,
pdf bibtex
and our SC16 paper,
pdf bibtex

Authors:
Tess is a collaboration between Tom Peterka of Argonne National Laboratory and Dmitriy Morozov of Lawrence Berkeley National Laboratory.



Decaf: Decoupled Dataflows


An open-source package of dataflow communication for in situ HPC data analysis workflows

Installation (Linux, Mac, supercomputers, computing clusters):
Download Decaf with the following command:

git clone https://github.com/tpeterka/decaf


and follow the instructions in the README.

Description:
Decaf is a dataflow system for the parallel communication of coupled tasks in an HPC workflow. The dataflow can perform arbitrary data transformations ranging from simply forwarding data to complex data redistribution. Decaf does this by allowing the user to allocate resources and execute custom code in the dataflow. All communication through the dataflow is efficient parallel message passing over MPI. The runtime for calling tasks is entirely message-driven; Decaf executes a task when all messages for the task have been received. Such a message-driven runtime allows cyclic task dependencies in the workflow graph, for example, to enact computational steering based on the result of downstream tasks. Decaf includes a simple Python API for describing the workflow graph. This allows Decaf to stand alone as a complete workflow system, but Decaf can also be used as the dataflow layer by one or more other workflow systems to form a heterogeneous task-based computing environment.
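The message-driven firing rule described above can be sketched in a few lines. This is only an illustration of the idea, not Decaf's API; in Decaf the bookkeeping and message delivery happen inside the runtime over MPI, and the task body is user code.

// Illustrative sketch (not Decaf's API) of a message-driven dispatch rule:
// a task fires only once a message has arrived on every one of its inbound links.
#include <functional>
#include <map>
#include <string>
#include <vector>

struct Task
{
    std::vector<std::string>    inputs;              // names of inbound links
    std::map<std::string, bool> arrived;             // which inputs currently hold a message
    std::function<void()>       run;                 // user task body

    bool ready() const                               // all inbound messages received?
    {
        for (const auto& in : inputs)
            if (!arrived.count(in) || !arrived.at(in))
                return false;
        return true;
    }
};

void deliver(Task& t, const std::string& link)       // called when a message arrives on a link
{
    t.arrived[link] = true;
    if (t.ready())
    {
        t.run();                                      // execute the task
        for (auto& kv : t.arrived) kv.second = false; // reset for the next round (allows cycles)
    }
}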


Decaf structure


Above: The Decaf software structure. Decaf is a dataflow middleware for coupling HPC simulation and analytics codes. Decaf includes distributed and configurable dataflow links, standard data redistribution patterns, flow control, and automatic data filtering.


Decaf performance


Decaf has demonstrated strong scaling out to 1280 processes on existing machines in numerous science and analysis codes, including cosmology, molecular dynamics, and light source imaging. The figure above shows a benchmark of strong scaling of an HPC workflow consisting of computational cosmology, parallel Voronoi tessellation, and parallel density estimation of gravitational lensing due to dark matter.

More information and citing Decaf:
Please refer to our ISAV'15 paper,
pdf bibtex
and our Cluster'16 paper,
pdf bibtex

Authors:
Decaf is a collaboration between Tom Peterka and Orcun Yildiz of Argonne National Laboratory and Matthieu Dreher, formerly of Argonne National Laboratory.



LowFive: In Situ Data Transport


An open-source package of high-performance data transport for in situ HPC workflows

Installation (Linux, Mac, supercomputers, computing clusters):
Download LowFive with the following command:

git clone https://github.com/diatomic/LowFive


and follow the instructions in the README.

Description:
LowFive is a data transport layer for in situ workflows based on the HDF5 data model. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), read and write traditional HDF5 files on physical storage, or combine the two modes. Minimal and often no source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes.
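As a concrete illustration of the "no source-code modification" point, a producer task that already writes with the plain HDF5 C API looks like the following sketch (it compiles as C++; the file and dataset names are arbitrary examples, not part of LowFive). When LowFive is enabled to intercept the HDF5 calls, the same write can be delivered to a consumer task as MPI messages instead of, or in addition to, a file on disk.

// A producer that writes particle coordinates with the standard HDF5 C API.
// Nothing here is LowFive-specific: the point is that code like this is what
// LowFive can intercept and redirect in situ.
#include <hdf5.h>
#include <vector>

int main()
{
    std::vector<double> x(1000, 0.0);                // particle x-coordinates (dummy data)

    hid_t file  = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[1] = { x.size() };
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate(file, "x", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, x.data());

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}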


LowFive example


Above: An example of three tasks coupled through the LowFive in situ data transport library. User task codes are unchanged: they read and write what they think are files, while LowFive intercepts those I/O calls and redirects them as MPI messages over the supercomputer's high-performance interconnect. Tasks can still elect to write to physical storage as well.


LowFive performance


Above: Time to write and read grid and particle data between one producer task and one consumer task, comparing LowFive's file and memory modes in a weak-scaling regime. The results clearly demonstrate the advantage of in situ data transport over file-based data transport (lower is better).

More information and citing LowFive:
Please refer to our IPDPS'23 paper,
pdf bibtex

Authors:
LowFive is a collaboration between Tom Peterka and Orcun Yildiz of Argonne National Laboratory and Dmitriy Morozov and Arnur Nigmetov of Lawrence Berkeley National Laboratory.



Wilkins: In Situ Workflow Management


An open-source in situ HPC workflow management system

Installation (Linux, Mac, supercomputers, computing clusters):
Download Wilkins with the following command:

git clone https://github.com/orcunyildiz/wilkins


and follow the instructions in the README.

Description:
Wilkins is an in situ workflow system designed for ease of use while providing scalable and efficient execution of workflow tasks. Wilkins provides a flexible workflow description interface, employs a high-performance data transport layer based on HDF5, and supports tasks with disparate data rates through a flow control mechanism. Wilkins seamlessly couples scientific tasks that already use HDF5, without requiring task code modifications.


Wilkins workflow


Above: Logically, an in situ HPC workflow is modeled as a directed graph of tasks (parallel programs) and data dependencies (parallel communications). The graph does not need to be acyclic.


Wilkins molecular dynamics application (left) and Wilkins performance (right)


Left: A molecular dynamics application using Wilkins couples the LAMMPS molecular dynamics code with a parallel feature detector that extracts diamond-shaped nucleated crystals. Because nucleation is a rare stochastic event, an ensemble of many LAMMPS instances needs to be executed. Right: Performance scales well with an increasing number of ensemble instances (horizontal is ideal).

More information and citing Wilkins:
Please refer to our arXiv paper,
pdf bibtex

Authors:
Wilkins is a collaboration between Tom Peterka and Orcun Yildiz of Argonne National Laboratory and Dmitriy Morozov and Arnur Nigmetov of Lawrence Berkeley National Laboratory.



MFA: Multivariate Functional Approximation


An open-source package for modeling scientific data with functional approximations based on high-dimensional multivariate B-spline and NURBS bases

Installation (Linux, Mac, supercomputers, computing clusters):
Download MFA with the following command:

git clone https://github.com/tpeterka/mfa


and follow the instructions in the README.

Description:
Scientific data may be transformed by recasting it to a fundamentally different kind of data model than the discrete point-wise or element-wise datasets produced by computational models. In Multivariate Functional Approximation (MFA), scientific datasets are redefined in a hypervolume of piecewise-continuous basis functions. The MFA model is agnostic to the mesh, field, or discretization of the input dataset, so it can represent numerous types of data. Compared with existing discrete data models, it enables many of the same spatiotemporal analyses without converting the entire dataset back to the original discrete form, and it often occupies less storage space than the original discrete data, providing some data reduction depending on data complexity and intended usage. For example, noise may be intentionally smoothed using a small number of control points and high-degree basis functions; alternatively, high-frequency data features may be preserved with more control points and lower degree. Post hoc, the MFA enables analytical closed-form evaluation of points and derivatives, to high order, anywhere inside the domain, without being limited to the locations of the input data points.
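For reference, the underlying model is the standard tensor-product B-spline form, written here for illustration; MFA's specific extensions, such as adaptive knot placement and NURBS weights, are described in the papers cited below. A 1-d curve with degree-p basis functions N_{i,p} over a knot vector and control points P_i, and its d-dimensional tensor-product extension, are

f(u) = \sum_{i=0}^{n} N_{i,p}(u) \, P_i

f(u_1, \ldots, u_d) = \sum_{i_1=0}^{n_1} \cdots \sum_{i_d=0}^{n_d} N_{i_1,p_1}(u_1) \cdots N_{i_d,p_d}(u_d) \, P_{i_1 \cdots i_d}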


MFA curves (top) and MFA tensor products (bottom)


Above: Basis functions and formulation of 1-d curves at top and extension to n-d tensor products below.
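A minimal sketch of evaluating a 1-d curve with the textbook Cox-de Boor recursion is shown below, for illustration only; it is not MFA's implementation, which evaluates more efficiently using local support. The knot vector t is assumed to have ctrl.size() + p + 1 entries, and the half-open test at degree 0 ignores the u = t_max edge case.

// Textbook Cox-de Boor recursion for B-spline basis functions (illustration only).
#include <vector>

double basis(int i, int p, double u, const std::vector<double>& t)
{
    if (p == 0)
        return (t[i] <= u && u < t[i + 1]) ? 1.0 : 0.0;
    double left = 0.0, right = 0.0;
    if (t[i + p] > t[i])                              // 0/0 convention: skip vanishing spans
        left  = (u - t[i]) / (t[i + p] - t[i]) * basis(i, p - 1, u, t);
    if (t[i + p + 1] > t[i + 1])
        right = (t[i + p + 1] - u) / (t[i + p + 1] - t[i + 1]) * basis(i + 1, p - 1, u, t);
    return left + right;
}

// Evaluate a 1-d curve f(u) = sum_i N_{i,p}(u) * ctrl[i] anywhere in the domain.
double eval_curve(double u, int p, const std::vector<double>& ctrl,
                  const std::vector<double>& t)
{
    double f = 0.0;
    for (std::size_t i = 0; i < ctrl.size(); ++i)
        f += basis(static_cast<int>(i), p, u, t) * ctrl[i];
    return f;
}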


MFA parallel performance


Above: MFA has been applied to the modeling of various scientific datasets and scaled on parallel clusters and supercomputers.

More information and citing MFA:

Please refer to our LDAV'18 paper,
pdf bibtex
our Cluster'19 paper,
pdf bibtex
our ICCS'19 paper,
pdf bibtex
or our CAD'20 paper,
pdf bibtex

Authors:
MFA is a collaboration between Tom Peterka, David Lenz, Iulian Grindeanu, and Vijay Mahadevan of Argonne National Laboratory; Youssef Nashed, formerly of Argonne National Laboratory; and Raine Yeh and Xavier Tricoche of Purdue University.