Data
The Darshan Data Repository
The Darshan data repository is a collection of anonymized log files that summarize the I/O characteristics of production scientific computing applications. The permanent home page for this data repository is http://www.mcs.anl.gov/darshan/data. At that link you can also find source code, information, publications, mailing lists, and other resources related to the Darshan characterization tool that was used to collect these log files.
Acknowledging use of the Darshan Data Repository
We would love to hear if you find this data helpful. Please let us know via the Darshan mailing list or by contacting Phil Carns or Kevin Harms directly.
If you use the data in a publication, please use the following citation that describes how the data was captured:
Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Robert Latham, Samuel Lang, and Robert Ross. Understanding and improving computational science storage access through continuous characterization. In Proceedings of the 27th IEEE Conference on Mass Storage Systems and Technologies (MSST), 2011.
Available Data Sets
The only data set currently available was captured on the Intrepid Blue Gene/P system at Argonne National Laboratory from January 1, 2010 to March 30, 2010. Darshan was enabled by default for all newly compiled MPI applications during that period, and it instrumented approximately 25% of all core hours consumed on Intrepid. The data set is therefore not a complete picture of all jobs that were run on the system. More detailed information about the coverage can be found in the MSST 2011 paper cited in the previous section.
About the log files
- There is a single log file for each MPI job that was instrumented (see following sections for tips on how to download multiple log files).
- In order to process these log files you must use Darshan 2.1.1 or newer, or else an svn trunk version of Darshan. The anonymized log files use bzip2 compression and a header format that is incompatible with previous Darshan releases.
- darshan-parser can be used to dump the contents of a log file, or darshan-job-summary.pl can be used to produce a graphical summary. The darshan-logutils.[ch] code provides an example of how to access the log files from a C program.
- IMPORTANT: please observe the warnings that appear at the top of the darshan-parser output for each file! Some characterization features may be missing or unreliable depending on the version of Darshan that was used to capture that particular log file. Multiple Darshan log versions are present in the data set. You can find more about the limitations of each log version in the darshan_log_print_version_warnings() function in darshan-logutils.c.
- The following wiki page describes the data fields that are included in each Darshan log: http://wiki.mcs.anl.gov/Darshan/index.php/Guide_to_darshan-parser_output. Note in particular that if the rank field is -1, the corresponding counters for that file represent aggregate statistics on a file that was shared across _all_ MPI ranks. If the rank field is 0 or larger, the corresponding counters refer to statistics gathered on a single process.
- The following fields have been anonymized in each log file and replaced with strings of numbers:
  - job id
  - uid
  - exe (command line)
  - file name suffix
  - project name (annotated in the “metadata” portion of the header)
- Those fields have been anonymized in a consistent manner, however, so you can still (for example) group log files by project. We just do not provide the true name of the project.
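The rank convention described above can be applied programmatically. The following Python sketch is not part of Darshan; it runs darshan-parser on one log file and separates shared-file records (rank -1) from per-process records (rank 0 or larger). It assumes that darshan-parser is on your PATH and that each record line is whitespace separated with the rank, a file identifier, the counter name, and the counter value in the first four columns, with comment lines starting with "#". Check that layout against the wiki guide listed above before relying on it. The function names run_darshan_parser and split_by_rank are hypothetical.

```python
import subprocess
from collections import defaultdict


def run_darshan_parser(log_path):
    """Return the text printed by `darshan-parser <log_path>`.

    Assumes the darshan-parser executable is available on PATH.
    """
    result = subprocess.run(
        ["darshan-parser", log_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_by_rank(parser_output):
    """Split counter records into shared-file and per-process groups.

    Assumed record layout (whitespace separated, '#' lines are comments):
        <rank> <file id> <counter name> <counter value> ...
    Counter values are kept as the raw strings printed by the parser.
    """
    shared = defaultdict(dict)       # file id -> {counter name: value}
    per_process = defaultdict(dict)  # (rank, file id) -> {counter name: value}
    for line in parser_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, header text, and warnings
        fields = line.split()
        if len(fields) < 4:
            continue  # not a counter record under the assumed layout
        try:
            rank = int(fields[0])
        except ValueError:
            continue  # first column is not a rank, so ignore the line
        file_id, counter, value = fields[1], fields[2], fields[3]
        if rank == -1:
            # rank -1: aggregate statistics for a file shared by all ranks
            shared[file_id][counter] = value
        else:
            # rank >= 0: statistics gathered on a single process
            per_process[(rank, file_id)][counter] = value
    return shared, per_process
```

A typical use would be `shared, per_process = split_by_rank(run_darshan_parser(path))`, where `path` points to one downloaded log file. The warnings printed at the top of the darshan-parser output are skipped by this sketch, so read them separately before trusting any counter.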
Downloading the data set
All data can be accessed from the following ftp directory: ftp://ftp.mcs.anl.gov/pub/darshan/data
The data is divided up into separate files for each instrumented MPI run and organized into subdirectories based on the year/month/day that the data was collected. One simple way to perform a bulk download of the data is to use the “wget” utility. For example, in order to download all March 2010 logs from the Intrepid collection, you would use the following command:
wget -r ftp://ftp.mcs.anl.gov/pub/darshan/data/intrepid/2010/3
Some ftp clients may also be able to perform recursive downloads in a similar manner.
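If you want one wget invocation per month, the commands can be generated rather than typed by hand. The following Python sketch only builds the command strings and does not download anything. It assumes that the January and February directories follow the same unpadded numbering as the "3" directory in the example above (that is, "1" and "2"); verify this by browsing the ftp directory first. The names BASE_URL, month_url, and wget_commands are hypothetical.

```python
BASE_URL = "ftp://ftp.mcs.anl.gov/pub/darshan/data"


def month_url(system, year, month):
    """Build the directory URL for one month of logs.

    Assumes month directories are not zero padded, as in .../2010/3.
    """
    return f"{BASE_URL}/{system}/{year}/{month}"


def wget_commands(system, year, months):
    """Return one recursive wget command string per requested month."""
    return [f"wget -r {month_url(system, year, m)}" for m in months]


if __name__ == "__main__":
    # Print the commands for the three months covered by the Intrepid data set.
    for command in wget_commands("intrepid", 2010, [1, 2, 3]):
        print(command)
```

Printing the commands instead of executing them lets you inspect each URL before starting a large transfer.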
Contact Information
Darshan mailing list:
https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
Phil Carns:
carns (at) mcs.anl.gov
Kevin Harms:
harms (at) alcf.anl.gov