Grand Challenge Problem: Data Acquisition and Analysis in Structural Biology

Gregor von Laszewski
Ian Foster
{gregor,itf}@mcs.anl.gov
Mary Westbrook
mwestbrook@anl.gov
Last updated: 05/14/1997

Introduction

The Structural Biology Center (SBC) at Argonne National Laboratory's Advanced Photon Source (APS) is a national user facility for the acquisition, processing, and archiving of x-ray diffraction data from crystals of large biological molecules: proteins, nucleic acids, viruses, and complexes of these molecules. These studies permit molecular biologists to understand the structural details, at molecular and atomic levels, of the chemical activities and functions of these extremely complex objects, which are at the very basis of life and health [SC97].

The experimental setup used by the structural biologists is depicted in Figure 1. A beam from the APS is directed into the experimental setup. The crystal to be examined is placed in the path of the beam, in front of a detector. The detector records the x-rays that are diffracted by the crystal and reach the detector screen; each exposure produces one image. During the experiment the crystal sample is rotated, so a series of images is collected. Post-processing of this series allows a detailed analysis of the structure of the crystal. One result of the data analysis is a three-dimensional view of the crystal, which is used to study the properties of the sample.

Question: The detector is named mosaic. Does this indicate that the picture is already decomposed? Are the pieces assembled into one picture later? If the right decomposition is used, this assembly might not be necessary.


Figure 1: A more detailed picture of the insertion device.

This image displays the beam and the display

Figure 2: The experimental setup.
Each image is stored on a mass storage system (RAID disks) to enable post-processing of the data (Figure 2). Each image is about 18 MB in size. During an experiment, about 250 to 1000 images are generated and recorded on the mass storage unit. The detector can transfer data at a maximum rate of about 10 MB/s, so a new image is expected about every 1.8 seconds. All attached hardware and software must therefore be designed to sustain this constant data stream. To achieve this throughput on the mass storage unit, the images are striped across multiple RAID arrays. From the computational science point of view, the following observations are essential for the development and deployment of a high-performance infrastructure supporting the experiments conducted by the experts in their field:
As a consequence of the observations above, it is clear that a high-performance computing facility such as the one available at Argonne National Laboratory, MCS, can help in some of the areas indicated above. Figure 3 shows the network structure used in the experiments. The post-processing of the data will take place on the SP2, which is connected via an OC-3 ATM link to the APS experiment setup (e.g., to the file server that is connected to the RAID array where the images are stored).
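The data rates quoted above can be checked with a short calculation. The following is a minimal sketch, not part of the SBC software: the image size, detector rate, and image counts come from the text, while the function names are hypothetical. The nominal OC-3 line rate of 155.52 Mbit/s and the reading of 1 MB as 10^6 bytes are assumptions that do not come from this page.

```python
# Back-of-the-envelope data rates for one experiment.
IMAGE_MB = 18.0           # approximate image size (from the text)
DETECTOR_MB_PER_S = 10.0  # maximum detector transfer rate (from the text)
OC3_MBIT_PER_S = 155.52   # ASSUMPTION: nominal OC-3 line rate; the usable
                          # ATM payload rate is lower than this


def seconds_per_image(image_mb=IMAGE_MB, rate_mb_per_s=DETECTOR_MB_PER_S):
    """Time the detector needs to deliver one image."""
    return image_mb / rate_mb_per_s


def experiment_volume_gb(n_images, image_mb=IMAGE_MB):
    """Total data volume of an experiment in GB (1 GB = 1000 MB)."""
    return n_images * image_mb / 1000.0


def experiment_duration_s(n_images):
    """Time to stream all images at the maximum detector rate."""
    return n_images * seconds_per_image()


def oc3_best_case_seconds(image_mb=IMAGE_MB, link_mbit_per_s=OC3_MBIT_PER_S):
    """Lower bound on the time to move one image over the OC-3 link."""
    return image_mb * 8.0 / link_mbit_per_s


if __name__ == "__main__":
    for n in (250, 1000):
        print(f"{n} images: {experiment_volume_gb(n):.1f} GB, "
              f"{experiment_duration_s(n) / 60.0:.1f} min at detector rate")
    print(f"one image over OC-3 (best case): {oc3_best_case_seconds():.2f} s")
```

Under these assumptions an experiment produces roughly 4.5 GB to 18 GB of data. The best-case time to move one image over the link is below the 1.8 s interval between images, but the margin shrinks once protocol overhead and disk throughput are taken into account.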


Figure 3: The network configuration

High Performance Computing

To support the rapid development of a functional high-performance computing infrastructure for the experiments conducted at the Structural Biology Center, the following analysis can be helpful. It can also direct future software development by the computational science experts who provide the d*trek software.

A three-step approach is suggested. It allows the rapid development of a prototype as well as the software enhancements needed to address issues such as load imbalance on a massively parallel computer.
The steps involve

Details:

We notice that there are two potential domains: one defined by the representation of the image, the other defined by the structure of the mathematical model or computation imposed on the three-dimensional image. Thus, we obtain the following domains:

Problem: We do not yet know enough about the algorithm itself to specify a domain decomposition.
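For the first of the two domains, the image representation, a decomposition can be sketched independently of the analysis algorithm. The example below is only an illustration under stated assumptions: it splits a two-dimensional image into a grid of rectangular tiles, one per processor. The function names are hypothetical, and the 3072 x 3072 image size and 3 x 3 processor grid are placeholders that do not come from this page.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    bounds = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks receive one additional element.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def block_decompose(height, width, p_rows, p_cols):
    """Return tiles as (row_start, row_stop, col_start, col_stop), in row-major order.

    Tile k would be assigned to processor k of a p_rows x p_cols processor grid.
    """
    return [
        (r0, r1, c0, c1)
        for (r0, r1) in split_range(height, p_rows)
        for (c0, c1) in split_range(width, p_cols)
    ]


if __name__ == "__main__":
    # ASSUMPTION: placeholder image size and processor grid.
    for rank, tile in enumerate(block_decompose(3072, 3072, 3, 3)):
        print(rank, tile)
```

A decomposition along the second domain, the structure of the computation, cannot be sketched in the same way until the algorithm is better understood.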

Load Balancing

Load imbalance is one of the problems that can lead to inefficient use of a massively parallel computing system.
The phrase "I can be only as fast as my slowest component" is to be taken literally. The proposed approach is intended to avoid load-balancing problems by using sophisticated but fast mapping and decomposition algorithms that minimize load imbalance. Dynamic load-balancing algorithms will additionally be able to adapt automatically to the given problem domain.
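To make the idea of a mapping that limits load imbalance concrete, the following sketch uses a well-known greedy heuristic (longest processing time first): tasks are sorted by estimated cost and each is assigned to the processor that currently has the least work. This is a stand-in chosen for illustration, not the mapping algorithm proposed here, and the function names and task costs are hypothetical.

```python
import heapq


def greedy_map(costs, n_procs):
    """Map tasks to processors with the longest-processing-time-first heuristic.

    costs[t] is the estimated cost of task t. Returns (assignment, loads), where
    assignment[t] is the processor of task t and loads[p] is the total cost on p.
    """
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    # Place expensive tasks first, always on the least-loaded processor.
    for task in sorted(range(len(costs)), key=lambda t: costs[t], reverse=True):
        load, proc = heapq.heappop(heap)
        assignment[task] = proc
        heapq.heappush(heap, (load + costs[task], proc))
    loads = [0.0] * n_procs
    for task, proc in assignment.items():
        loads[proc] += costs[task]
    return assignment, loads


def imbalance(loads):
    """Ratio of the maximum load to the mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


if __name__ == "__main__":
    # ASSUMPTION: made-up task costs, e.g. estimated work per image tile.
    assignment, loads = greedy_map([5, 3, 3, 2, 2, 1], n_procs=2)
    print(assignment, loads, imbalance(loads))
```

The imbalance ratio reflects the "slowest component" remark directly: the run time is governed by the maximum load, so a ratio well above 1.0 means that most processors sit idle while one finishes.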


Figure 4: From domain decomposition to mapping.

Experimental Results


Performance measurement of the data transfer between the RAID disks and the SP2

Potential problems: /tmp is full (it is shared space); another experiment is using the connection; Unix process priorities.

Table 1: Experimental results as of mm/dd/1997.

Size of data    ftp transfer    compression    uncompression
/ 18 MB         time in s       in s           in s
1
10
25
50

Transport of data to SP2 at MCS.

ftp compressed
ftp uncompressed (if it makes sense)
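The compression and uncompression columns of Table 1 could be filled with a small timing harness such as the one below. This is a sketch under assumptions: it uses Python's zlib as a stand-in for whatever compression tool is actually used in the experiments, the function name is hypothetical, and the synthetic test data says nothing about how well real diffraction images compress. The ftp transfer itself is not covered.

```python
import time
import zlib


def time_compression(data, level=6):
    """Measure compression and decompression time for one block of data.

    ASSUMPTION: zlib stands in for the compression tool used in practice.
    """
    t0 = time.perf_counter()
    packed = zlib.compress(data, level)
    t1 = time.perf_counter()
    unpacked = zlib.decompress(packed)
    t2 = time.perf_counter()
    if unpacked != data:
        raise ValueError("round trip did not reproduce the input")
    return {
        "raw_bytes": len(data),
        "packed_bytes": len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


if __name__ == "__main__":
    # Synthetic stand-in for image data; real measurements would read the
    # image files from the RAID file server instead.
    sample = bytes(range(256)) * 4096  # 1 MiB of repetitive data
    print(time_compression(sample))
```

Comparing the measured compression time plus the transfer time of the compressed data against the transfer time of the raw data indicates whether compressed transfer makes sense for a given link.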

References

[0] This page is available at http://www.mcs.anl.gov/gregor/projects
SBC Control System

[1] M.L. Westbrook, T.A. Coleman, R.T. Daly, J.W. Pflugrath, Data Acquisition and Analysis at the Structural Biology Center, in Proceedings IUCr Computing School, Poly Crystal Book Service, in press, 1996

[2] http://www.sbc.anl.gov, An introduction to the Structural Biology Center

[3] http://www.sbc.anl.gov/present/sc97.html, An abstract for a submission to Supercomputing.

[4] [Not available, to be ordered] Methods in Enzymology, Volume 276, Colowick, Macromolecular Crystallography, Part A. Two books at $99.00 each.

[5] Gregor von Laszewski, Four dimensional data analysis in atmospheric science and its implications on a metacomputing environment, Ph.D. Thesis, Syracuse University, Syracuse, DAO at NASA Goddard Space Flight Center, Greenbelt, MD, 1996