Grand Challenge Problem:
Data Acquisition and Analysis
in Structural Biology
Gregor von Laszewski
Ian Foster
{gregor,itf}@mcs.anl.gov
Mary Westbrook
mwestbrook@anl.gov
Last updated: 05/14/1997
Introduction
The Structural Biology Center (SBC) at Argonne National Laboratory's
Advanced Photon Source (APS) is a national user facility for the
acquisition, processing, and archiving of x-ray diffraction data from
crystals of large biological molecules: proteins, nucleic acids, viruses,
and complexes of these molecules. These studies permit molecular
biologists to understand the structural details, at the molecular and atomic
levels, of the chemical activities and functions of these extremely complex
objects, which are at the very basis of life and health [SC97].
The experimental setup used by the structural biologists is depicted
in Figure 1. A beam from the Advanced Photon Source (APS) is redirected
into the experimental setup. The crystal to be examined is placed in the
path of the beam, in front of a detector device. The detector registers
the particles that are emitted when the beam hits the crystal and that
reach the detector screen. This produces one image. During the experiment
the crystal sample is rotated, so a series of images is collected. From
these images, a detailed analysis of the structure of the crystal can be
constructed during post-processing. One result of the data analysis is a
three-dimensional view of the crystal, which is used to study the
properties of the sample.
Question: The detector is named mosaic. Does this
indicate that the picture is already decomposed? Is this
picture later assembled again? If the right decomposition is used,
this might not be necessary.
Figure 1: A more detailed picture of the insertion device.
Figure 2: The experimental setup.
Each image is stored on a mass storage system (RAID disks) so that the
data can be post-processed (Figure 2).
Each image is about 18 MB in size. During an experiment, about 250
to 1000 images are generated and recorded on the mass storage unit. The
detection unit can transfer data at a maximum of about 10 MB/s, so a new
image can be expected about every 1.8 seconds. All attached hardware and
software must therefore be designed to sustain this constant data stream.
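The figures quoted above can be checked with a few lines of arithmetic. The following Python sketch assumes exactly the approximate values given in the text (18 MB per image, 10 MB/s peak transfer rate, 250 to 1000 images per experiment) and takes 1 GB as 1000 MB. The constant and function names are illustrative only and are not part of the SBC software.

```python
# Back-of-the-envelope data rates, using the approximate figures from the text.
IMAGE_SIZE_MB = 18.0        # approximate size of one image
TRANSFER_RATE_MB_S = 10.0   # approximate peak transfer rate of the detector


def seconds_per_image(image_size_mb=IMAGE_SIZE_MB, rate_mb_s=TRANSFER_RATE_MB_S):
    """Time needed to move one image at the given transfer rate."""
    return image_size_mb / rate_mb_s


def experiment_totals(n_images):
    """Return (total volume in MB, minimum transfer time in seconds)."""
    volume_mb = n_images * IMAGE_SIZE_MB
    return volume_mb, volume_mb / TRANSFER_RATE_MB_S


if __name__ == "__main__":
    print(f"one image every {seconds_per_image():.1f} s")
    for n in (250, 1000):
        volume_mb, seconds = experiment_totals(n)
        print(f"{n} images: {volume_mb / 1000:.1f} GB, "
              f"at least {seconds / 60:.1f} min of transfer")
```

Under these assumptions, one experiment produces roughly 4.5 to 18 GB, and the transfer alone takes between 7.5 and 30 minutes at the peak rate.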
To achieve this throughput on the mass storage unit, the images
are striped across multiple RAID arrays.
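The text does not say how the striping is organized. As a minimal sketch of the general idea, the following Python code assumes a simple round-robin layout in which consecutive fixed-size stripe units of one image are written to the arrays in turn. The function names, the number of arrays, and the stripe unit size are hypothetical and chosen only for illustration.

```python
def stripe_layout(image_size, n_arrays, stripe_unit):
    """Assign consecutive stripe units of one image to arrays round-robin.

    Returns a list of (array_index, offset, length) tuples that together
    cover the whole image exactly once.
    """
    layout = []
    offset = 0
    unit = 0
    while offset < image_size:
        length = min(stripe_unit, image_size - offset)
        layout.append((unit % n_arrays, offset, length))
        offset += length
        unit += 1
    return layout


def amount_per_array(layout, n_arrays):
    """Sum up how much of the image each array receives."""
    totals = [0] * n_arrays
    for array_index, _, length in layout:
        totals[array_index] += length
    return totals


if __name__ == "__main__":
    # Hypothetical example: an 18 MB image, 4 arrays, 1 MB stripe units.
    layout = stripe_layout(image_size=18, n_arrays=4, stripe_unit=1)
    print(amount_per_array(layout, n_arrays=4))
```

Because each array receives only a fraction of every image, the write bandwidth required from any single array is correspondingly lower than the rate of the incoming data stream.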
From the computational science point of view, the following observations
are essential for the development and deployment of a high performance
infrastructure supporting the experiments conducted by the experts in
their field:
-
Irreversibility of the experiment: Due to the intensity of the beam,
the crystal sample is likely to be damaged. As a consequence, the
experiment is conducted in a very short time, and the data gathered
by the measuring instruments has to be collected and stored as quickly as possible.
-
Data volume of the experiment: The data gathered from the experiment
is large. Each frame captured during an experiment is about 18 MB in size.
At present, about 250 to 1000 images are gathered for each experiment.
-
Fast data analysis: Because there is a need for real-time data analysis
during the experiment, it is desirable to develop a rapid data analysis
system, involving state-of-the-art hardware and software, in order to steer
the experiment.
-
Post-processing of the data: Post-processing of the data with more elaborate
analysis techniques is necessary to provide the user with information that
cannot be obtained during the experiment because of the high overhead of the
software used.
As a consequence of the observations above, it is clear that the use of
a high performance computing facility, as available at Argonne National
Laboratory, MCS, can help in some of the areas indicated above. Figure
3 shows the network structure used in the experiments. The post-processing
of the data will take place on the SP2, which is connected via an OC-3 ATM
link to the APS experimental setup (e.g., the file server that is connected
to the RAID array where the images are stored).

Figure 3: The network configuration
High Performance Computing
To support the fast development of a functional high performance computing
infrastructure for the experiments conducted at the Structural
Biology Center, the following analysis can be helpful. It can also direct
future software development by the computational science experts
providing the d*trek software.
A four-step approach is suggested, which allows both the rapid
development of a prototype and the software enhancements needed
to address issues such as load imbalance on a massively
parallel computer.
The steps are:
-
The definition of the domain
-
The domain decomposition
-
The mapping of the components to processes
-
The mapping of processes to processors.
Details:
We notice that there are two potential domains. One is defined by the representation
of the image; the other is defined by the structure of the mathematical model or computation
imposed on the three-dimensional image. Thus, we obtain the following domains:
-
Domain:
- The image data domain is defined by the three-dimensional view of the
crystal obtained by combining the images gathered during the experiment.
- The computational domain is defined by the analysis method chosen to
derive the three-dimensional image.
Problem: We do not yet know enough about the algorithm
itself to specify a domain decomposition.
-
Domain decomposition.
Instead of choosing a striped or blocked decomposition of the three-dimensional
domain, we suggest rewriting the software components in such a way that
they can operate on a small three-dimensional cube of the total domain.
This way, the decision as to which processor is responsible
for which part of the domain can be postponed as long as possible. Figure .. shows the suggested
domains. The striped decompositions are a special case of the block decomposition.
A detailed code analysis will reveal which decomposition should be chosen.
We strongly suggest supporting the block decomposition in order to increase
the flexibility of the approach.
-
Mapping the domain to processes. We suggest assigning each component
of the domain its own thread or process. This enables much more flexible
handling when the processes are mapped to processors, which helps to
avoid issues related to load imbalance.
-
Mapping the processes to the processors.
The processes are mapped to available processors. Each processor
can be responsible for one or more components of the domain.
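The steps above can be illustrated with a short Python sketch. It assumes that the three-dimensional domain is a regular index grid, cuts it into blocks of a chosen edge length, treats each block as one process, and maps the processes to processors in round-robin order. The function names and the grid sizes are hypothetical, and the round-robin mapping is only a placeholder for whatever mapping a detailed code analysis suggests.

```python
from itertools import product


def decompose(shape, block):
    """Split a 3-D index domain into blocks of at most `block` cells per axis.

    Each block is a tuple of three half-open (start, stop) index ranges.
    """
    axes = []
    for extent, edge in zip(shape, block):
        axes.append([(start, min(start + edge, extent))
                     for start in range(0, extent, edge)])
    return list(product(*axes))


def cells(block):
    """Number of grid cells contained in one block."""
    count = 1
    for start, stop in block:
        count *= stop - start
    return count


def map_round_robin(blocks, n_processors):
    """Treat block i as process i and place it on processor i mod n."""
    mapping = {p: [] for p in range(n_processors)}
    for i, block in enumerate(blocks):
        mapping[i % n_processors].append(block)
    return mapping


if __name__ == "__main__":
    cubes = decompose((8, 8, 8), (4, 4, 4))    # block decomposition: 8 cubes
    stripes = decompose((8, 8, 8), (8, 8, 2))  # striped: blocks spanning two axes
    print(len(cubes), len(stripes))
    print({p: len(b) for p, b in map_round_robin(cubes, 3).items()})
```

Choosing a block that spans two full axes yields the striped decomposition, which shows why it is a special case of the block decomposition. Because a processor can hold several blocks, the assignment can be changed later without touching the code that works on a single block.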
Load Balancing
Load imbalance can be one of the problems that result in the inefficient
use of a massively parallel computing system.
The phrase "I can be only as fast as my slowest component" is to
be taken literally. The proposed approach is intended to avoid load balancing
problems by using sophisticated but fast mapping and decomposition
algorithms that minimize load imbalance. Dynamic load
balancing algorithms can adapt automatically to the given
problem domain.
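As an illustration of a fast mapping heuristic, the following Python sketch uses the well-known greedy "longest processing time first" rule: components are sorted by estimated cost, and each one is placed on the processor that currently has the smallest load. This is a generic textbook heuristic chosen for illustration, not necessarily the algorithm intended for this project, and the component names and cost values are made up.

```python
import heapq


def greedy_map(costs, n_processors):
    """Map components to processors with the greedy LPT heuristic.

    `costs` maps a component name to its estimated cost.
    Returns (assignment, loads), both keyed by processor index.
    """
    heap = [(0, p) for p in range(n_processors)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_processors)}
    for component in sorted(costs, key=costs.get, reverse=True):
        load, p = heapq.heappop(heap)             # least loaded processor
        assignment[p].append(component)
        heapq.heappush(heap, (load + costs[component], p))
    loads = {p: sum(costs[c] for c in comps) for p, comps in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical cost estimates for five domain components.
    costs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
    assignment, loads = greedy_map(costs, n_processors=2)
    print(assignment)
    print("time of the slowest processor:", max(loads.values()))
```

The largest entry in `loads` corresponds to the "slowest component" of the quoted phrase: the run time of the whole computation is bounded below by the most heavily loaded processor, so a good mapping keeps this maximum small.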
Figure 4: From domain decomposition to mapping.
Experimental Results
Performance measurement of the data transfer between the RAID disk and
the SP2
Potential problems: /tmp is full (shared space); another
experiment uses the connection; Unix priorities.
Table 1: Experimental results as of mm/dd/1997.
Size of data / 18 MB | ftp transfer time in s | compression in s | uncompression in s
1  |  |  |
10 |  |  |
25 |  |  |
50 |  |  |
Transport of data to SP2 at MCS.
ftp compressed
ftp uncompressed (if it makes sense)
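The compression and uncompression columns of Table 1 could be filled in with a small timing harness. The following Python sketch is one possible way to do so. It uses the standard zlib module as a stand-in for whatever compression tool is actually used, and a synthetic buffer as a placeholder for real image files, so its timings say nothing about real diffraction images. The function name is hypothetical.

```python
import time
import zlib


def time_compression(data, level=6):
    """Measure compression and uncompression time for one buffer.

    Returns (compress_seconds, uncompress_seconds, compressed_size).
    """
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    uncompress_s = time.perf_counter() - start

    if unpacked != data:
        raise ValueError("round trip changed the data")
    return compress_s, uncompress_s, len(packed)


if __name__ == "__main__":
    # Placeholder data: 1 MB of a repeating byte pattern, NOT a real image.
    sample = bytes(range(256)) * 4096
    compress_s, uncompress_s, packed_size = time_compression(sample)
    print(f"compression:   {compress_s:.4f} s")
    print(f"uncompression: {uncompress_s:.4f} s")
    print(f"size: {len(sample)} -> {packed_size} bytes")
```

Whether compressing before the ftp transfer pays off depends on whether the time saved on the network exceeds the sum of the two measured times, which is why both columns appear in the table next to the transfer time.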
References
[0] This page is available at http://www.mcs.anl.gov/gregor/projects
SBC Control System
[1] M.L. Westbrook, T.A. Coleman, R.T. Daly, J.W. Pflugrath, Data
Acquisition and Analysis at the Structural Biology Center, in Proceedings
IUCr Computing School, Poly Crystal Book Service, in press, 1996
[2] http://www.sbc.anl.gov, An introduction
to the Structural Biology Center
[3] http://www.sbc.anl.gov/present/sc97.html,
An abstract for a submission to Supercomputing.
[4] [Not available, to be ordered] Methods in Enzymology, Volume 276, Colowick,
Macromolecular Crystallography, Part A. 2 books at $99.00 each
[5] Gregor von Laszewski, Four dimensional data analysis in atmospheric
science and its implications on a metacomputing environment, Ph.D. Thesis,
Syracuse University, Syracuse, DAO at NASA Goddard Space Flight Center,
Greenbelt, MD, 1996