John Michalakes
DRAFT
RSL is a Fortran library and run-time system for efficient and straightforward implementation of finite-difference weather models with nesting on parallel computers. RSL routines specify and dynamically manage efficient decomposition of multiple nested domains over processors and the communication to exchange data within each domain and between domains. The principal features of RSL are:
Under RSL all domains (individual grids) are decomposed and run in parallel over processors. The decomposition is called fine-grained because the unit of the decomposition is an individual grid-point, allowing work to be distributed and if necessary redistributed as evenly as possible over processors to maintain load balance - strictly rectangular decomposition of points is allowed but not required.
Routines for stencil and inter-domain data exchange routines are provided. The run-time system automatically generates communication schedules at run time that pack, communicate, and unpack all data involved in the exchange in a minimum number of messages, reducing the cost associated with message startup. The stencil and inter-domain communication semantics in RSL are also adaptable to shared memory/distribute memory hybrid schemes - a future direction for RSL.
Nested domains may be rectangular (as is conventional with these models) but RSL also supports a new feature - irregularly shaped nested domains. This feature allows a single nest to be fitted closely over an irregularly shaped feature of interest in a simulation. Unlike the conventional method of fitting a number of rectangular overlapping nests over the feature, a single polygonal nest avoids the need for complicated code in the parallel model to deal with the overlap regions. It also avoids duplicated computation in the overlaps.
Existing models may be readily adapted to parallelization using RSL. A parallel version of the Penn State/NCAR Mesoscale Model, MM5, was parallelized using RSL, which was originally developed for this purpose. Additional functionality has been added to further ease the process. For example, early prototype versions of RSL required that models be written as a set of column-callable functions. This was seen as a disadvantage because it requires a moderate to high level of modification. In addition to column callable expressions of the model code, RSL now provides constructs for managing local computation that allow the original loop structures and calling sequences of a model to be retained. This also eases the problem of maintenance if both parallel and non-parallel versions of a code are to be maintained by making the parallel code as close as possible to the source model.
To maintain backward compatibility with Fortran 77 implementations of models, this version of RSL does not allocate memory for model data structures; however, it is fully adapted for use with models that do dynamically allocate data. RSL provides size information so that each processor may can allocate only the amount of data needed locally for model arrays for a given decomposition. Ultimately, a run-time system and library supporting the development of parallel weather models should assume control over the allocation of memory structures used on each processor and the process should be entirely dynamic and transparent. Tighter integration with language support for dynamic memory and derived data types such as exists in Fortran 90 would be beneficial.
The purpose of this paper is to provide an introduction to RSL and its concepts, presenting usage examples where necessary, but deferring detailed information on the library to the reference manual (in progress) and UNIX-style manual pages distributed with the code. Section 2 gives a brief overview of the features of the type of computer model RSL is designed for. Section 3 discusses the issues involved in parallelizing a model and describes RSL's approach. Section 4 walks through the parallelization of a simple code using RSL. Section 5 discusses the implementation of a MM5 using RSL. Section 6 presents conclusions and directions for future work.
Finite difference models of dynamical systems are widespread in atmospheric and other sciences. The models typically consist of a 2- or 3-dimensional gridded domain representing the state of model variables such as velocity, temperature, and pressure, discretized at a given resolution. Most generally, a domain is initialized and then computed forward in time iteratively. Boundary input and model output is performed periodically.
At the beginning of the simulation the model domain is defined in terms of its size, shape, and allocation in memory, and the initial state of the model is established, usually by reading in data from an external source. A second source of model input may be lateral boundary conditions read in from an external source periodically over the course of the simulation. During each time step, the state of the model for the next time step is computed for each grid point by evaluating the state at the point and some stencil of nearest-neighbor grid points;

The exact shape and number of points in a stencil depends on the order of the finite-difference method and on the gridding scheme used. Interpolation will also involved some set of neighboring points and can also be thought of in terms of stencils.
Accurate resolution of weather phenomena improves with scale-appropriate resolution. However, as fineness of resolution increases so does computational cost because of the added number of grid points and the smaller time step. Nesting is used in these models to increase resolution over portions of a domain without incurring the added cost over the entire domain. Nesting is accomplished by positioning a higher resolution domain within a more coarse domain and exchanging information between the two, in the form of forcing and feedback data:
For each iteration over the main time loop, the parent domain advances one time step forward, then data in the region of the nest is transferred from the parent to the nest. The model then iterates several times over the smaller nested domain time steps. Finally, nested domain data is transferred back onto the corresponding region of the parent domain and the next time step commences.
Nested domains may themselves have nests, allowing simulations to reach arbitrarily fine resolutions within the limits of the particular dynamics and physics in the model.
Parallelizing a model on a distributed memory parallel computer involves defining, decomposing, and allocating memory for the model domains, managing the relationship between local and global index information, iteration and computation, interprocessor communication, load balancing, nesting, and I/O. RSL provides support for each of these tasks.
A domain under RSL is a 2-dimensional array of grid-points representing the horizontal discretization of the model and possessing a number of attributes which it acquired through definition, decomposition, and allocation.
Definition of a model domain is a description of the size, shape, and parentage. RSL allows rectangular and non-rectangular (irregular) domains. Size is specified explicitly for a regular domain by rows and columns, or implicitly for an irregular domain by specifying its outline. A domain may be any non-zero size provided it is totally enclosed by its parent domain (in the case of nest) and within the limits of physical memory. A nest is always defined as the child of a parent domain and parentage remains fixed for the duration of the nest. Multiple nested domains may be defined within a parent. There must always be a top-level ``mother'' domain that is defined first and only once. The mother domain is always rectangular (cannot be irregular) and has no parent defined.
Decomposition of a domain maps each grid-cell of the domain to a processor. All domains in a model are defined over the same set of processors. Viewed another way, each processor has a piece of every domain in the model. RSL automatically decomposes domains when they are defined. RSL's default algorithm divides the domains into partitions with as close to equal numbers of points as possible. As a option, the algorithm will adjusting processor boundaries around areas as small as a single point, if necessary. Domains may be redecomposed throughout a model run. The user may specify alternate decomposition algorithms.
Allocation pertains not to the domain itself but rather to the 2- and 3-dimensional arrays that store the state and intermediate variables used in the model. For a given decomposition, the arrays associated with a domain require a certain amount of memory on each processor. RSL does not actually allocate the arrays associated with a domain. Rather, it makes the size information available to the program. This size information may be used to allocate memory dynamically or simply to check that static sizes are large enough for a decomposition.
Implemented on a single processor, a model has a trivial correspondence between model array indices and the coordinates in the model domain: the position of a point in the domain and its array indices are the same and thought of interchangeably. Decomposition breaks this relationship - the index of a point in a local processor's memory is typically not the logical index of the point in the global domain. Therefore, the relationship between the local array indices and logical coordinates must be explicitly established and maintained.
RSL automatically computes and makes available to the program both sets
of indices. The indices in local data structures are used whenever a
local array is referenced in the code. No assumptions can be made by
the program about the actual value of these local indices except that a
point is always adjacent to the points
and
in a given
dimension. A corresponding set of global indices are used for
determining the position of a point within the logical domain: for
example, when testing for proximity with a boundary.
Since a processor computes only the points that are stored locally, there must be a mechanism for keeping track of a processor's local allocation in the parallel code. RSL assumes the responsibility for keeping track of the points that are local on each processor and for directing iteration over those points. A number of mechanisms are provided. RSL may actually control the iteration by applying model routines that the user provides as functional pointers, or it may simply make the partition information available to control iteration that is specified explicitly in the user program.
A model computation may be formulated as a function that is callable for a single point, a slab (a slice of consecutive points), or the entire domain. The function is passed as an argument to an RSL routine that invokes it for each local point, slab, or domain. Such an approach encourages a highly modular expression of a model, but may be less suitable for adapting an existing code to parallel operation.
For existing models when minimizing code changes is a concern, the information RSL uses to control iteration over local points can be extracted and used directly within the existing loop structure. This eliminates the need to redefine the model computation as an RSL-callable function. There is no sacrifice in utility - the system will still support irregular and dynamically remappable decompositions of points to processors - but it it preserves more of the code's original structure, it does not impose requirements on the argument list of the user routine, and it allows stencil exchanges to occur within the user's code rather than above it in the call tree.
Model computations that involve data from neighboring cells or from cells that exist on another domain will require communication if the cells reside on a different processor. To avoid the need for complicated, error prone, and potentially less efficient message passing code in the model, RSL provides high-level communication mechanisms for handling the types of data dependency found in finite-difference models with nests. The stencil provides intra-domain communication for finite-differencing and interpolation. The broadcast-merge provides communication for exchanging data between domains for nesting.
Intra-domain communication resolves the nearest-neighbor data dependencies associated with finite-differencing and horizontal interpolation. The set of neighboring points that have data needed for a computation is called a stencil. Under RSL, stencils are defined by specifying the points of the stencil and the fields (model variables) that should be exchanged on each of the points. Stencils are used in stencil exchanges: transfers of data from remotely stored points into extra cells of the local array that have been allocated around the partition. This padding is known as the ``halo'' or ``ghost'' region of an array. RSL automatically determines the size and shape of the ghost region for each defined stencil. During a stencil exchange, the needed data is automatically buffered on the sender and unbuffered on the receiver so that each stencil exchange involves only one message sent and one message received for each processor pair in the exchange, minimizing the latency cost of the transfer.
Inter-domain communication transfers the forcing or feedback data between a parent domain and a nest. At the time a nest is created, RSL establishes a link between each parent domain point and the points in the nest it overlays (Figure 1). The links are logical and do not depend on what processor a parent or nested domain point resides on. Downward forcing, from parent to nest, involves a logical ``broadcast'' from a parent domain point to the nest points that are linked to it. Upward forcing involves a ``merge'' along the same links but in the opposite direction.
Incidentally, RSL permits the ratio of nested to parent points to vary in
each horizontal dimension (but always ).
Load imbalance occurs when some processors have more work to do than others. Processors that finish first idle, reducing performance relative to the ideal (in which all processors are kept busy). A measure of this is efficiency, defined as the ratio of actual performance to ideal performance. Inefficiency from load imbalance may result from an uneven initial distribution of domain points to processors - especially if the number of processors does not evenly divide the number of rows or columns; from reduced amounts of work in the boundary points of a domain; or from dynamic conditions in the simulation itself that cause computations to be performed in some sections of the domain that are not performed in others. RSL addresses this problem by supporting optimal decompositions of points to processors whether or not the decomposition results in rectangular partitions, and by providing a mechanism for distributing and redistributing domain cells between processors. Implementing irregularly shaped processor decompositions would be prohibitively complicated in an explicit message-passing code or using High Performance Fortran, which supports only regular decompositions of work to processors. However, RSL supports this automatically, transparently, and with little additional overhead.
Models that support nested domains may also allow multiple overlapping nests so that a user can overlay a number of rectangular nests to closely fit a feature of interest in the simulation, such as a weather front. RSL supports multiple domains on a nest level, but the user can avoid writing complicated "overlap" code by specifying, instead, a single non-rectangular nested domain that is shaped to fit the feature of interest (Figure 2).
Reading data from a serial data set onto distributed domains and outputing distributed data out to serial data sets requires communication between processors and may also introduce a serial bottleneck in the parallel code. RSL provides routines that read and write model Fortran records from and to sequential Fortran data sets, automating the distribution of array elements to processors on input and the collection of array elements from processors on output. The parallel implementation of MM5, for example, is able to read and write serial MM5 data sets.
Although RSL manages the complicated task of decomposing serial input and recomposing serial output on-the-fly, the mechanism employed is currently ``single reader, single writer''; that is, one processor reads and writes the data to files and sends and receives messages to the other processors. This aspect of the system is presently non-scalable, a presents a disadvantage. However, atmospheric codes such as MM5 generate output at a relatively low frequency, typically once every one or two hundred time steps, so that this has yet to be a serious problem in our work with MM5. Implementing a scalable yet portable solution to parallel I/O is an issue that will be addressed in future implementations of RSL.
The previous discussion described the the problems associated with parallelizing finite-difference weather codes with nesting and RSL's approach to addressing these. In this section, a simple program is parallelized using RSL. The purpose of the section is to introduce some of the main routines and discuss their usage, though not exhaustively - a detailed description of the library is left to the appendices. Even in stripped-down form, the size and complexity of a weather model makes it unsuitable as an example. Instead, a simple grid-boundary relaxation program is used. The basic algorithm is:
Despite the fact that this is a toy problem, it presents most of the
basic problems for parallelization that one finds in a finite difference
weather code.
Each iteration of the model computes the value at each point as
the average of
its eight neighbors. Thus, when decomposed, some data necessary for the
local computation located on a different processor and
a stencil exchange must be defined and executed
before each successive phase of the computation.
To simulate the inter-domain communication in a model with a nested domain, the example has a nested grid. The nest is defined with an irregular boundary to demonstrate this capability in RSL. The complete algorithm with the nested communication is as follows:
After each relaxation, mother domain cells
transfer their values of to the underlying
nested domain points if those points lie on the boundary of the nest.
This is equivalent to the transfer of atmospheric variables
to update the boundaries of a nested domain. The relaxation
computation is performed several times on the nest,
then nested data is fed back onto the parent domain and the next major
time step commences.
Main sections of the program are detailed below. The complete Fortran 77 and Fortran 90 texts of the relaxation code are provided as appendices.
The main program is declared and the header file rsl.inc is included in Figure 3. Here, a cpp directive is used, but a standard Fortran INCLUDE also works. The RSL library must be initialized before it can be used. The number of processors decomposing the two horizontal dimensions of the domains are specified. The first number times the second is the total number of processors.
RSL is given a description of the domains that it will be working with. Once described, a domain becomes active and an integer descriptor is returned for use when performing computation, stencil exchanges, opening nests, remapping, and other operations on the domain. The top level ``mother'' domain must be specified. Nests may be opened in any domain that is currently active at any time.
One domain must always be defined in a model. The code in Figure 4 declares this top-level mother domain for the relaxation program. M and n are integer variables giving the global number of rows and columns of the 2-dimensional domain. Although nested domains may be irregularly shaped, the mother domain is always rectangular. The definition of the domain also results in its automatic decomposition. Here, the default decomposition algorithm in RSL is used. The decomposition generated for four processors is shown in Figure 5. The decomposition of the nested domain is shown in 7.
On return, RSL_MOTHER_DOMAINsets the integer arguments mloc and nloc, specifying the minimum local array size that will hold arrays associated with the decomposed domain. The values of mloc and nloc may differ on each processor. If the calling program uses dynamic memory allocation, such as is available in Fortran 90, the sizes can be used to allocate the model arrays. If not, the sizes may still be used to check that statically allocated arrays are large enough for the decomposition.
The value of RSL_8PT, defined in the rsl.inc, tells RSL what the largest stencil will be. In this case, an 8-point stencil is specified ( 9-points if one counts the center point). On return, the integer variable did is set to the value of an RSL handle that refers to the domain just declared.
The relaxation problem in the example also specifies a nested domain. The call to RSL_SPAWN_IRREG_NESTin Figure 4 defines a nested domain and positions it within the coarse domain. The resolution of the nest is three times that of the coarse domain in each of the two horizontal dimensions, associating nine nested cells with each coarse domain cell in the region of the nest.
The integer argument did specifies the parent domain. On return, the integer argument nid is set to a handle that will be used when referring to the nest in subsequent calls to RSL.
If the nest were rectangular, it could be simply specified
by its height and width and the coordinates of a corner using a
different routine: RSL_SPAWN_REGULAR_NEST. However, the domain
is polygonal so its
shape is
specified by defining its outline.
Xlist and ylist are the and
coarse-domain
coordinates of the outline. The next argument is a count of the number
of elements in each
of the lists. Although the lists are hard-coded here,
the outline coordinates could also
come from a file or an interactive window that allowed
the user to draw the nest's shape using a mouse.
Since the nesting ratio in each
dimension is three, the dimensions of the nest in each dimension must
be a multiple of 3, which may result in
iteration over parts of the nest that do not really exist. Therefore,
a number of cells
may be trimmed from each dimension by specifying a trim in
each dimension. Trims of 2 and 2 are specified here.
The trim may be in the range of integers zero through the
nesting ratio minus one.
Cells are always trimmed
from the high end of a dimension. In this case, the untrimmed
length of and
in the nest is reduced to
and
. The resulting parent and nested domains are shown in Figure
6
As with the mother domain, the nest is automatically decomposed when it is instantiated. The routine returns the required memory size for the local partition (mloc_n and nloc_n) and the global size of the nest as if it were rectangular. In other words, m_n and n_n returned above specify the size of the rectangle that would totally enclose the nest, even though the nest's shape may not cover every point of that rectangle.
The relaxation computation uses the eight neighboring values of the
variable for each non-boundary cell of the domain. Since the
horizontal dimensions of the domain are decomposed, some needed data
will lie on other processors. Therefore, we define an 8-point stencil
that can be used by the routine RSL_EXCH_STENCILexchange data
prior to the computation.
RSL stencils are comprised of messages, which are comprised of
one or more field descriptions. A stencil for the variable
is
defined in Figure 8.
Building a stencil involves specifying the messages that will be required for each point of the stencil. Therefore, the messages are constructed first.
The call to RSL_CREATE_MESSAGEcreates a temporary message descriptor that can be built up by adding field descriptions. The messages descriptor stays in effect until it is used in the description of the stencil, after which point the message descriptor becomes undefined. The call to RSL_BUILD_MESSAGEadds the 2-dimensional array
two or three. The last three arguments, decomp, glen, and llen, are integer arrays that specify how the dimensions are decomposed, the global or logical length of the dimensions, and the local declared lengths of the dimensions. The arrays are set at the top of the figure. The lowest numbered index is most minor.
Once the messages for the stencil are built (in this case there is only one), the messages are assigned to the stencil points by first assigning them to an integer array whose length is equal to the number of stencil points, then using the array to describe the stencil in a call to RSL_DESCRIBE_STENCIL. The stencil descriptor being described must first have been created with a call to RSL_CREATE_STENCIL.
Stencil points are always numbered from top-to-bottom and left-to-right. Thus, for an 8-point stencil, index 1 in the points array is the descriptor for the ``northwest'' message. Index 8 is the descriptor for the ``southeast'' message. The center point of the stencil is never specified. This example shows the same message being assigned to each point of the stencil. However, different messages may be assigned to different stencil points. If a stencil point is not to be used, the corresponding array element should be assigned the value RSL_INVALID, instead.
The stencil is described and associated with a domain (in this case, the mother domain, did) with the call to RSL_DESCRIBE_STENCIL. Stencils are specific to the domain for which they are described (this is different from the earlier prototype version of RSL). Therefore, a stencil (and the messages that make up the stencil points) must be described from scratch for each new domain. This is not burdensome, because the code to do this can be easily generalized to a routine callable for an arbitrary domain.
It is no longer necessary to explicitly compile stencils before they can be used in exchanges. RSL now does this automatically at the point of first use in an exchange (RSL_EXCH_STEN). The cost for compiling the communication schedule is paid only once and the schedule remains in effect until the domain is re-decomposed. Re-decomposing the domain triggers one new compilation of the stencil communication schedule on next first use.
The first computational part of the program is iteration
over in the mother domain, setting its elements to initial values.
In the example, boundary cells
are given initial values of 10.0. Interior cells are set to 0.0.
The code is shown in Figure 9, and makes use of two loops that iterate over specified ranges of the logical domain. The first loop sets all cells to 10.0; the second loop sets the interior to 0.0. Another way to handle boundary points separately from the interior is to code a single loop with a conditional in the body. This is done in the relaxation computation later in the example.
The RSL mechanism that allows each processor to iterate only over the points it has stored locally is initialized and used in the figure. The call to RSL_GET_RUN_INFOinitializes RSL_MAJOR_LOOP_BOUNDEDand RSL_MINOR_LOOP_BOUNDED, which take the place of DO statements in a serial code. The first argument to the constructs is the name of the local index variable, followed by the upper and lower ranges of the loop. Other forms are provided for iteration and boundary treatment of irregularly shaped domains.
Computation of one relaxation step for either domain, mother or nest, involves iterating over the local points on each processor and computing the average value of the eight neighbors, except on the boundaries, which are held fixed. The code is shown in Figure (10).
The first call is to RSL_GET_RUN_INFO, which initializes the loop
mechanism, allowing iteration over the local points local on each
processor.
Next, RSL_GET_BDY_LARRAYinitializes the
integer array bdyinfo with
boundary proximity information
Bdyinfo is a 3-dimensional array. The first two dimensions
correspond to the horizontal dimensions of the domain and are indexed
by the local indices and
to access the proximity information
for a given point.
The third dimension contains the proximity information.
Element one, bdyinfo(i,j,1), provides the distance
to the closest boundary for the point
. Element two,
bdyinfo(i,j,2), provides the direction to the boundary (RSL_MLOW,
RSL_MHIGH, RSL_NLOW, or RSL_NHIGH).
Using this information, code intended for the interior or for the boundary
of a domain can be executed conditionally within the body of a loop.
(Recall that boundary treatment in parallel models must allow for the
fact that
local and global indices in a decomposed domain are distinct.)
Before the relaxation, a stencil exchange is performed. RSL_EXCH_STENCILis called with id set to the identifier of the domain (did or nid) and sten set to the stencil descriptor that was defined earlier for the domain. The exchange ensures that the data for every point needed in the average calculation will be on-board, either locally or in the surrounding ghost-region of the local array.
Each processor iterates over its local set of points. Points on the boundary are held fixed; averages are computed on interior points. The boundary test is a conditional involving the bdyinfo array: points that are on the boundary have a distance of zero from the boundary, whatever shape that boundary happens to be.
Following the boundary calculation in the first set of nested loops,
the variable is updated in a second set of loops. Note that unlike
the the first loop body, the second loop body for the update has
no horizontal dependencies. Therefore, no prior stencil exchange is
required.
The exchange of data from a parent domain to a nest or and the reverse are accomplished in two phases. The first phase is a packing operation on the sending domain followed and the send of the data. The second phase is unpacking data on the receiving domain. Figures 11 and 12 show the two phases of the exchange of forcing data from the parent domain to the nest prior to the start of the nested steps.
The packing in Figure 11 is implemented as a conditional
loop that performs one iteration for each point in the nested domain,
specified by nid.
Each time RSL_TO_CHILD_INFOis called, it returns with the global
coordinates
of the nested point (nig, njg), the local and global indices of
the forcing parent domain point (i,j, pig, pjg), and the index
of the nested point (cm, cn) within the set of nested points linked to
the coarse domain point (the indexing runs the same as the domain indexing
in both dimensions - that is, cm is minor, and run from ``bottom to top'',
as the index does). When there are no more points, it returns
a value of -1 in the integer argument retval.
The call to RSL_GET_BDY_GPT(``get boundary information for a globally
indexed point'') is used to determine boundary proximity for the
nested point nig, njg. On return, the two-element integer
array bdyinfo will have the first element set to the distance to the
nearest boundary for the point (recall this is an irregularly shaped
domain) and second element set to the direction to the boundary (RSL_MLOW,
RSL_MHIGH, RSL_NLOW, or RSL_NHIGH). If the nested cell is on
a boundary (bdyinfo(1) is zero), RSL_TO_CHILD_MSGis called
to pack the value of X(i,j), the parent copy of the variable ,
into the message destined for point nig, njg on the nest.
In this example, only one variable is being sent; however,
RSL_TO_CHILD_MSGmay be called as many times as necessary within
the loop body to send data from different parent arrays to the nested point.
After the completion of the conditional loop in Phase I, the messages are
actually sent between processors with the call to RSL_BCAST_MSGS.
Even though a packing iteration occurs for every nest point, data is actually packed and a message sent only for points for which bdyinfo(1) is zero. Thus, the size of the buffers being sent and received during the call to RSL_BCAST_MSGSare actually quite small. Like the intra-domain stencil exchanges, communication schedules are used to ensure that the exchange entails only one message for each processor receiving data and the schedules are compiled only once on first-use.
In Phase II of the broadcast, data from the parent domain cells is unpacked
onto the nest. Again, a conditional loop is used, this time with calls
to RSL_FROM_PARENT_INFO(Figure 12). On each iteration
of the loop, RSL_FROM_PARENT_INFOreturns the local and global
indices (i, j, nig, njg) of a nested point that data has arrived for
and the global indices of the sending parent cell (pig, pjg).
When there are no more points to unpack, RSL_FROM_PARENT_INFOwith
a value of -1 in retval. Within each iteration, RSL_FROM_PARENT_MSGis called to unpack the data into the nested data array, in this case
the nested copy of the variable . RSL_FROM_PARENT_MSGonce for
each time RSL_TO_CHILD_MSGwas called during Phase I, and in the same
order.
The upward forcing of nested data to the parent domain at the end of the set of nested iterations is similar in structure to the downward forcing described here. Different packing and unpacking routines are called, and instead of RSL_BCAST_MSGS, RSL_MERGE_MSGSis used.
Domains are automatically decomposed when they are defined. However, the user may redefine the decomposition throughout the simulation by specifying a new decomposition in the form of a function that RSL can call to reassign points to different processors. A remapping is done for a single domain, and has no effect on other domains in the simulation.
The simplest remapping situation is if the data structures for a domain have not yet been allocated and initialized. In this case, the remapping is accomplished using RSL_FDECOMPOSE:
rsl_fdecompose( d, fcn, nproc_m, nproc_n, info, mloc, nloc )
The first argument is the integer RSL domain handle. The fcn
argument is a user supplied function that RSL will use to determine
a new processor assignment for each point in the domain. Nproc_m
and nproc_n are the numbers of processors decomposing
and
, respectively. The info argument may
be used to pass information, such as per-processor timing information,
to the mapping function for use in determining the new decomposition.
Mloc and nloc are integers
that, on return, contain the
minimum local array size for
the newly decomposed domain.
The remapping can be performed even after the arrays have been allocated and filled with domain state data. This involves reallocating the data structures (since the local size requirements may change) and moving data within and between processors to conform with the layout specified in the new decomposition. This movement is accomplished using a specially defined RSL state message and the routine RSL_REMAP_STATEFigure 13.
RSL uses a
state message to know how to pack and unpack fields when
moving points between processors. The variable dstate
in the figure is a Fortran90 structure containing pointers to and
other dynamically allocated model state arrays. The variable
tmp is an identical structure that is used to hold the
new state while the remapping from the orignal state is underway.
Once the remapping is complete the new state in tmp becomes the
state stored in dstate by means of Fortran90 pointer assignments
in a user-supplied routine, move_state.
Fortran90 is used to simplify the management of the user data structures involved in the example remapping. Obviously, pointers, dynamically allocatable arrays, and derived types (structures) will simplify the task in the user's code as well. However, the ramapping can be accomplished somewhat less easily within statically allocated data structures under Fortran77. The RSL mechanism is identical in either case.
The main loop of the example is analogous to the time loop in a
weather model. On each iteration, the domain and its nests are
brought forward one increment in time, a process requiring multiple
iterations on the higher resolution nested domains for each step
on the parent.
After a given number of iterations, the model writes the contents of
the mother and nest domains to separate files. Each array is output
as a separate unformatted Fortran record containing by
floating
point values (
-minor). The code is show in Figure 14.
The output is performed by the calls to RSL_WRITE. The first argument
is the Fortran unit number of the data set. This is followed by the
data description flag IO2D, defined in the rsl header file rsl.inc.
IO2D specifies that the array being written
is 2-dimensional. Implicit is the assumption that both dimensions
are horizontal and decomposed, and that the minor dimension of the
array corresponds to the minor dimension of the domain. That is,
in the domain and
in the array
are both minor. The next
argument is the
array itself, followed by the domain descriptor.
RSL_REALis a tag specifying the type of the array. The arrays
glen and llen specify the global and local sizes of the
arrays (glen_n and llen_n store this information for the
nest in this example). The first element of each array corresponds
to the minor dimension (
), followed by the major dimension. The
size of the array as allocated on a given processor is llen(1)
by llen(2) (the size is allowed to vary from processor to processor).
The arrays as written to records in the file will be glen(1) by
glen(2).
The relaxation code described in this section was coded and run 50 steps on multiple processors of an IBM SP2, generating output every five steps. Plots of the output data after the full 50 steps on the coarse domain and the nest are shown in Figure 15.
MM5 is a regional weather model for prediction on domains ranging
from several thousand kilometers down to several hundred (or fewer) kilometers.
The model may be run at a resolution of as low as 1 km. Domains are
uniform rectangular grids representing three-dimensional regions of
the atmosphere. The horizontal coordinate system is equally spaced
geographically and the model uses the Arakawa-B gridding scheme. The
vertical coordinate system is surfaces, with layers distributed
more closely nearer the surface (23 layers in the current model).
For this implementation, atmospheric
dynamics is nonhydrostatic and uses finite-difference approximation.
Physics includes the Blackadar high-resolution planetary boundary layer
scheme, the Grell cumulus scheme, explicit moisture with treatment
of mixed-phase processes (ice), shallow convection, dry convective
adjustment, and the Dudhia long- and short-wave radiation
scheme [1].
MM5 was parallelized using the first generation of RSL and was first converted to column-callable form before parallelization using RSL. MPMM, as the parallel version is called (for Massively Parallel Mesoscale Model), was validated and benchmarked on the IBM SP1 and SP2. It has since been ported to the Intel Paragon and other platforms, including the Fujutsu AP1000 at Australia National University. Performance of 1.2 Gflops has been generated for a single domain problem running on 64 SP1 processors. A smaller, 14 node, SP2 configuration Additional information regarding the parallel MM5 is available in [2] and on the World Wide Web at the location http://www.mcs.anl.gov/Projects/mpmm/index.html.
Parallelizing MM5 provided a number of insights and suggested improvements that have been incorporated into the latest version of RSL. These include the explicit loop constructs that allowed relaxation of column callability, support for dynamic memory allocation at the option of the programmer, removal from RSL of the need to explicitly compile stencils and broadcast-merges before use (they are now compiled automatically on first use), and the automatic and default decomposition of domains.
RSL provides a system whereby an existing or new weather model can be implemented on a parallel machine with support for dynamic, fine grain parallel decomposition of multiple nested domains and with high-level communication routines that hide and efficiently implement message passing. Improvements to the original version of RSL have also reduced the number of changes to an existing code, simplifying initial implementation, ongoing maintenance, and effectively removing a source of architecture dependent coding from the model. New features and enhancements to the system will include:
Porting to other parallel machines such as the Cray T3D is already simplified because RSL can be implemented atop the MPI standard message passing layer. The I/O mechanism of RSL is currently a potential obstacle to scalability and must be addressed in a portable manner.
Hybrid parallelism is the mixing of shared and distributed memory approaches in a single architecture: a set of distributed memories representing the computational nodes of a message passing computer but on each of which multiple processing elements operate in shared memory mode. The semantics embodied in the computational and communication structures already in RSL will support this if the underlying mechanism is available.
A fully Fortran-90 implementation has the advantage of incorporating derived types and dynamic memory allocation directly into RSL, allowing it to provide additional levels of support for managing multiple parallel domains in separate processor memories. One might, for example, construct must of the underlying structure of a parallel model simply by specifying to the run time system a list of variables and their logical dimensions, giving the run-time system complete control over their allocation and reallocation, further releiving the programmer of the task of explicitly managing this (admittedly, this feature would be more useful for a new code than an existing code).
The long-range benefit to ceding this kind of control to a run-time system is that it hides more and more of the implementation details and makes the top level description of the program increasingly architecturally independent. There will be a great benefit to the scientific community if a single expression of a weather model can be made to run equally well on vector, RISC, shared, and distributed memory parallel architectures. The goal will prove elusive in the foreseeable future for the general class of scientific applications precisely because of their ambitious scope. However, a focused approach that addressed the needs and requirements of a particular type of application - in this case, grid-based weather and related codes - has a good chance of success in the near term. Elements of a total system will include source translation and run-time system technology augmenting basic compiler technology. Therefore, the run-time system library approach of RSL represents a step in the direction toward architecture independence.
The following is a Fortran-90 implementation of the example code in Section 4. Each model domains is represented as a Fortran-90 derived type and the nesting hierarchy is implemented as pointers between these structures. The main computational part of the code is a recursive descent of this tree for each time step. The number of processors decomposing the code and the size of all data model data structures are defined at run-time and dynamically allocated. The data for Figure 15 was generated using this program on 4 processors of an IBM SP2 computer.
Implementation note: the IBM Fortran compiler is already F90 compliant and supports recursion. However, stack allocation of local variables may not be the default at your installation. Therefore, compile the code with -qnosave.
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC FORTRAN 90 MODULE DEFINITION FOR A 'DOMAIN'
CC
CC User program definition of domain data structure and
CC manipulation routines.
CC
module domains_module
integer :: maxkids ! maximum number of children
parameter ( maxkids = 5 )
C
C Domainstruct is the principal domain data structure in the user
C program. It contains the size of the domain in local
C memory and logically. Also pointers to parent and child domains,
C if any. Also the RSL domain handle. Also, the domain state
C arrays themselves (unallocated, so virtually no storage expended
C unless the domain is actually used).
C
type domainstruct
C
C This section has "meta" information about the domain.
C
type( domainstruct ), pointer ::
$ parent, ! parent domain
$ child(:) ! children
integer nkids ! number of (active) children
integer nestlevel ! nestlevel of this domain
integer domdesc ! RSL domain handle
logical active ! flag
integer m, mloc ! global and local dimensions in m
integer n, nloc ! global and local dimensions in n
integer sten ! stencil descriptor
C
C This section has the state arrays for the domain.
C
real, pointer :: X(:,:)
endtype domainstruct
C
C Declaration of the top-level domain (all others are children of this one)
C
type ( domainstruct ), target :: mother
C
C Domain manipulation routines
C
contains
C
C Initialize a domain data structure
C
subroutine init_domain( d )
type(domainstruct) :: d
d%active = .false.
d%m = 0
d%n = 0
end subroutine
C
C Allocate the fields of the domain (including state arrays). Once
C this is called, the domain will take more than nominal storage.
C
subroutine allocate_domain( d, m, n, mloc, nloc )
type(domainstruct) :: d ! domain structure
integer m, n, mloc, nloc ! global and local sizes
integer k ! child index
if ( d%active ) then
write(0,*) 'allocate_domain: domain already active.'
stop
endif
C
C Create and initialize structures for future children. Note: only
C the child domain structures and not the child state arrays are
C being allocated here -- thus, the children require only nominal
C storage until actually activated and allocated.
C
allocate (d%child(maxkids))
do k = 1, maxkids
call init_domain( d%child(k) )
enddo
C
C Set the size information about this domain
C
d%m = m
d%n = n
d%mloc = mloc
d%nloc = nloc
d%active = .true.
d%nkids = 0
C
C Allocate storage for state arrays.
C
allocate (d%X(mloc,nloc))
end subroutine
end module
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC MAIN PROGRAM
CC
CC This is the main routine for a Fortran-90/RSL implementation of
CC a simple relaxation problem on a grid, with an irregularly shaped
CC nested domain.
CC
CC The bulk of the main routine, here, is involved with defining
CC and allocating the problem, which can be done entirely at run-time
CC by exploiting modern features of F90 such as derived data types
CC and dynamic memory allocation.
CC
CC The model itself is is executed in one call to another routine,
CC ITERATE_MODEL
CC at the bottom of this main program.
CC
program relaxmain
use domains_module
implicit none
type(domainstruct), pointer :: d
#include "rsl.inc"
integer m,n,mloc,nloc,did ! coarse domain params
integer m_n,n_n,mloc_n,nloc_n,nid ! nested domain params
integer mtrim, ntrim ! trim for the nest
integer xlist(20), ylist(20), npoints ! outline of nest
integer iter ! number of iterations to perform
integer k ! child index
integer nproc_m, nproc_n ! number of procs in m, n
C
C Input run-time problem configuration information
C
read(7,*)m,n ! mother domain size specif. at run time
read(7,*)iter ! number of mother domain iterations
read(7,*)nproc_m, nproc_n ! number of processors specif. at run time
C
C Initialize RSL
C
call rsl_initialize( nproc_m, nproc_n )
C
C Mother domain. Note that the RSL routine is called before the
C domain is allocated. This allows the local state arrays to be
C allocated only as large as necessary (using the size information
C returned by rsl_mother_domain in mloc and nloc).
C
call init_domain(mother)
C
call rsl_mother_domain(
$ did, ! output: RSL domain handle
$ RSL_8PT, ! input: max stencil
$ m, n, ! input: global size
$ mloc, nloc ) ! output: local size
C
call allocate_domain( mother, m, n, mloc, nloc )
mother%domdesc = did ! store RSL handle in domain
mother%nestlevel = 1 ! nest level of mother is 1
C
C Specify the outline of the nest in coarse domain coordinates
C
xlist(1) = 4 ; ylist(1) = 6
xlist(2) = 4 ; ylist(2) = 11
xlist(3) = 7 ; ylist(3) = 11
xlist(4) = 10 ; ylist(4) = 14
xlist(5) = 15 ; ylist(5) = 14
xlist(6) = 15 ; ylist(6) = 9
xlist(7) = 10 ; ylist(7) = 4
xlist(8) = 6 ; ylist(8) = 4
xlist(9) = 4 ; ylist(9) = 6
npoints = 9
C
C Specify the "trim" for the nest
C
mtrim = 2
ntrim = 2
C
C Spawn an irregular nest using the outline and trim information
C specified above.
C
call rsl_spawn_irreg_nest(
$ nid, ! output: nest handle
$ mother%domdesc, ! input: parent handle
$ RSL_8PT, ! input: max stencil
$ xlist, ylist, npoints, ! input: domain outline
$ 3, 3, ! input: nesting ratios
$ mtrim, ntrim, ! input: trim effective size
$ mloc_n, nloc_n, ! output: local memory size
$ m_n, n_n ) ! output: global size
C
C Allocate using local size information returned by RSL.
C
call allocate_domain( mother%child(1), m_n, n_n, mloc_n, nloc_n )
C
C Set fields in domain structures associating nest with parent
C
mother%nkids = 1 ! mother has one nest
mother%child(1)%parent => mother ! back pointer to parent
mother%child(1)%domdesc = nid ! store RSL handle
mother%child(1)%nestlevel = mother%nestlevel+1
C
C Define the stencil communications on the mother and all subnests
C
call define_data( mother ) ! user routine: recursive
C
C Initialize the interior of the mother and the boundary cells
C
call init_grid( mother, mloc, nloc, 1 )
C
C Initialize the nests with data from the mother. (There is only
C one nest in this example, but this code allows for more).
C
do k = 1, mother%nkids
call initial_nest_data( mother, mother%child(k) )
enddo
C
C Write initial state of the model
C
call output_domains( mother ) ! recursive
C
C Execute 'iter' iterations of the simulation. The following call
C executes on the mother and recursively on all nests.
C
call iterate_model( mother, 1, iter ) ! main time loop (recursive)
C
stop
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC ITERATE_MODEL
CC
CC Main computational routine -- loop over time, iteration over mother
CC and all subnests, and control of inter-domain data exchanges.
CC
recursive subroutine iterate_model( d, iter1, itern )
use domains_module
implicit none
#include "rsl.inc"
type(domainstruct) :: d ! input: domain
integer iter1, itern ! input: starting and ending steps
integer t ! local: time step on this domain
integer k ! local: child index
do t = iter1, itern
call relax_grid( d, d%mloc, d%nloc, 1 ) ! compute this domain
do k = 1, d%nkids ! for each nest...
call force_domain( d, d%child(k) ) ! force
call iterate_model( d%child(k), 1, 3 ) ! RECURSIVE CALL
call merge_domain( d, d%child(k) ) ! feedback
enddo
if ( d%nestlevel .eq. 1 .and.
$ ((mod(t-1, 5) .eq. 0) .or. (t .eq. itern))
$ ) then
call output_domains( d ) ! if time, output
endif
enddo
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC RELAX_GRID
CC
CC This is the main computational routine that is called for one step
CC on a domain. The new value for each point is computed as the
CC average of the values of 8 neighbors. Prior to the computation,
CC a stencil exchange is performed to ensure that off-processor data
CC will be available for the computation.
CC
subroutine relax_grid( d, ilen, jlen, klen )
use domains_module
implicit none
#include "LoopMacros.inc"
type(domainstruct) :: d ! input: domain being computed
integer ilen, ! input: local array sizes in m
$ jlen, ! input: local array sizes in n
$ klen ! input: local array sizes in vertical
RSL_DECLARE_RUN_VARS( jlen ) ! declares looping information
integer m, n, i, j ! local: misc info
real New( ilen, jlen ) ! local: temporary array
integer bdyinfo(ilen,jlen,2) ! local: proximity information
real, pointer :: X(:,:) ! local: state array pointer
C
C Initialize RSL loop constructs
C
call rsl_get_run_info( d%domdesc, jlen, RSL_RUN_VARS )
C
C Get boundary proximity information for each cell from RSL.
C
call rsl_get_bdy_larray( d%domdesc, bdyinfo )
C
C Exchange data with other processors on this domain using the stencil
C information defined for and stored with the domain.
C
call rsl_exch_stencil( d%domdesc, d%sten )
C
C Set pointer to domain state array. Size info.
C
X => d%X
m = d%m
n = d%n
C
C Main loop over horizontal dimensions of partition of array that
C is stored on local processor. If a boundary cell, hold fixed,
C otherwise compute average. Boundary cells are those with a boundary
C proximity of zero (i.e. they are zero cells away from a boundary).
C
RSL_MAJOR_LOOP(j)
RSL_MINOR_LOOP(i)
if ( bdyinfo(i,j,1) .eq. 0 ) then
New(i,j) = X(i,j)
else
New(i,j) = (
$ X(i+1,j-1) + X(i+1,j) + X(i+1,j+1) +
$ X(i,j-1) + X(i,j+1) +
$ X(i-1,j-1) + X(i-1,j) + X(i-1,j+1)
$ ) / 8.0
endif
RSL_END_MINOR_LOOP
RSL_END_MAJOR_LOOP
C
C Update X.
C
RSL_MAJOR_LOOP(j)
RSL_MINOR_LOOP(i)
X(i,j) = New(i,j)
RSL_END_MINOR_LOOP
RSL_END_MAJOR_LOOP
C
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC OUTPUT_DOMAINS
CC
CC Called initially and periodically. Outputs state of model on
CC to separate files for each nest level.
CC
recursive subroutine output_domains( d )
use domains_module
implicit none
#include "rsl.inc"
type(domainstruct) :: d ! input: domain
integer k ! local: child index
integer glen(2), llen(2) ! local: size arrays
C
glen(1) = d%m
glen(2) = d%n
llen(1) = d%mloc
llen(2) = d%nloc
C
C Output top level domain
C
call rsl_write( 18+d%nestlevel-1, ! Fortran unit for output
$ IO2D_IJ, ! describe record
$ d%X, ! data for record
$ d%domdesc, ! domain descriptor
$ RSL_REAL, ! type of each element
$ glen, llen ) ! size info
C
C Foreach nest, output it and its subnests recursively.
C
do k = 1, d%nkids
call output_domains( d%child(k) ) ! recurse
enddo
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC INITIAL_NEST_DATA
CC
CC Set the cells in the nest to values transferred down from the parent.
CC This is called to initialize the nest.
CC
subroutine initial_nest_data ( d, nst )
use domains_module
implicit none
#include "rsl.inc"
type(domainstruct) :: d, nst ! input: parent and nest
integer pi, pj, pig, pjg ! local: parent indices
integer ni, nj, nig, njg ! local: nest indices
integer m, n, i, j, msize ! local: misc variables
integer cm, cn ! local: relative nest index
integer retval ! local: return value
real, pointer :: X(:,:) ! local: pointer to state array
C
C Point to parent's state arrays.
C
X => d%X
C
C Build a message for each point on the nest using data from the
C overlying cell in the parent domain. Loop goes until we have handled
C data from all the parent domain points on this processor.
C
call rsl_to_child_info( d%domdesc, nst%domdesc, 4,
$ i, j, pig, pjg, cm, cn, nig, njg, retval )
do while ( retval .eq. 1 )
call rsl_to_child_msg( 4, X(i,j) )
call rsl_to_child_info( d%domdesc, nst%domdesc, 4,
$ i, j, pig, pjg, cm, cn, nig, njg, retval )
enddo
C
C Exchange the data using RSL inter-domain communication.
C
call rsl_bcast_msgs
C
C Now, point to nest state data
C
X => nst%X
C
C Unpack the message on each point of the nest. Loop goes until we
C have unpacked all the nested domain points that are local to this
C processor.
C
call rsl_from_parent_info( i, j, nig, njg, cm, cn, pig, pjg, retval )
do while ( retval .eq. 1 )
call rsl_from_parent_msg( 4, X(i,j) )
call rsl_from_parent_info( i, j, nig, njg, cm, cn, pig, pjg, retval )
enddo
C
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC FORCE_DOMAIN
CC
CC Similar to init_domain, except this is called at
CC each step to force only the boundaries of the nest
CC (not the entire domain, as in init_domain).
CC
subroutine force_domain ( d, nst )
use domains_module
implicit none
#include "rsl.inc"
type(domainstruct) :: d, nst ! input: parent and nest
integer pi, pj, pig, pjg ! local: parent indices
integer ni, nj, nig, njg ! local: nest indices
integer m, n, i, j, msize ! local: misc variables
integer cm, cn ! local: relative nest index
integer retval ! local: return value
real, pointer :: X(:,:) ! local: pointer to state array
integer bdyinfo(2) ! local: boundary proximity
C
C Point to parent's state arrays.
C
X => d%X
C
C Build a message for ONLY THOSE POINTS on the nest that are on a boundary.
C The call to rsl_get_bdy_gpt gets the proximity information for a nested
C point. The information is then used to decide whether or not data should
C be packed for that point. Note: RSL will automatically size the messages
C between processors to exchange only that data that is packed.
C
call rsl_to_child_info( d%domdesc, nst%domdesc, 4,
$ i, j, pig, pjg, cm, cn, nig, njg, retval )
do while ( retval .eq. 1 )
call rsl_get_bdy_gpt( nst%domdesc, bdyinfo, nig, njg )
if ( bdyinfo(1) .eq. 0 ) then
call rsl_to_child_msg( 4, X(i,j) )
endif
call rsl_to_child_info( d%domdesc, nst%domdesc, 4,
$ i, j, pig, pjg, cm, cn, nig, njg, retval )
enddo
C
C Exchange the data using RSL inter-domain communication.
C
call rsl_bcast_msgs
C
C Now, point to nest state data
C
X => nst%X
C
C Unpack the message on each point of the nest. Because the first phase
C of the exchange packed only data for the boundary points, this loop
C will iterate only over those.
C
call rsl_from_parent_info( i, j, nig, njg, cm, cn, pig, pjg, retval )
do while ( retval .eq. 1 )
call rsl_from_parent_msg( 4, X(i,j) )
call rsl_from_parent_info( i, j, nig, njg, cm, cn, pig, pjg, retval )
enddo
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC MERGE_DOMAIN
CC
CC This implements the feedback of data from the nest to the parent.
CC The structure is similar to FORCE_DOMAIN, except that the flow
CC of information is in the opposite direction. Data is returned only
CC for the center nest point under each coarse domain point (cm=2, cn=2).
CC
subroutine merge_domain ( d, nst )
use domains_module
implicit none
#include "rsl.inc"
type(domainstruct) :: d, nst ! input: parent and nest
integer pi, pj, pig, pjg ! local: parent indices
integer ni, nj, nig, njg ! local: nest indices
integer m, n, i, j, msize ! local: misc variables
integer cm, cn ! local: relative nest index
integer retval ! local: return value
real, pointer :: X(:,:) ! local: pointer to state array
C
X =>nst%X
call rsl_to_parent_info( d%domdesc, nst%domdesc, 4,
$ i, j, nig, njg, cm, cn, pig, pjg, retval )
do while ( retval .eq. 1 )
if ( cm .eq. 1 .and. cn .eq. 1 ) then
call rsl_to_parent_msg( 4, X(i,j) )
endif
call rsl_to_parent_info( d%domdesc, nst%domdesc, 4,
$ i, j, nig, njg, cm, cn, pig, pjg, retval )
enddo
C
call rsl_merge_msgs
C
X => d%X
call rsl_from_child_info( i, j, pig, pjg, cm, cn, nig, njg, retval )
do while ( retval .eq. 1 )
if ( cm .eq. 2 .and. cn .eq. 2 ) then
call rsl_from_child_msg( 4, X(i,j) )
endif
call rsl_from_child_info( i, j, pig, pjg, cm, cn, nig, njg, retval )
enddo
C
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC INIT_GRID
CC
CC This is a computational routine whose purpose it is to assign
CC a domain with initial values. Boundary cells receive a non-zero
CC initial value. The interior receives a zero value.
CC
subroutine init_grid( d, ilen, jlen, klen )
use domains_module
implicit none
type(domainstruct) :: d
integer ilen, jlen, klen
RSL_DECLARE_RUN_VARS( 500 )
integer bdyinfo(ilen,jlen,2)
integer m, n, i, j
integer cn, cm
real, pointer :: X(:,:)
C
call rsl_get_run_info( d%domdesc, jlen, RSL_RUN_VARS )
call rsl_get_bdy_larray( d%domdesc, bdyinfo )
C
X => d%X
m = d%m
n = d%n
C
RSL_MAJOR_LOOP(j)
RSL_MINOR_LOOP(i)
if ( bdyinfo(i,j,1) .eq. 0 ) then
X(i,j) = 10.0
else
X(i,j) = 0.0
endif
RSL_END_MINOR_LOOP
RSL_END_MAJOR_LOOP
C
return
end
CC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC
CC DEFINE_DATA
CC
CC This is called only once, prior to the first step of the model, and
CC is used to define the stencils on all the domains in the simulation.
CC Stencils, though identical, must be assigned individually for each
CC domain. This routine does that recursively for the domain 'd', and
CC all nests under 'd'.
CC
recursive subroutine define_data( d )
use domains_module
#include "rsl.inc"
type(domainstruct) :: d ! input: domain
integer decomp(3) ! local: how dimensions are decomposed
integer llen(3) ! local: local size in each dim.
integer glen(3) ! local: global size in each dim.
integer mesg ! local: a message definition
integer messages(8) ! local: message for each stencil pt.
integer k ! local: child index
C
decomp(1) = RSL_NORTHSOUTH ! m is decomposed by n/s processors
decomp(2) = RSL_EASTWEST ! n is decomposed by e/w processors
glen(1) = d%m ! global sizes set
glen(2) = d%n
llen(1) = d%mloc ! local sizes set
llen(2) = d%nloc
C
C Create and build a message descriptor containing the state array X.
C
call rsl_create_message( mesg )
call rsl_build_message( mesg,RSL_REAL,d%X,2,decomp,glen,llen )
C
C Create and build a stencil with the message on each of the 8 pts.
C
call rsl_create_stencil( d%sten )
messages(1) = mesg
messages(2) = mesg
messages(3) = mesg
messages(4) = mesg
messages(5) = mesg
messages(6) = mesg
messages(7) = mesg
messages(8) = mesg
call rsl_describe_stencil( d%domdesc, d%sten, RSL_8PT, messages )
C
C Define the stencils for all the child domains of this domains
C
do k = 1, d%nkids
call define_data( d%child(k) ) ! RECURSION
enddo
C
return
end
C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C
This document was generated using the LaTeX2HTML translator Version 0.5.3 (Wed Jan 26 1994) Copyright © 1993, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 11.tex.
The translation was initiated by michalak@mcs.anl.gov on Thu May 18 17:22:52 CDT 1995