MPI-2: Extensions to the Message-Passing Interface

Message Passing Interface Forum

This document describes the MPI-1.2 and MPI-2 standards. They are both extensions to the MPI-1.1 standard. The MPI-1.2 part of the document contains clarifications and corrections to the MPI-1.1 standard and defines MPI-1.2. The MPI-2 part of the document describes additions to the MPI-1 standard and defines MPI-2. These include miscellaneous topics, process creation and management, one-sided communications, extended collective operations, external interfaces, I/O, and additional language bindings.

(c) 1995, 1996, 1997 University of Tennessee, Knoxville, Tennessee. Permission to copy without fee all or part of this material is granted, provided the University of Tennessee copyright notice and the title of this document appear, and notice is given that copying is by permission of the University of Tennessee.

Acknowledgments

This document represents the work of many people who have served on the MPI Forum. The meetings have been attended by dozens of people from many parts of the world. It is the hard and dedicated work of this group that has led to the MPI standard.

The technical development was carried out by subgroups, whose work was reviewed by the full committee. During the period of development of the Message Passing Interface (MPI-2), many people helped with this effort. Those who served as the primary coordinators are:


The following list includes some of the active participants who attended MPI-2 Forum meetings and are not mentioned above.

Greg Astfalk Robert Babb Ed Benson Rajesh Bordawekar
Pete Bradley Peter Brennan Ron Brightwell Maciej Brodowicz
Eric Brunner Greg Burns Margaret Cahir Pang Chen
Ying Chen Albert Cheng Yong Cho Joel Clark
Lyndon Clarke Laurie Costello Dennis Cottel Jim Cownie
Zhenqian Cui Suresh Damodaran-Kamal Raja Daoud Judith Devaney
David DiNucci Doug Doefler Jack Dongarra Terry Dontje
Nathan Doss Anne Elster Mark Fallon Karl Feind
Sam Fineberg Craig Fischberg Stephen Fleischman Ian Foster
Hubertus Franke Richard Frost Al Geist Robert George
David Greenberg John Hagedorn Kei Harada Leslie Hart
Shane Hebert Rolf Hempel Tom Henderson Alex Ho
Hans-Christian Hoppe Joefon Jann Terry Jones Karl Kesselman
Koichi Konishi Susan Kraus Steve Kubica Steve Landherr
Mario Lauria Mark Law Juan Leon Lloyd Lewins
Ziyang Lu Bob Madahar Peter Madams John May
Oliver McBryan Brian McCandless Tyce McLarty Thom McMahon
Harish Nag Nick Nevin Jarek Nieplocha Ron Oldfield
Peter Ossadnik Steve Otto Peter Pacheco Yoonho Park
Perry Partow Pratap Pattnaik Elsie Pierce Paul Pierce
Heidi Poxon Jean-Pierre Prost Boris Protopopov James Pruyve
Rolf Rabenseifner Joe Rieken Peter Rigsbee Tom Robey
Anna Rounbehler Nobutoshi Sagawa Arindam Saha Eric Salo
Darren Sanders Eric Sharakan Andrew Sherman Fred Shirley
Lance Shuler A. Gordon Smith Ian Stockdale David Taylor
Stephen Taylor Greg Tensa Rajeev Thakur Marydell Tholburn
Dick Treumann Simon Tsang Manuel Ujaldon David Walker
Jerrell Watts Klaus Wolf Parkson Wong Dave Wright

The MPI Forum also acknowledges and appreciates the valuable input from people via e-mail and in person.

The following institutions supported the MPI-2 effort through time and travel support for the people listed above.

Argonne National Laboratory
Bolt, Beranek, and Newman
California Institute of Technology
Center for Computing Sciences
Convex Computer Corporation
Cray Research
Digital Equipment Corporation
Dolphin Interconnect Solutions, Inc.
Edinburgh Parallel Computing Centre
General Electric Company
German National Research Center for Information Technology
Hewlett-Packard
Hitachi
Hughes Aircraft Company
Intel Corporation
International Business Machines
Khoral Research
Lawrence Livermore National Laboratory
Los Alamos National Laboratory
MPI Software Technology, Inc.
Mississippi State University
NEC Corporation
National Aeronautics and Space Administration
National Energy Research Scientific Computing Center
National Institute of Standards and Technology
National Oceanic and Atmospheric Administration
Oak Ridge National Laboratory
Ohio State University
PALLAS GmbH
Pacific Northwest National Laboratory
Pratt & Whitney
San Diego Supercomputer Center
Sanders, A Lockheed-Martin Company
Sandia National Laboratories
Schlumberger
Scientific Computing Associates, Inc.
Silicon Graphics Incorporated
Sky Computers
Sun Microsystems Computer Corporation
Syracuse University
The MITRE Corporation
Thinking Machines Corporation
United States Navy
University of Colorado
University of Denver
University of Houston
University of Illinois
University of Maryland
University of Notre Dame
University of San Francisco
University of Stuttgart Computing Center
University of Wisconsin

MPI-2 operated on a very tight budget (in reality, it had no budget when the first meeting was announced). Many institutions helped the MPI-2 effort by supporting the efforts and travel of the members of the MPI Forum. Direct support was given by NSF and DARPA under NSF contract CDA-9115428 for travel by U.S. academic participants and Esprit under project HPC Standards (21111) for European participants.


Contents

  • Introduction to MPI-2
  • Background
  • Organization of this Document
  • MPI-2 Terms and Conventions
  • Document Notation
  • Naming Conventions
  • Procedure Specification
  • Semantic Terms
  • Data Types
  • Opaque Objects
  • Array Arguments
  • State
  • Named Constants
  • Choice
  • Addresses
  • File Offsets
  • Language Binding
  • Deprecated Names and Functions
  • Fortran Binding Issues
  • C Binding Issues
  • C++ Binding Issues
  • Processes
  • Error Handling
  • Implementation Issues
  • Independence of Basic Runtime Routines
  • Interaction with Signals
  • Examples
  • Version 1.2 of MPI
  • Version Number
  • MPI-1.0 and MPI-1.1 Clarifications
  • Clarification of MPI_INITIALIZED
  • Clarification of MPI_FINALIZE
  • Clarification of status after MPI_WAIT and MPI_TEST
  • Clarification of MPI_INTERCOMM_CREATE
  • Clarification of MPI_INTERCOMM_MERGE
  • Clarification of Binding of MPI_TYPE_SIZE
  • Clarification of MPI_REDUCE
  • Clarification of Error Behavior of Attribute Callback Functions
  • Clarification of MPI_PROBE and MPI_IPROBE
  • Minor Corrections
  • Miscellany
  • Portable MPI Process Startup
  • Passing NULL to MPI_Init
  • Version Number
  • Datatype Constructor MPI_TYPE_CREATE_INDEXED_BLOCK
  • Treatment of MPI_Status
  • Passing MPI_STATUS_IGNORE for Status
  • Non-destructive Test of status
  • Error Class for Invalid Keyval
  • Committing a Committed Datatype
  • Allowing User Functions at Process Termination
  • Determining Whether MPI Has Finished
  • The Info Object
  • Memory Allocation
  • Language Interoperability
  • Introduction
  • Assumptions
  • Initialization
  • Transfer of Handles
  • Status
  • MPI Opaque Objects
  • Datatypes
  • Callback Functions
  • Error Handlers
  • Reduce Operations
  • Addresses
  • Attributes
  • Extra State
  • Constants
  • Interlanguage Communication
  • Error Handlers
  • Error Handlers for Communicators
  • Error Handlers for Windows
  • Error Handlers for Files
  • New Datatype Manipulation Functions
  • Type Constructors with Explicit Addresses
  • Extent and Bounds of Datatypes
  • True Extent of Datatypes
  • Subarray Datatype Constructor
  • Distributed Array Datatype Constructor
  • New Predefined Datatypes
  • Wide Characters
  • Signed Characters and Reductions
  • Unsigned long long Type
  • Canonical MPI_PACK and MPI_UNPACK
  • Functions and Macros
  • Profiling Interface
  • Process Creation and Management
  • Introduction
  • The MPI-2 Process Model
  • Starting Processes
  • The Runtime Environment
  • Process Manager Interface
  • Processes in MPI
  • Starting Processes and Establishing Communication
  • Starting Multiple Executables and Establishing Communication
  • Reserved Keys
  • Spawn Example
  • Manager-worker Example, Using MPI_SPAWN.
  • Establishing Communication
  • Names, Addresses, Ports, and All That
  • Server Routines
  • Client Routines
  • Name Publishing
  • Reserved Key Values
  • Client/Server Examples
  • Simplest Example --- Completely Portable.
  • Ocean/Atmosphere - Relies on Name Publishing
  • Simple Client-Server Example.
  • Other Functionality
  • Universe Size
  • Singleton MPI_INIT
  • MPI_APPNUM
  • Releasing Connections
  • Another Way to Establish MPI Communication
  • One-Sided Communications
  • Introduction
  • Initialization
  • Window Creation
  • Window Attributes
  • Communication Calls
  • Put
  • Get
  • Examples
  • Accumulate Functions
  • Synchronization Calls
  • Fence
  • General Active Target Synchronization
  • Lock
  • Assertions
  • Miscellaneous Clarifications
  • Examples
  • Error Handling
  • Error Handlers
  • Error Classes
  • Semantics and Correctness
  • Atomicity
  • Progress
  • Registers and Compiler Optimizations
  • Extended Collective Operations
  • Introduction
  • Intercommunicator Constructors
  • Extended Collective Operations
  • Intercommunicator Collective Operations
  • Operations that Move Data
  • Broadcast
  • Gather
  • Scatter
  • ``All'' Forms and All-to-all
  • Reductions
  • Other Operations
  • Generalized All-to-all Function
  • Exclusive Scan
  • External Interfaces
  • Introduction
  • Generalized Requests
  • Examples
  • Associating Information with Status
  • Naming Objects
  • Error Classes, Error Codes, and Error Handlers
  • Decoding a Datatype
  • MPI and Threads
  • General
  • Clarifications
  • Initialization
  • New Attribute Caching Functions
  • Communicators
  • Windows
  • Datatypes
  • Duplicating a Datatype
  • I/O
  • Introduction
  • Definitions
  • File Manipulation
  • Opening a File
  • Closing a File
  • Deleting a File
  • Resizing a File
  • Preallocating Space for a File
  • Querying the Size of a File
  • Querying File Parameters
  • File Info
  • Reserved File Hints
  • File Views
  • Data Access
  • Data Access Routines
  • Positioning
  • Synchronism
  • Coordination
  • Data Access Conventions
  • Data Access with Explicit Offsets
  • Data Access with Individual File Pointers
  • Data Access with Shared File Pointers
  • Noncollective Operations
  • Collective Operations
  • Seek
  • Split Collective Data Access Routines
  • File Interoperability
  • Datatypes for File Interoperability
  • External Data Representation: ``external32''
  • User-Defined Data Representations
  • Extent Callback
  • Datarep Conversion Functions
  • Matching Data Representations
  • Consistency and Semantics
  • File Consistency
  • Random Access vs. Sequential Files
  • Progress
  • Collective File Operations
  • Type Matching
  • Miscellaneous Clarifications
  • MPI_Offset Type
  • Logical vs. Physical File Layout
  • File Size
  • Examples
  • Asynchronous I/O
  • I/O Error Handling
  • I/O Error Classes
  • Examples
  • Double Buffering with Split Collective I/O
  • Subarray Filetype Constructor
  • Language Bindings
  • C++
  • Overview
  • Design
  • C++ Classes for MPI
  • Class Member Functions for MPI
  • Semantics
  • C++ Datatypes
  • Communicators
  • Exceptions
  • Mixed-Language Operability
  • Profiling
  • Fortran Support
  • Overview
  • Problems With Fortran Bindings for MPI
  • Problems Due to Strong Typing
  • Problems Due to Data Copying and Sequence Association
  • Special Constants
  • Fortran 90 Derived Types
  • A Problem with Register Optimization
  • Basic Fortran Support
  • Extended Fortran Support
  • The mpi Module
  • No Type Mismatch Problems for Subroutines with Choice Arguments
  • Additional Support for Fortran Numeric Intrinsic Types
  • Parameterized Datatypes with Specified Precision and Exponent Range
  • Support for Size-specific MPI Datatypes
  • Communication With Size-specific Types
  • Bibliography
  • Language Binding
  • Introduction
  • Defined Constants, Error Codes, Info Keys, and Info Values
  • Defined Constants
  • Info Keys
  • Info Values
  • MPI-1.2 C Bindings
  • MPI-1.2 Fortran Bindings
  • MPI-1.2 C++ Bindings
  • MPI-2 C Bindings
  • Miscellany
  • Process Creation and Management
  • One-Sided Communications
  • Extended Collective Operations
  • External Interfaces
  • I/O
  • Language Bindings
  • MPI-2 C Functions
  • MPI-2 Fortran Bindings
  • Miscellany
  • Process Creation and Management
  • One-Sided Communications
  • Extended Collective Operations
  • External Interfaces
  • I/O
  • Language Bindings
  • MPI-2 Fortran Subroutines
  • MPI-2 C++ Bindings
  • Miscellany
  • Process Creation and Management
  • One-Sided Communications
  • Extended Collective Operations
  • External Interfaces
  • I/O
  • Language Bindings
  • MPI-2 C++ Functions
  • MPI-1 C++ Language Binding
  • C++ Classes
  • Defined Constants
  • Typedefs
  • C++ Bindings for Point-to-Point Communication
  • C++ Bindings for Collective Communication
  • C++ Bindings for Groups, Contexts, and Communicators
  • C++ Bindings for Process Topologies
  • C++ Bindings for Environmental Inquiry
  • C++ Bindings for Profiling
  • C++ Bindings for Status Access
  • C++ Bindings for New 1.2 Functions
  • C++ Bindings for Exceptions
  • C++ Bindings on all MPI Classes
  • Construction / Destruction
  • Copy / Assignment
  • Comparison
  • Inter-language Operability
  • Function Name Cross Reference
  • Index


    1. Introduction to MPI-2


    1.1. Background


    In March 1995, the MPI Forum began meeting to consider corrections and extensions to the original MPI Standard document [5]. The first product of these deliberations was Version 1.1 of the MPI specification, released in June of 1995 (see http://www.mpi-forum.org for official MPI document releases). Since that time, effort has been focused on five types of areas.

      1. Further corrections and clarifications for the MPI-1.1 document.
      2. Additions to MPI-1.1 that do not significantly change its types of functionality (new datatype constructors, language interoperability, etc.).
      3. Completely new types of functionality (dynamic processes, one-sided communication, parallel I/O, etc.) that are what everyone thinks of as ``MPI-2 functionality.''
      4. Bindings for Fortran 90 and C++. This document specifies C++ bindings for both MPI-1 and MPI-2 functions, and extensions to the Fortran 77 binding of MPI-1 and MPI-2 to handle Fortran 90 issues.
      5. Discussions of areas in which the MPI process and framework seem likely to be useful, but where more discussion and experience are needed before standardization (e.g. 0-copy semantics on shared-memory machines, real-time specifications).

    Corrections and clarifications (items of type 1 in the above list) have been collected in the chapter ``Version 1.2 of MPI'' of this document. This chapter also contains the function for identifying the version number. Additions to MPI-1.1 (items of types 2, 3, and 4 in the above list) are in the remaining chapters, and constitute the specification for MPI-2. This document specifies Version 2.0 of MPI. Items of type 5 in the above list have been moved to a separate document, the ``MPI Journal of Development'' (JOD), and are not part of the MPI-2 Standard.

    This structure makes it easy for users and implementors to understand what level of MPI compliance a given implementation has:


    It is to be emphasized that forward compatibility is preserved. That is, a valid MPI-1.1 program is both a valid MPI-1.2 program and a valid MPI-2 program, and a valid MPI-1.2 program is a valid MPI-2 program.



    1.2. Organization of this Document


    This document is organized as follows:


    The rest of this document contains the MPI-2 Standard Specification. It adds substantial new types of functionality to MPI, in most cases specifying functions for an extended computational model (e.g., dynamic process creation and one-sided communication) or for a significant new capability (e.g., parallel I/O).

    The following is a list of the chapters in MPI-2, along with a brief description of each.


    The Appendices are:


    The MPI Function Index is a simple index showing the location of the precise definition of each MPI-2 function, together with C, C++, and Fortran bindings.

    MPI-2 provides various interfaces to facilitate interoperability of distinct MPI implementations. Among these are the canonical data representation for MPI I/O and for MPI_PACK_EXTERNAL and MPI_UNPACK_EXTERNAL. The definition of an actual binding of these interfaces that will enable interoperability is outside the scope of this document.

    A separate document consists of ideas that were discussed in the MPI Forum and deemed to have value, but are not included in the MPI Standard. They are part of the ``Journal of Development'' (JOD), lest good ideas be lost and in order to provide a starting point for further work. The chapters in the JOD are




    2. MPI-2 Terms and Conventions


    This chapter explains notational terms and conventions used throughout the MPI-2 document, some of the choices that have been made, and the rationale behind those choices. It is similar to the MPI-1 Terms and Conventions chapter but differs in some major and minor ways. Some of the major areas of difference are the naming conventions, some semantic definitions, file objects, Fortran 90 vs Fortran 77, C++, processes, and interaction with signals.



    2.1. Document Notation


    Rationale.

    Throughout this document, the rationale for the design choices made in the interface specification is set off in this format. Some readers may wish to skip these sections, while readers interested in interface design may want to read them carefully. (End of rationale.)

    Advice to users.

    Throughout this document, material aimed at users and that illustrates usage is set off in this format. Some readers may wish to skip these sections, while readers interested in programming in MPI may want to read them carefully. (End of advice to users.)

    Advice to implementors.

    Throughout this document, material that is primarily commentary to implementors is set off in this format. Some readers may wish to skip these sections, while readers interested in MPI implementations may want to read them carefully. (End of advice to implementors.)



    2.2. Naming Conventions


    MPI-1 used informal naming conventions. In many cases, MPI-1 names for C functions are of the form Class_action_subset and in Fortran of the form CLASS_ACTION_SUBSET, but this rule is not uniformly applied. In MPI-2, an attempt has been made to standardize names of new functions according to the following rules. In addition, the C++ bindings for MPI-1 functions also follow these rules (see Section C++ Binding Issues). C and Fortran function names for MPI-1 have not been changed.

      1. In C, all routines associated with a particular type of MPI object should be of the form Class_action_subset or, if no subset exists, of the form Class_action. In Fortran, all routines associated with a particular type of MPI object should be of the form CLASS_ACTION_SUBSET or, if no subset exists, of the form CLASS_ACTION. For C and Fortran we use the C++ terminology to define the Class. In C++, the routine is a method on Class and is named MPI::Class::Action_subset. If the routine is associated with a certain class, but does not make sense as an object method, it is a static member function of the class.


      2. If the routine is not associated with a class, the name should be of the form Action_subset in C and ACTION_SUBSET in Fortran, and in C++ should be scoped in the MPI namespace, MPI::Action_subset.


      3. The names of certain actions have been standardized. In particular, Create creates a new object, Get retrieves information about an object, Set sets this information, Delete deletes information, Is asks whether or not an object has a certain property.

    C and Fortran names for MPI-1 functions violate these rules in several cases. The most common exceptions are the omission of the Class name from the routine and the omission of the Action where one can be inferred.

    MPI identifiers are limited to 30 characters (31 with the profiling interface). This is done to avoid exceeding the limit on some compilation systems.
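
    The naming rules above can be sketched as a few string-building helpers. This is only an illustration: the helper names (model_c_name, model_fortran_name, model_cxx_name, model_name_within_limit) are hypothetical, and the MPI_ prefix is an assumption taken from identifiers that appear elsewhere in this document, such as MPI_TYPE_CREATE_SUBARRAY. The sketch builds the C, Fortran, and C++ spellings from a Class, an action, and an optional subset, and checks a name against the 30-character limit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MODEL_NAME_LIMIT 30  /* identifier limit stated in the text */

/* C form: MPI_Class_action_subset, or MPI_Class_action when subset is
 * NULL or empty. Returns the name length, or -1 if out is too small. */
int model_c_name(char *out, size_t cap, const char *cls,
                 const char *action, const char *subset)
{
    int n;
    assert(out && cls && action);
    if (subset && subset[0])
        n = snprintf(out, cap, "MPI_%s_%s_%s", cls, action, subset);
    else
        n = snprintf(out, cap, "MPI_%s_%s", cls, action);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Fortran form: the same name written in upper case. */
int model_fortran_name(char *out, size_t cap, const char *cls,
                       const char *action, const char *subset)
{
    int n = model_c_name(out, cap, cls, action, subset);
    for (int i = 0; i < n; ++i)
        out[i] = (char)toupper((unsigned char)out[i]);
    return n;
}

/* C++ form: MPI::Class::Action_subset, a method on Class. */
int model_cxx_name(char *out, size_t cap, const char *cls,
                   const char *action, const char *subset)
{
    int n;
    char first;
    assert(out && cls && action && action[0]);
    first = (char)toupper((unsigned char)action[0]);
    if (subset && subset[0])
        n = snprintf(out, cap, "MPI::%s::%c%s_%s", cls, first, action + 1, subset);
    else
        n = snprintf(out, cap, "MPI::%s::%c%s", cls, first, action + 1);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Returns 1 if the identifier respects the 30-character limit. */
int model_name_within_limit(const char *name)
{
    return strlen(name) <= MODEL_NAME_LIMIT;
}
```

    For example, with Class Comm, action get, and subset attr, the helpers produce MPI_Comm_get_attr, MPI_COMM_GET_ATTR, and MPI::Comm::Get_attr.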



    2.3. Procedure Specification


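    MPI procedures are specified using a language-independent notation. The arguments of procedure calls are marked as IN, OUT or INOUT. The meanings of these are:

      • the call uses but does not update an argument marked IN,
      • the call may update an argument marked OUT,
      • the call both uses and updates an argument marked INOUT.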

    There is one special case --- if an argument is a handle to an opaque object (these terms are defined in Section Opaque Objects), and the object is updated by the procedure call, then the argument is marked OUT. It is marked this way even though the handle itself is not modified --- we use the OUT attribute to denote that what the handle references is updated. Thus, in C++, IN arguments are either references or pointers to const objects.


    Rationale.

    The definition of MPI tries to avoid, to the largest possible extent, the use of INOUT arguments, because such use is error-prone, especially for scalar arguments. (End of rationale.)

    MPI's use of IN, OUT and INOUT is intended to indicate to the user how an argument is to be used, but does not provide a rigorous classification that can be translated directly into all language bindings (e.g., INTENT in Fortran 90 bindings or const in C bindings). For instance, the ``constant'' MPI_BOTTOM can usually be passed to OUT buffer arguments. Similarly, MPI_STATUS_IGNORE can be passed as the OUT status argument.

    A common occurrence for MPI functions is an argument that is used as IN by some processes and OUT by other processes. Such an argument is, syntactically, an INOUT argument and is marked as such, although, semantically, it is not used in one call both for input and for output on a single process.

    Another frequent situation arises when an argument value is needed only by a subset of the processes. When an argument is not significant at a process then an arbitrary value can be passed as an argument.
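
    The two situations just described can be illustrated with a toy model in plain C. No MPI library is called here: the "processes" are simply entries of an array, and the names model_bcast and model_gather are hypothetical. The first function shows a buffer argument that is read at one process and written at all others; the second shows a receive buffer that is significant only at the root, so that any value (here NULL) may be supplied elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy broadcast: bufs[r] stands for the buffer argument passed by
 * process r. It is used as IN at the root and as OUT everywhere else,
 * which is why such an argument is marked INOUT. */
void model_bcast(int *bufs[], int count, int nprocs, int root)
{
    assert(root >= 0 && root < nprocs);
    for (int r = 0; r < nprocs; ++r) {
        if (r != root)
            memcpy(bufs[r], bufs[root], (size_t)count * sizeof(int));
    }
}

/* Toy gather: sendvals[r] is the value contributed by process r.
 * recvbufs[r] is significant only at the root; the entries for all
 * other processes are never dereferenced and may hold anything. */
void model_gather(const int sendvals[], int *recvbufs[], int nprocs, int root)
{
    assert(root >= 0 && root < nprocs);
    for (int r = 0; r < nprocs; ++r)
        recvbufs[root][r] = sendvals[r];
}
```

    In the gather model, a caller may pass NULL for every receive buffer except the root's without affecting the result.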

    Unless specified otherwise, an argument of type OUT or type INOUT cannot be aliased with any other argument passed to an MPI procedure. An example of argument aliasing in C appears below. If we define a C procedure like this,

    void copyIntBuffer( int *pin, int *pout, int len ) 
    {   int i; 
        for (i=0; i<len; ++i) *pout++ = *pin++; 
    } 
    
    then a call to it in the following code fragment has aliased arguments.
    int a[10]; 
    copyIntBuffer( a, a+3, 7); 
    
    Although the C language allows this, such usage of MPI procedures is forbidden unless otherwise specified. Note that Fortran prohibits aliasing of arguments.
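
    To see why aliased arguments are hazardous, the fragment above can be made concrete. The sketch below repeats copyIntBuffer and adds a hypothetical helper, int_ranges_overlap, that reports whether two int ranges share storage. The helper assumes a flat address space in which pointers converted to uintptr_t can be compared meaningfully; this holds on common platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The element-by-element copy from the text. */
void copyIntBuffer(int *pin, int *pout, int len)
{
    int i;
    for (i = 0; i < len; ++i) *pout++ = *pin++;
}

/* Returns 1 if [a, a+na) and [b, b+nb) share at least one element.
 * Assumption: addresses converted to uintptr_t are ordered like the
 * underlying memory (flat address space). */
int int_ranges_overlap(const int *a, size_t na, const int *b, size_t nb)
{
    uintptr_t a0, a1, b0, b1;
    if (na == 0 || nb == 0) return 0;   /* empty ranges never overlap */
    a0 = (uintptr_t)a;
    a1 = (uintptr_t)(a + na);
    b0 = (uintptr_t)b;
    b1 = (uintptr_t)(b + nb);
    return a0 < b1 && b0 < a1;
}
```

    If a initially holds 0 through 9, the call copyIntBuffer(a, a+3, 7) leaves 0 1 2 0 1 2 0 1 2 0 in the array: the copy reads elements it has already overwritten, so the result is not a shifted copy of the original seven values.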

    All MPI functions are first specified in the language-independent notation. Immediately below this, the ANSI C version of the function is shown followed by a version of the same function in Fortran and then the C++ binding. Fortran in this document refers to Fortran 90; see Section Language Binding.



    2.4. Semantic Terms


    When discussing MPI procedures the following semantic terms are used.

    nonblocking
    A procedure is nonblocking if the procedure may return before the operation completes, and before the user is allowed to reuse resources (such as buffers) specified in the call. A nonblocking request is started by the call that initiates it, e.g., MPI_ISEND. The word complete is used with respect to operations, requests, and communications. An operation completes when the user is allowed to reuse resources, and any output buffers have been updated; i.e. a call to MPI_TEST will return flag = true. A request is completed by a call to wait, which returns, or a test or get status call which returns flag = true. This completing call has two effects: the status is extracted from the request; in the case of test and wait, if the request was nonpersistent, it is freed. A communication completes when all participating operations complete.
    blocking
    A procedure is blocking if return from the procedure indicates the user is allowed to reuse resources specified in the call.
    local
    A procedure is local if completion of the procedure depends only on the local executing process.
    non-local
    A procedure is non-local if completion of the operation may require the execution of some MPI procedure on another process. Such an operation may require communication occurring with another user process.
    collective
    A procedure is collective if all processes in a process group need to invoke the procedure. A collective call may or may not be synchronizing. Collective calls over the same communicator must be executed in the same order by all members of the process group.
    predefined
    A predefined datatype is a datatype with a predefined (constant) name (such as MPI_INT, MPI_FLOAT_INT, or MPI_UB) or a datatype constructed with MPI_TYPE_CREATE_F90_INTEGER, MPI_TYPE_CREATE_F90_REAL, or MPI_TYPE_CREATE_F90_COMPLEX. The former are named whereas the latter are unnamed.
    derived
    A derived datatype is any datatype that is not predefined.
    portable
    A datatype is portable if it is a predefined datatype, or it is derived from a portable datatype using only the type constructors MPI_TYPE_CONTIGUOUS, MPI_TYPE_VECTOR, MPI_TYPE_INDEXED, MPI_TYPE_CREATE_INDEXED_BLOCK, MPI_TYPE_CREATE_SUBARRAY, MPI_TYPE_DUP, and MPI_TYPE_CREATE_DARRAY. Such a datatype is portable because all displacements in the datatype are in terms of extents of one predefined datatype. Therefore, if such a datatype fits a data layout in one memory, it will fit the corresponding data layout in another memory, if the same declarations were used, even if the two systems have different architectures. On the other hand, if a datatype was constructed using MPI_TYPE_CREATE_HINDEXED, MPI_TYPE_CREATE_HVECTOR or MPI_TYPE_CREATE_STRUCT, then the datatype contains explicit byte displacements (e.g., providing padding to meet alignment restrictions). Displacements chosen to fit a data layout in one memory are unlikely to be correct for data layouts on another process running on a processor with a different architecture.
    equivalent
    Two datatypes are equivalent if they appear to have been created with the same sequence of calls (and arguments) and thus have the same typemap. Two equivalent datatypes do not necessarily have the same cached attributes or the same names.
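
    The description of request completion under ``nonblocking'' can be sketched as a small state model. This is not MPI code: model_request and the model_* functions are hypothetical stand-ins for a request, a test call, and a get-status call, and the status is reduced to a single int. The sketch shows the two effects of a completing call: the status is extracted, and a nonpersistent request is freed by test but not by get status. The inactive state of a completed persistent request is not modeled.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct model_request {
    int done;        /* the operation has completed */
    int persistent;  /* persistent requests are not freed on completion */
    int status;      /* stand-in for the status information */
} model_request;

/* Start a request; the procedure returns before the operation completes. */
model_request *model_start(int persistent)
{
    model_request *r = calloc(1, sizeof *r);
    assert(r != NULL);
    r->persistent = persistent;
    return r;
}

/* Simulates the operation completing in the background. */
void model_operation_finishes(model_request *r, int status)
{
    assert(r != NULL);
    r->done = 1;
    r->status = status;
}

/* Test: if it returns flag = 1 it completes the request. The status is
 * extracted and a nonpersistent request is freed, leaving *req NULL. */
void model_test(model_request **req, int *flag, int *status)
{
    assert(req && *req && flag && status);
    *flag = (*req)->done;
    if (!*flag) return;
    *status = (*req)->status;
    if (!(*req)->persistent) {
        free(*req);
        *req = NULL;
    }
}

/* Get status: reports completion and status but never frees the request. */
void model_get_status(const model_request *req, int *flag, int *status)
{
    assert(req && flag && status);
    *flag = req->done;
    if (*flag) *status = req->status;
}
```

    A caller would typically poll model_test until flag is set and must not touch the request handle afterwards if the request was nonpersistent.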



    2.5. Data Types


    2.5.1. Opaque Objects


    MPI manages system memory that is used for buffering messages and for storing internal representations of various MPI objects such as groups, communicators, datatypes, etc. This memory is not directly accessible to the user, and objects stored there are opaque: their size and shape are not visible to the user. Opaque objects are accessed via handles, which exist in user space. MPI procedures that operate on opaque objects are passed handle arguments to access these objects. In addition to their use by MPI calls for object access, handles can participate in assignments and comparisons.

    In Fortran, all handles have type INTEGER. In C and C++, a different handle type is defined for each category of objects. In addition, handles themselves are distinct objects in C++. The C and C++ types must support the use of the assignment and equality operators.


    Advice to implementors.

    In Fortran, the handle can be an index into a table of opaque objects in a system table; in C it can be such an index or a pointer to the object. C++ handles can simply ``wrap up'' a table index or pointer.

    (End of advice to implementors.)

    Opaque objects are allocated and deallocated by calls that are specific to each object type. These are listed in the sections where the objects are described. The calls accept a handle argument of matching type. In an allocate call this is an OUT argument that returns a valid reference to the object. In a call to deallocate this is an INOUT argument which returns with an ``invalid handle'' value. MPI provides an ``invalid handle'' constant for each object type. Comparisons to this constant are used to test for validity of the handle.

    A call to a deallocate routine invalidates the handle and marks the object for deallocation. The object is not accessible to the user after the call. However, MPI need not deallocate the object immediately. Any operation pending (at the time of the deallocate) that involves this object will complete normally; the object will be deallocated afterwards.
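
    The allocate and deallocate behavior described above can be sketched with a table of objects indexed by integer handles, one of the representations mentioned in the advice to implementors. Everything in this sketch is hypothetical (model_table, MODEL_HANDLE_NULL, and the model_* functions); it is a minimal model, not how any particular MPI implementation works. It shows the handle returning as the invalid value after deallocation while the object itself survives until its pending operation completes.

```c
#include <assert.h>

#define MODEL_MAX_OBJECTS 8
#define MODEL_HANDLE_NULL (-1)  /* stand-in for an "invalid handle" constant */

typedef int model_handle;       /* a handle is an index into model_table */

typedef struct {
    int in_use;       /* slot currently holds an object */
    int marked;       /* deallocate was called; reclaim when idle */
    int pending_ops;  /* operations that still need the object */
} model_object;

static model_object model_table[MODEL_MAX_OBJECTS];

/* Allocate: h is an OUT argument that receives a valid handle.
 * Returns 0 on success, -1 (and the invalid handle) if the table is full. */
int model_alloc(model_handle *h)
{
    assert(h != 0);
    for (int i = 0; i < MODEL_MAX_OBJECTS; ++i) {
        if (!model_table[i].in_use) {
            model_table[i].in_use = 1;
            model_table[i].marked = 0;
            model_table[i].pending_ops = 0;
            *h = i;
            return 0;
        }
    }
    *h = MODEL_HANDLE_NULL;
    return -1;
}

/* Start an operation on the object; returns the slot so the operation
 * can finish even after the user's handle has been invalidated. */
int model_op_begin(model_handle h)
{
    assert(h >= 0 && h < MODEL_MAX_OBJECTS && model_table[h].in_use);
    model_table[h].pending_ops++;
    return h;
}

/* Finish an operation; reclaim the slot if it was marked and is now idle. */
void model_op_complete(int slot)
{
    assert(slot >= 0 && slot < MODEL_MAX_OBJECTS);
    assert(model_table[slot].pending_ops > 0);
    if (--model_table[slot].pending_ops == 0 && model_table[slot].marked)
        model_table[slot].in_use = 0;
}

/* Deallocate: h is INOUT and comes back as the invalid handle. The
 * object is reclaimed now if idle, otherwise when its last operation ends. */
void model_free(model_handle *h)
{
    assert(h != 0 && *h >= 0 && *h < MODEL_MAX_OBJECTS);
    assert(model_table[*h].in_use);
    model_table[*h].marked = 1;
    if (model_table[*h].pending_ops == 0)
        model_table[*h].in_use = 0;
    *h = MODEL_HANDLE_NULL;
}

/* Inspection helper for the sketch: is the slot still occupied? */
int model_slot_live(int slot)
{
    assert(slot >= 0 && slot < MODEL_MAX_OBJECTS);
    return model_table[slot].in_use;
}
```

    Comparing a handle with MODEL_HANDLE_NULL after model_free plays the role of the validity test against the ``invalid handle'' constant.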

    An opaque object and its handle are significant only at the process where the object was created and cannot be transferred to another process.

    MPI provides certain predefined opaque objects and predefined, static handles to these objects. The user must not free such objects. In C++, this is enforced by declaring the handles to these predefined objects to be static const.


    Rationale.

    This design hides the internal representation used for MPI data structures, thus allowing similar calls in C, C++, and Fortran. It also avoids conflicts with the typing rules in these languages, and easily allows future extensions of functionality. The mechanism for opaque objects used here loosely follows the POSIX Fortran binding standard.

    The explicit separation of handles in user space and objects in system space allows space-reclaiming and deallocation calls to be made at appropriate points in the user program. If the opaque objects were in user space, one would have to be very careful not to go out of scope before any pending operation requiring that object completed. With the specified design, an object can be marked for deallocation and the user program can then go out of scope, while the object itself persists until any pending operations are complete.

    The requirement that handles support assignment/comparison is made since such operations are common. This restricts the domain of possible implementations. The alternative would have been to allow handles to be of an arbitrary, opaque type. This would force the introduction of routines to do assignment and comparison, adding complexity, and was therefore ruled out. (End of rationale.)

    Advice to users.

    A user may accidentally create a dangling reference by assigning to a handle the value of another handle, and then deallocating the object associated with these handles. Conversely, if a handle variable is deallocated before the associated object is freed, then the object becomes inaccessible (this may occur, for example, if the handle is a local variable within a subroutine, and the subroutine is exited before the associated object is deallocated). It is the user's responsibility to avoid adding or deleting references to opaque objects, except as a result of MPI calls that allocate or deallocate such objects. (End of advice to users.)

    Advice to implementors.

    The intended semantics of opaque objects is that opaque objects are separate from one another; each call to allocate such an object copies all the information required for the object. Implementations may avoid excessive copying by substituting referencing for copying. For example, a derived datatype may contain references to its components, rather than copies of its components; a call to MPI_COMM_GROUP may return a reference to the group associated with the communicator, rather than a copy of this group. In such cases, the implementation must maintain reference counts, and allocate and deallocate objects in such a way that the visible effect is as if the objects were copied. ( End of advice to implementors.)
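
    The reference-counting scheme described in this advice can be sketched in plain C. This is a toy model under stated assumptions, not a description of how any MPI library is built: the names sketch_group, group_from_ranks, group_share, and group_free are hypothetical, and real handles need not be pointers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical opaque object: an immutable group of process ranks. */
typedef struct sketch_group {
    int refcount;   /* number of handles that refer to this object */
    int size;       /* number of entries in 'ranks' */
    int *ranks;     /* private copy of the ranks */
} sketch_group;

/* In this sketch a handle is a pointer; NULL plays the role of a null handle. */
typedef sketch_group *sketch_group_handle;

/* Allocate a new object holding its own copy of 'ranks'.
   Returns NULL if memory cannot be obtained. */
sketch_group_handle group_from_ranks(const int ranks[], int size)
{
    sketch_group *g = malloc(sizeof *g);
    if (g == NULL) return NULL;
    g->ranks = malloc((size_t)size * sizeof *g->ranks);
    if (g->ranks == NULL && size > 0) { free(g); return NULL; }
    if (size > 0) memcpy(g->ranks, ranks, (size_t)size * sizeof *g->ranks);
    g->refcount = 1;
    g->size = size;
    return g;
}

/* "Copy" an object by sharing it. Because the object is immutable, the
   caller cannot tell the difference between this and a real copy. */
sketch_group_handle group_share(sketch_group_handle g)
{
    g->refcount++;
    return g;
}

/* Free through a handle: the handle is set to null at once, but the
   storage is released only when the last reference goes away.
   Returns 1 if the object was destroyed, 0 if it is still referenced. */
int group_free(sketch_group_handle *h)
{
    sketch_group *g = *h;
    *h = NULL;
    if (--g->refcount > 0) return 0;
    free(g->ranks);
    free(g);
    return 1;
}
```

    Sharing is safe here only because the sketched object never changes after creation; an implementation that shares mutable state would have to copy before modifying it in order to preserve the visible copy semantics.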



    2.5.2. Array Arguments

    An MPI call may need an argument that is an array of opaque objects, or an array of handles. The array-of-handles is a regular array with entries that are handles to objects of the same type in consecutive locations in the array. Whenever such an array is used, an additional len argument is required to indicate the number of valid entries (unless this number can be derived otherwise). The valid entries are at the beginning of the array; len indicates how many of them there are, and need not be the size of the entire array. The same approach is followed for other array arguments. In some cases NULL handles are considered valid entries. When a NULL argument is desired for an array of statuses, one uses MPI_STATUSES_IGNORE.
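
    The len convention can be illustrated with a small C sketch. The handle type and the null value below are hypothetical stand-ins, not MPI definitions; the point is only that the callee examines the first len entries and ignores any remaining capacity in the array.

```c
/* Hypothetical handle type and null-handle value, for illustration only. */
typedef int sketch_handle;
#define SKETCH_HANDLE_NULL (-1)

/* Count the non-null handles among the first 'len' entries.
   Entries at index 'len' and beyond are never read, so the array
   may be larger than the number of valid entries. */
int count_active(const sketch_handle handles[], int len)
{
    int active = 0;
    for (int i = 0; i < len; i++) {
        if (handles[i] != SKETCH_HANDLE_NULL)
            active++;
    }
    return active;
}
```

    A caller holding an array of eight handles of which only the first three are in use would pass len equal to 3; a null handle among those three is still a valid entry and is simply skipped by this particular routine.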



    2.5.3. State

    MPI procedures use at various places arguments with state types. The values of such a data type are all identified by names, and no operation is defined on them. For example, the MPI_TYPE_CREATE_SUBARRAY routine has a state argument order with values MPI_ORDER_C and MPI_ORDER_FORTRAN.



    2.5.4. Named Constants

    MPI procedures sometimes assign a special meaning to a special value of a basic type argument; e.g., tag is an integer-valued argument of point-to-point communication operations, with a special wild-card value, MPI_ANY_TAG. Such arguments will have a range of regular values, which is a proper subrange of the range of values of the corresponding basic type; special values (such as MPI_ANY_TAG) will be outside the regular range. The range of regular values, such as tag, can be queried using environmental inquiry functions (Chapter 7 of the MPI-1 document). The range of other values, such as source, depends on values given by other MPI routines (in the case of source it is the communicator size).
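
    The idea of a special value lying outside the regular range can be sketched as follows. The constants are hypothetical placeholders chosen for the example; the actual values of MPI_ANY_TAG and of the tag upper bound are implementation defined.

```c
/* Hypothetical values for illustration; real MPI values are implementation defined. */
#define SKETCH_TAG_UB   32767   /* largest regular tag in this sketch */
#define SKETCH_ANY_TAG  (-1)    /* wildcard, deliberately outside 0..SKETCH_TAG_UB */

/* A regular tag is one that a sender may attach to a message. */
int is_regular_tag(int tag)
{
    return tag >= 0 && tag <= SKETCH_TAG_UB;
}

/* Does a receive posted with tag 'wanted' match a message carrying 'actual'?
   Because the wildcard is not a regular tag, it can never be mistaken
   for a tag that a sender actually used. */
int tag_matches(int wanted, int actual)
{
    if (!is_regular_tag(actual)) return 0;
    return wanted == SKETCH_ANY_TAG || wanted == actual;
}
```

    Keeping the wildcard outside the regular subrange is what lets a single integer argument carry both ordinary tags and the special value without ambiguity.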

    MPI also provides predefined named constant handles, such as MPI_COMM_WORLD.

    All named constants, with the exceptions noted below for Fortran, can be used in initialization expressions or assignments. These constants do not change values during execution. Opaque objects accessed by constant handles are defined and do not change value between MPI initialization ( MPI_INIT) and MPI completion ( MPI_FINALIZE).

    The constants that cannot be used in initialization expressions or assignments in Fortran are:

      MPI_BOTTOM 
      MPI_STATUS_IGNORE 
      MPI_STATUSES_IGNORE 
      MPI_ERRCODES_IGNORE 
      MPI_IN_PLACE 
      MPI_ARGV_NULL 
      MPI_ARGVS_NULL 
    

    Advice to implementors.

    In Fortran the implementation of these special constants may require the use of language constructs that are outside the Fortran standard. Using special values for the constants (e.g., by defining them through parameter statements) is not possible because an implementation cannot distinguish these values from legal data. Typically, these constants are implemented as predefined static variables (e.g., a variable in an MPI-declared COMMON block), relying on the fact that the target compiler passes data by address. Inside the subroutine, this address can be extracted by some mechanism outside the Fortran standard (e.g., by Fortran extensions or by implementing the function in C). ( End of advice to implementors.)
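
    The address-based technique described in this advice can be sketched in C. The names sketch_status_ignore and receive_sketch are hypothetical; the sketch shows why a constant recognized by its address cannot be confused with user data, whereas one recognized by its value could.

```c
/* Hypothetical sentinel variable; only its address matters, never its value.
   It stands in for a variable placed in an MPI-declared COMMON block. */
int sketch_status_ignore = 0;

/* Called with its argument passed by address, as a Fortran compiler
   typically passes data. Writes a status only if the caller did not
   pass the sentinel. Returns 1 if a status was written, 0 if skipped. */
int receive_sketch(int *status)
{
    if (status == &sketch_status_ignore)
        return 0;          /* caller asked for the status to be ignored */
    *status = 42;          /* placeholder for real status information */
    return 1;
}
```

    A user variable that happens to hold the same value as the sentinel is still treated as ordinary data, because the comparison is on addresses.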



    2.5.5. Choice

    MPI functions sometimes use arguments with a choice (or union) data type. Distinct calls to the same routine may pass by reference actual arguments of different types. The mechanism for providing such arguments will differ from language to language. For Fortran, the document uses <type> to represent a choice variable; for C and C++, we use void *.
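
    A choice argument in C can be sketched with void * plus a separate description of the element type. The enum and functions below are hypothetical illustrations; MPI itself describes the element type with a datatype handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type tags standing in for datatype handles. */
typedef enum { SKETCH_INT, SKETCH_DOUBLE, SKETCH_CHAR } sketch_type;

/* Size in bytes of one element of the given type. */
static size_t sketch_type_size(sketch_type t)
{
    switch (t) {
    case SKETCH_INT:    return sizeof(int);
    case SKETCH_DOUBLE: return sizeof(double);
    case SKETCH_CHAR:   return sizeof(char);
    }
    return 0;
}

/* Copy 'count' elements of type 't' from src to dst.
   Both buffers are choice arguments: any object pointer converts
   to void * without a cast. Returns the number of bytes copied. */
size_t copy_elements(void *dst, const void *src, int count, sketch_type t)
{
    size_t nbytes = (size_t)count * sketch_type_size(t);
    memcpy(dst, src, nbytes);
    return nbytes;
}
```

    The same routine accepts arrays of int, double, or char; the compiler performs no type check on the buffers, so it is the caller's job to pass a type description that matches the data.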



    2.5.6. Addresses

    Some MPI procedures use address arguments that represent an absolute address in the calling program. The datatype of such an argument is MPI_Aint in C, MPI::Aint in C++ and INTEGER (KIND=MPI_ADDRESS_KIND) in Fortran. There is the MPI constant MPI_BOTTOM to indicate the start of the address range.



    2.5.7. File Offsets

    For I/O there is a need to give the size, displacement, and offset into a file. These quantities can easily be larger than 32 bits which can be the default size of a Fortran integer. To overcome this, these quantities are declared to be INTEGER (KIND=MPI_OFFSET_KIND) in Fortran. In C one uses MPI_Offset whereas in C++ one uses MPI::Offset.
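
    The need for a wide offset type is easy to see with a little arithmetic. The sketch below uses int64_t as a stand-in for a file-offset type; that choice, and the function names, are assumptions made for the example.

```c
#include <stdint.h>

/* Stand-in for a file-offset type wide enough for large files. */
typedef int64_t sketch_offset;

/* Byte offset of record number 'index' in a file of fixed-size records.
   The multiplication is carried out in 64 bits, so it does not wrap
   when the result passes 2^31. */
sketch_offset record_offset(sketch_offset index, sketch_offset record_bytes)
{
    return index * record_bytes;
}

/* Does an offset fit in a signed 32-bit integer? */
int fits_in_32_bits(sketch_offset off)
{
    return off >= INT32_MIN && off <= INT32_MAX;
}
```

    With 4096-byte records, record 1000 lies at byte 4096000, which fits in 32 bits, while record 1000000 lies at byte 4096000000, which does not.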



    2.6. Language Binding

    This section defines the rules for MPI language binding in general and for Fortran, ANSI C, and C++, in particular. (Note that ANSI C has been replaced by ISO C. References in MPI to ANSI C now mean ISO C.) Defined here are various object representations, as well as the naming conventions used for expressing this standard. The actual calling sequences are defined elsewhere.

    MPI bindings are for Fortran 90, though they are designed to be usable in Fortran 77 environments.

    Since the word PARAMETER is a keyword in the Fortran language, we use the word ``argument'' to denote the arguments to a subroutine. These are normally referred to as parameters in C and C++; however, we expect that C and C++ programmers will understand the word ``argument'' (which has no specific meaning in C/C++), thus allowing us to avoid unnecessary confusion for Fortran programmers.

    Since Fortran is case insensitive, linkers may use either lower case or upper case when resolving Fortran names. Users of case sensitive languages should avoid the ``mpi_'' and ``pmpi_'' prefixes.



    2.6.1. Deprecated Names and Functions

    A number of chapters refer to deprecated or replaced MPI-1 constructs. These are constructs that continue to be part of the MPI standard, but that users are recommended not to continue using, since MPI-2 provides better solutions. For example, the Fortran binding for MPI-1 functions that have address arguments uses INTEGER. This is not consistent with the C binding, and causes problems on machines with 32 bit INTEGERs and 64 bit addresses. In MPI-2, these functions have new names, and new bindings for the address arguments. The use of the old functions is deprecated. For consistency, here and in a few other cases, new C functions are also provided, even though the new functions are equivalent to the old functions. The old names are deprecated. Another example is provided by the MPI-1 predefined datatypes MPI_UB and MPI_LB. They are deprecated, since their use is awkward and error-prone, while the MPI-2 function MPI_TYPE_CREATE_RESIZED provides a more convenient mechanism to achieve the same effect.

    The following is a list of all of the deprecated constructs. Note that the constants MPI_LB and MPI_UB are replaced by the function MPI_TYPE_CREATE_RESIZED; this is because their principal use was as input datatypes to MPI_TYPE_STRUCT to create resized datatypes. Also note that some C typedefs and Fortran subroutine names are included in this list; they are the types of callback functions.

    Deprecated              MPI-2 Replacement
    MPI_ADDRESS             MPI_GET_ADDRESS
    MPI_TYPE_HINDEXED       MPI_TYPE_CREATE_HINDEXED
    MPI_TYPE_HVECTOR        MPI_TYPE_CREATE_HVECTOR
    MPI_TYPE_STRUCT         MPI_TYPE_CREATE_STRUCT
    MPI_TYPE_EXTENT         MPI_TYPE_GET_EXTENT
    MPI_TYPE_UB             MPI_TYPE_GET_EXTENT
    MPI_TYPE_LB             MPI_TYPE_GET_EXTENT
    MPI_LB                  MPI_TYPE_CREATE_RESIZED
    MPI_UB                  MPI_TYPE_CREATE_RESIZED
    MPI_ERRHANDLER_CREATE   MPI_COMM_CREATE_ERRHANDLER
    MPI_ERRHANDLER_GET      MPI_COMM_GET_ERRHANDLER
    MPI_ERRHANDLER_SET      MPI_COMM_SET_ERRHANDLER
    MPI_Handler_function    MPI_Comm_errhandler_fn
    MPI_KEYVAL_CREATE       MPI_COMM_CREATE_KEYVAL
    MPI_KEYVAL_FREE         MPI_COMM_FREE_KEYVAL
    MPI_DUP_FN              MPI_COMM_DUP_FN
    MPI_NULL_COPY_FN        MPI_COMM_NULL_COPY_FN
    MPI_NULL_DELETE_FN      MPI_COMM_NULL_DELETE_FN
    MPI_Copy_function       MPI_Comm_copy_attr_function
    COPY_FUNCTION           COMM_COPY_ATTR_FN
    MPI_Delete_function     MPI_Comm_delete_attr_function
    DELETE_FUNCTION         COMM_DELETE_ATTR_FN
    MPI_ATTR_DELETE         MPI_COMM_DELETE_ATTR
    MPI_ATTR_GET            MPI_COMM_GET_ATTR
    MPI_ATTR_PUT            MPI_COMM_SET_ATTR



    2.6.2. Fortran Binding Issues

    MPI-1.1 provided bindings for Fortran 77. MPI-2 retains these bindings but they are now interpreted in the context of the Fortran 90 standard. MPI can still be used with most Fortran 77 compilers, as noted below. When the term Fortran is used it means Fortran 90.

    All MPI names have an MPI_ prefix, and all characters are capitals. Programs must not declare variables, parameters, or functions with names beginning with the prefix MPI_. To avoid conflicting with the profiling interface, programs should also avoid functions with the prefix PMPI_. This is mandated to avoid possible name collisions.

    All MPI Fortran subroutines have a return code in the last argument. A few MPI operations which are functions do not have the return code argument. The return code value for successful completion is MPI_SUCCESS. Other error codes are implementation dependent; see the error codes in Chapter 7 of the MPI-1 document and Annex Language Binding in the MPI-2 document.

    Constants representing the maximum length of a string are one smaller in Fortran than in C and C++ as discussed in Section Constants.

    Handles are represented in Fortran as INTEGERs. Binary-valued variables are of type LOGICAL.

    Array arguments are indexed from one.

    The MPI Fortran binding is inconsistent with the Fortran 90 standard in several respects. These inconsistencies, such as register optimization problems, have implications for user codes that are discussed in detail in Section A Problem with Register Optimization. They are also inconsistent with Fortran 77.


    Additionally, MPI is inconsistent with Fortran 77 in a number of ways, as noted below.



    2.6.3. C Binding Issues

    We use the ANSI C declaration format. All MPI names have an MPI_ prefix, defined constants are in all capital letters, and defined types and functions have one capital letter after the prefix. Programs must not declare variables or functions with names beginning with the prefix MPI_. To support the profiling interface, programs should not declare functions with names beginning with the prefix PMPI_.

    The definition of named constants, function prototypes, and type definitions must be supplied in an include file mpi.h.

    Almost all C functions return an error code. The successful return code will be MPI_SUCCESS, but failure return codes are implementation dependent.

    Type declarations are provided for handles to each category of opaque objects.

    Array arguments are indexed from zero.

    Logical flags are integers with value 0 meaning ``false'' and a non-zero value meaning ``true.''

    Choice arguments are pointers of type void *.

    Address arguments are of MPI defined type MPI_Aint. File displacements are of type MPI_Offset. MPI_Aint is defined to be an integer of the size needed to hold any valid address on the target architecture. MPI_Offset is defined to be an integer of the size needed to hold any valid file size on the target architecture.



    2.6.4. C++ Binding Issues

    There are places in the standard that give rules for C and not for C++. In these cases, the C rule should be applied to the C++ case, as appropriate. In particular, the values of constants given in the text are the ones for C and Fortran. A cross index of these with the C++ names is given in Annex Language Binding.

    We use the ANSI C++ declaration format. All MPI names are declared within the scope of a namespace called MPI and therefore are referenced with an MPI:: prefix. Defined constants are in all capital letters, and class names, defined types, and functions have only their first letter capitalized. Programs must not declare variables or functions in the MPI namespace. This is mandated to avoid possible name collisions.

    The definition of named constants, function prototypes, and type definitions must be supplied in an include file mpi.h.


    Advice to implementors.

    The file mpi.h may contain both the C and C++ definitions. Usually one can simply use the defined value (generally __cplusplus, but not required) to see if one is using C++ to protect the C++ definitions. It is possible that a C compiler will require that the source protected this way be legal C code. In this case, all the C++ definitions can be placed in a different include file and the ``#include'' directive can be used to include the necessary C++ definitions in the mpi.h file. ( End of advice to implementors.)
    C++ functions that create objects or return information usually place the object or information in the return value. Since the language neutral prototypes of MPI functions include the C++ return value as an OUT parameter, semantic descriptions of MPI functions refer to the C++ return value by that parameter name (see Section Function Name Cross Reference). The remaining C++ functions return void.

    In some circumstances, MPI permits users to indicate that they do not want a return value. For example, the user may indicate that the status is not filled in. Unlike C and Fortran where this is achieved through a special input value, in C++ this is done by having two bindings where one has the optional argument and one does not.

    C++ functions do not return error codes. If the default error handler has been set to MPI::ERRORS_THROW_EXCEPTIONS, the C++ exception mechanism is used to signal an error by throwing an MPI::Exception object.

    It should be noted that the default error handler (i.e., MPI::ERRORS_ARE_FATAL) on a given type has not changed. User error handlers are also permitted. MPI::ERRORS_RETURN simply returns control to the calling function; there is no provision for the user to retrieve the error code.

    User callback functions that return integer error codes should not throw exceptions; the returned error will be handled by the MPI implementation by invoking the appropriate error handler.


    Advice to users.

    C++ programmers who want to handle MPI errors on their own should use the MPI::ERRORS_THROW_EXCEPTIONS error handler, rather than MPI::ERRORS_RETURN, which is used for that purpose in C. Care should be taken when using exceptions in mixed language situations. ( End of advice to users.)
    Opaque object handles must be objects in themselves, and have the assignment and equality operators overridden to perform semantically like their C and Fortran counterparts.

    Array arguments are indexed from zero.

    Logical flags are of type bool.

    Choice arguments are pointers of type void *.

    Address arguments are of MPI-defined integer type MPI::Aint, defined to be an integer of the size needed to hold any valid address on the target architecture. Analogously, MPI::Offset is an integer to hold file offsets.

    Most MPI functions are methods of MPI C++ classes. MPI class names are generated from the language neutral MPI types by dropping the MPI_ prefix and scoping the type within the MPI namespace. For example, MPI_DATATYPE becomes MPI::Datatype.

    The names of MPI-2 functions generally follow the naming rules given. In some circumstances, the new MPI-2 function is related to an MPI-1 function with a name that does not follow the naming conventions. In this circumstance, the language neutral name is chosen by analogy with the MPI-1 name, even though this gives an MPI-2 name that violates the naming conventions. The C and Fortran names are the same as the language neutral name in this case. However, the C++ names for MPI-1 do reflect the naming rules and can differ from the C and Fortran names. The C++ name of the new function is chosen by analogy with the C++ form of the MPI-1 name, and so it differs from the language neutral name. An example of this is the language neutral name of MPI_FINALIZED and a C++ name of MPI::Is_finalized.

    In C++, function typedefs are made publicly within appropriate classes. However, these declarations then become somewhat cumbersome. For example, the typedef

    typedef MPI::Grequest::Query_function();

    would, written out in full, look like the following:


    namespace MPI { 
      class Request { 
        // ... 
      }; 
     
      class Grequest : public MPI::Request { 
        // ... 
        typedef int Query_function(void* extra_state, MPI::Status& status); 
      }; 
    }; 
    
    Rather than including this scaffolding when declaring C++ typedefs, we use an abbreviated form. In particular, we explicitly indicate the class and namespace scope for the typedef of the function. Thus, the example above is shown in the text as follows:
    typedef int MPI::Grequest::Query_function(void* extra_state, 
                                              MPI::Status& status) 
    

    The C++ bindings presented in Annex MPI-1 C++ Language Binding and throughout this document were generated by applying a simple set of name generation rules to the MPI function specifications. While these guidelines may be sufficient in most cases, they may not be suitable for all situations. In cases of ambiguity or where a specific semantic statement is desired, these guidelines may be superseded as the situation dictates.

      1. All functions, types, and constants are declared within the scope of a namespace called MPI.


      2. Arrays of MPI handles are always left in the argument list (whether they are IN or OUT arguments).


      3. If the argument list of an MPI function contains a scalar IN handle, and it makes sense to define the function as a method of the object corresponding to that handle, the function is made a member function of the corresponding MPI class. The member functions are named according to the corresponding MPI function name, but without the `` MPI_'' prefix and without the object name prefix (if applicable). In addition:

        1. The scalar IN handle is dropped from the argument list, and the C++ this pointer corresponds to the dropped argument.


        2. The function is declared const.


      4. MPI functions are made into class functions (static) when they belong on a class but do not have a unique scalar IN or INOUT parameter of that class.


      5. If the argument list contains a single OUT argument that is not of type MPI_STATUS (or an array), that argument is dropped from the list and the function returns that value.


      Example The C++ binding for MPI_COMM_SIZE is int MPI::Comm::Get_size(void) const.


      6. If there are multiple OUT arguments in the argument list, one is chosen as the return value and is removed from the list.


      7. If the argument list does not contain any OUT arguments, the function returns void.


      Example The C++ binding for MPI_REQUEST_FREE is void MPI::Request::Free(void)


      8. MPI functions to which the above rules do not apply are not members of any class, but are defined in the MPI namespace.


      Example The C++ binding for MPI_BUFFER_ATTACH is void MPI::Attach_buffer(void* buffer, int size).


      9. All class names, defined types, and function names have only their first letter capitalized. Defined constants are in all capital letters.


      10. Any IN pointer, reference, or array argument must be declared const.


      11. Handles are passed by reference.


      12. Array arguments are denoted with square brackets ( []), not pointers, as this is more semantically precise.



    2.7. Processes

    An MPI program consists of autonomous processes, executing their own code, in a MIMD style. The codes executed by each process need not be identical. The processes communicate via calls to MPI communication primitives. Typically, each process executes in its own address space, although shared-memory implementations of MPI are possible.

    This document specifies the behavior of a parallel program assuming that only MPI calls are used. The interaction of an MPI program with other possible means of communication, I/O, and process management is not specified. Unless otherwise stated in the specification of the standard, MPI places no requirements on the result of its interaction with external mechanisms that provide similar or equivalent functionality. This includes, but is not limited to, interactions with external mechanisms for process control, shared and remote memory access, file system access and control, interprocess communication, process signaling, and terminal I/O. High quality implementations should strive to make the results of such interactions intuitive to users, and attempt to document restrictions where deemed necessary.


    Advice to implementors.

    Implementations that support such additional mechanisms for functionality supported within MPI are expected to document how these interact with MPI. ( End of advice to implementors.)
    The interaction of MPI and threads is defined in Section MPI and Threads.



    2.8. Error Handling

    MPI provides the user with reliable message transmission. A message sent is always received correctly, and the user does not need to check for transmission errors, time-outs, or other error conditions. In other words, MPI does not provide mechanisms for dealing with failures in the communication system. If the MPI implementation is built on an unreliable underlying mechanism, then it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as failures. Whenever possible, such failures will be reflected as errors in the relevant communication call. Similarly, MPI itself provides no mechanisms for handling processor failures.

    Of course, MPI programs may still be erroneous. A program error can occur when an MPI call is made with an incorrect argument (non-existing destination in a send operation, buffer too small in a receive operation, etc.). This type of error would occur in any implementation. In addition, a resource error may occur when a program exceeds the amount of available system resources (number of pending messages, system buffers, etc.). The occurrence of this type of error depends on the amount of available resources in the system and the resource allocation mechanism used; this may differ from system to system. A high-quality implementation will provide generous limits on the important resources so as to alleviate the portability problem this represents.

    In C and Fortran, almost all MPI calls return a code that indicates successful completion of the operation. Whenever possible, MPI calls return an error code if an error occurred during the call. By default, an error detected during the execution of the MPI library causes the parallel computation to abort, except for file operations. However, MPI provides mechanisms for users to change this default and to handle recoverable errors. The user may specify that no error is fatal, and handle error codes returned by MPI calls by himself or herself. Also, the user may provide his or her own error-handling routines, which will be invoked whenever an MPI call returns abnormally. The MPI error handling facilities are described in Chapter 7 of the MPI-1 document and in Section Error Handlers of this document. The return values of C++ functions are not error codes. If the default error handler has been set to MPI::ERRORS_THROW_EXCEPTIONS, the C++ exception mechanism is used to signal an error by throwing an MPI::Exception object.

    Several factors limit the ability of MPI calls to return with meaningful error codes when an error occurs. MPI may not be able to detect some errors; other errors may be too expensive to detect in normal execution mode; finally some errors may be ``catastrophic'' and may prevent MPI from returning control to the caller in a consistent state.

    Another subtle issue arises because of the nature of asynchronous communications: MPI calls may initiate operations that continue asynchronously after the call returned. Thus, the operation may return with a code indicating successful completion, yet later cause an error exception to be raised. If there is a subsequent call that relates to the same operation (e.g., a call that verifies that an asynchronous operation has completed) then the error argument associated with this call will be used to indicate the nature of the error. In a few cases, the error may occur after all calls that relate to the operation have completed, so that no error value can be used to indicate the nature of the error (e.g., an error on the receiver in a send with the ready mode). Such an error must be treated as fatal, since information cannot be returned for the user to recover from it.

    This document does not specify the state of a computation after an erroneous MPI call has occurred. The desired behavior is that a relevant error code be returned, and the effect of the error be localized to the greatest possible extent. E.g., it is highly desirable that an erroneous receive call will not cause any part of the receiver's memory to be overwritten, beyond the area specified for receiving the message.

    Implementations may go beyond this document in supporting in a meaningful manner MPI calls that are defined here to be erroneous. For example, MPI specifies strict type matching rules between matching send and receive operations: it is erroneous to send a floating point variable and receive an integer. Implementations may go beyond these type matching rules, and provide automatic type conversion in such situations. It will be helpful to generate warnings for such non-conforming behavior.

    MPI-2 defines a way for users to create new error codes as defined in Section Error Classes, Error Codes, and Error Handlers.



    2.9. Implementation Issues

    There are a number of areas where an MPI implementation may interact with the operating environment and system. While MPI does not mandate that any services (such as signal handling) be provided, it does strongly suggest the behavior to be provided if those services are available. This is an important point in achieving portability across platforms that provide the same set of services.



    2.9.1. Independence of Basic Runtime Routines

    MPI programs require that library routines that are part of the basic language environment (such as write in Fortran and printf and malloc in ANSI C) and are executed after MPI_INIT and before MPI_FINALIZE operate independently and that their completion is independent of the action of other processes in an MPI program.

    Note that this in no way prevents the creation of library routines that provide parallel services whose operation is collective. However, the following program is expected to complete in an ANSI C environment regardless of the size of MPI_COMM_WORLD (assuming that printf is available at the executing nodes).

    #include <stdio.h>
    #include <mpi.h>
    int main(void) {
        int rank;
        MPI_Init(NULL, NULL);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) printf("Starting program\n");
        MPI_Finalize();
        return 0;
    }
    
    The corresponding Fortran and C++ programs are also expected to complete.

    An example of what is not required is any particular ordering of the action of these routines when called by several tasks. For example, MPI makes neither requirements nor recommendations for the output from the following program (again assuming that I/O is available at the executing nodes).

    MPI_Comm_rank(MPI_COMM_WORLD, &rank); 
    printf("Output from task rank %d\n", rank); 
    
    In addition, calls that fail because of resource exhaustion or other error are not considered a violation of the requirements here (however, they are required to complete, just not to complete successfully).



    2.9.2. Interaction with Signals

    MPI does not specify the interaction of processes with signals and does not require that MPI be signal safe. The implementation may reserve some signals for its own use. It is required that the implementation document which signals it uses, and it is strongly recommended that it not use SIGALRM, SIGFPE, or SIGIO. Implementations may also prohibit the use of MPI calls from within signal handlers.

    In multithreaded environments, users can avoid conflicts between signals and the MPI library by catching signals only on threads that do not execute MPI calls. High quality single-threaded implementations will be signal safe: an MPI call suspended by a signal will resume and complete normally after the signal is handled.



    2.10. Examples

    The examples in this document are for illustration purposes only. They are not intended to specify the standard. Furthermore, the examples have not been carefully checked or verified.



    3. Version 1.2 of MPI

    This section contains clarifications and minor corrections to Version 1.1 of the MPI Standard. The only new function in MPI-1.2 is one for identifying which version of the MPI Standard the implementation being used conforms to. There are small differences between MPI-1 and MPI-1.1. There are very few differences (only those discussed in this chapter) between MPI-1.1 and MPI-1.2, but large differences (the rest of this document) between MPI-1.2 and MPI-2.



    3.1. Version Number

    To cope with changes to the MPI Standard, there are both compile-time and run-time ways to determine which version of the standard the current environment implements.

    The "version" is represented by two separate integers, for the version and subversion:

    In C and C++,

        #define MPI_VERSION    1 
        #define MPI_SUBVERSION 2 
    
    in Fortran,
        INTEGER MPI_VERSION, MPI_SUBVERSION 
        PARAMETER (MPI_VERSION    = 1) 
        PARAMETER (MPI_SUBVERSION = 2) 
    

    For runtime determination,

    MPI_GET_VERSION( version, subversion )
    OUT  version      version number (integer)
    OUT  subversion   subversion number (integer)

    int MPI_Get_version(int *version, int *subversion)

    MPI_GET_VERSION(VERSION, SUBVERSION, IERROR)
    INTEGER VERSION, SUBVERSION, IERROR

    MPI_GET_VERSION is one of the few functions that can be called before MPI_INIT and after MPI_FINALIZE. Its C++ binding can be found in the Annex, Section C++ Bindings for New 1.2 Functions.
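
    The two integers are naturally compared as an ordered pair: the version first, and the subversion only when the versions are equal. The following is a minimal sketch of such a check in plain C, with no MPI dependency. The helper name mpi_version_at_least is hypothetical and not part of MPI; in a real program its first two arguments would be the values returned by MPI_Get_version (or the constants MPI_VERSION and MPI_SUBVERSION at compile time).

```c
#include <assert.h>

/* Hypothetical helper, not part of MPI: returns 1 if the pair
 * (version, subversion) is at least (want_version, want_subversion),
 * and 0 otherwise. The version is compared first; the subversion is
 * consulted only when the versions are equal. */
int mpi_version_at_least(int version, int subversion,
                         int want_version, int want_subversion)
{
    assert(version >= 0 && subversion >= 0);
    if (version != want_version)
        return version > want_version;
    return subversion >= want_subversion;
}
```

    For instance, an application that relies on an MPI-1.2 clarification could call MPI_Get_version once and pass the results, together with 1 and 2, to a helper of this kind.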



    3.2. MPI-1.0 and MPI-1.1 Clarifications

    As experience has been gained since the releases of the 1.0 and 1.1 versions of the MPI Standard, it has become apparent that some specifications were insufficiently clear. In this section we attempt to make clear the intentions of the MPI Forum with regard to the behavior of several MPI-1 functions. An MPI-1-compliant implementation should behave in accordance with the clarifications in this section.



    3.2.1. Clarification of MPI_INITIALIZED

    MPI_INITIALIZED returns true if the calling process has called MPI_INIT. Whether MPI_FINALIZE has been called does not affect the behavior of MPI_INITIALIZED.



    3.2.2. Clarification of MPI_FINALIZE

    This routine cleans up all MPI state. Each process must call MPI_FINALIZE before it exits. Unless there has been a call to MPI_ABORT, each process must ensure that all pending non-blocking communications are (locally) complete before calling MPI_FINALIZE. Further, at the instant at which the last process calls MPI_FINALIZE, all pending sends must be matched by a receive, and all pending receives must be matched by a send.

    For example, the following program is correct:

            Process 0                Process 1 
            ---------                --------- 
            MPI_Init();              MPI_Init(); 
            MPI_Send(dest=1);        MPI_Recv(src=0); 
            MPI_Finalize();          MPI_Finalize(); 
    
    Without the matching receive, the program is erroneous:
            Process 0                Process 1 
            -----------              ----------- 
            MPI_Init();              MPI_Init(); 
            MPI_Send (dest=1); 
            MPI_Finalize();          MPI_Finalize(); 
    

    A successful return from a blocking communication operation or from MPI_WAIT or MPI_TEST tells the user that the buffer can be reused and means that the communication is completed by the user, but does not guarantee that the local process has no more work to do. A successful return from MPI_REQUEST_FREE with a request handle generated by an MPI_ISEND nullifies the handle but provides no assurance of operation completion. The MPI_ISEND is complete only when it is known by some means that a matching receive has completed. MPI_FINALIZE guarantees that all local actions required by communications the user has completed will, in fact, occur before it returns.

    MPI_FINALIZE guarantees nothing about pending communications that have not been completed (completion is assured only by MPI_WAIT, MPI_TEST, or MPI_REQUEST_FREE combined with some other verification of completion).


    Example This program is correct:

    rank 0                          rank 1 
    ===================================================== 
    ...                             ... 
    MPI_Isend();                    MPI_Recv(); 
    MPI_Request_free();             MPI_Barrier(); 
    MPI_Barrier();                  MPI_Finalize(); 
    MPI_Finalize();                 exit(); 
    exit();                         
    


    Example This program is erroneous and its behavior is undefined:

    rank 0                          rank 1 
    ===================================================== 
    ...                             ... 
    MPI_Isend();                    MPI_Recv(); 
    MPI_Request_free();             MPI_Finalize(); 
    MPI_Finalize();                 exit(); 
    exit();                         
    

    If no MPI_BUFFER_DETACH occurs between an MPI_BSEND (or other buffered send) and MPI_FINALIZE, the MPI_FINALIZE implicitly supplies the MPI_BUFFER_DETACH.


    Example This program is correct, and after the MPI_Finalize, it is as if the buffer had been detached.

    rank 0                          rank 1 
    ===================================================== 
    ...                             ... 
    buffer = malloc(1000000);       MPI_Recv(); 
    MPI_Buffer_attach();            MPI_Finalize(); 
    MPI_Bsend();                    exit();               
    MPI_Finalize(); 
    free(buffer); 
    exit();                         
    


    Example In this example, MPI_Iprobe() must return a FALSE flag. MPI_Test_cancelled() must return a TRUE flag, independent of the relative order of execution of MPI_Cancel() in process 0 and MPI_Finalize() in process 1.

    The MPI_Iprobe() call is there to make sure the implementation knows that the "tag1" message exists at the destination, without being able to claim that the user knows about it.


    rank 0                          rank 1 
    ======================================================== 
    MPI_Init();                     MPI_Init(); 
    MPI_Isend(tag1); 
    MPI_Barrier();                  MPI_Barrier(); 
                                    MPI_Iprobe(tag2); 
    MPI_Barrier();                  MPI_Barrier(); 
                                    MPI_Finalize(); 
                                    exit(); 
    MPI_Cancel(); 
    MPI_Wait(); 
    MPI_Test_cancelled(); 
    MPI_Finalize(); 
    exit(); 
     
    

    Advice to implementors.

    An implementation may need to delay the return from MPI_FINALIZE until all potential future message cancellations have been processed. One possible solution is to place a barrier inside MPI_FINALIZE. (End of advice to implementors.)

    Once MPI_FINALIZE returns, no MPI routine (not even MPI_INIT) may be called, except for MPI_GET_VERSION, MPI_INITIALIZED, and the MPI-2 function MPI_FINALIZED. Each process must complete any pending communication it initiated before it calls MPI_FINALIZE. If the call returns, each process may continue local computations, or exit, without participating in further MPI communication with other processes. MPI_FINALIZE is collective on MPI_COMM_WORLD.


    Advice to implementors.

    Even though a process has completed all the communication it initiated, such communication may not yet be completed from the viewpoint of the underlying MPI system. E.g., a blocking send may have completed, even though the data is still buffered at the sender. The MPI implementation must ensure that a process has completed any involvement in MPI communication before MPI_FINALIZE returns. Thus, if a process exits after the call to MPI_FINALIZE, this will not cause an ongoing communication to fail. (End of advice to implementors.)

    Although it is not required that all processes return from MPI_FINALIZE, it is required that at least process 0 in MPI_COMM_WORLD return, so that users can know that the MPI portion of the computation is over. In addition, in a POSIX environment, they may desire to supply an exit code for each process that returns from MPI_FINALIZE.


    Example The following illustrates why it is useful to require that at least one process return, and that process 0 be known to be one of the processes that return. One wants code like the following to work no matter how many processes return.


        ... 
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
        ... 
        MPI_Finalize(); 
        if (myrank == 0) { 
            resultfile = fopen("outfile","w"); 
            dump_results(resultfile); 
            fclose(resultfile); 
        } 
        exit(0); 
    



    3.2.3. Clarification of status after MPI_WAIT and MPI_TEST

    The fields in a status object returned by a call to MPI_WAIT, MPI_TEST, or any of the other derived functions (MPI_{TEST,WAIT}{ALL,SOME,ANY}), where the request corresponds to a send call, are undefined, with two exceptions: The error status field will contain valid information if the wait or test call returned with MPI_ERR_IN_STATUS; and the returned status can be queried by the call MPI_TEST_CANCELLED.

    Error codes belonging to the error class MPI_ERR_IN_STATUS should be returned only by the MPI completion functions that take arrays of MPI_STATUS. For the functions (MPI_TEST, MPI_TESTANY, MPI_WAIT, MPI_WAITANY) that return a single MPI_STATUS value, the normal MPI error return process should be used (not the MPI_ERROR field in the MPI_STATUS argument).



    3.2.4. Clarification of MPI_INTERCOMM_CREATE

    The Problem: The MPI-1.1 standard says, in the discussion of MPI_INTERCOMM_CREATE, both that "The groups must be disjoint" and that "The leaders may be the same process." To further muddy the waters, the reason given for "The groups must be disjoint" is based on concerns about the implementation of MPI_INTERCOMM_CREATE that are not applicable for the case where the leaders are the same process.

    The Fix: Delete the text "(the two leaders could be the same process)" from the discussion of MPI_INTERCOMM_CREATE.

    Replace the text "All inter-communicator constructors are blocking and require that the local and remote groups be disjoint in order to avoid deadlock." with

    All inter-communicator constructors are blocking and require that the local and remote groups be disjoint.


    Advice to users.

    The groups must be disjoint for several reasons. Primarily, this is the intent of the intercommunicators: to provide a communicator for communication between disjoint groups. This is reflected in the definition of MPI_INTERCOMM_MERGE, which allows the user to control the ranking of the processes in the created intracommunicator; this ranking makes little sense if the groups are not disjoint. In addition, the natural extension of collective operations to intercommunicators makes the most sense when the groups are disjoint. (End of advice to users.)



    3.2.5. Clarification of MPI_INTERCOMM_MERGE

    The error handler on the new intercommunicator in each process is inherited from the communicator that contributes the local group. Note that this can result in different processes in the same communicator having different error handlers.



    3.2.6. Clarification of Binding of MPI_TYPE_SIZE

    This clarification is needed in the MPI-1 description of MPI_TYPE_SIZE, since the issue repeatedly arises. It is a clarification of the binding.


    Advice to users.

    The MPI-1 Standard specifies that the output argument of MPI_TYPE_SIZE in C is of type int. The MPI Forum considered proposals to change this and decided to reiterate the original decision. (End of advice to users.)



    3.2.7. Clarification of MPI_REDUCE

    The current text on p. 115, lines 25-28, from MPI-1.1 (June 12, 1995) says:

    The datatype argument of MPI_REDUCE must be compatible with op. Predefined operators work only with the MPI types listed in Section 4.9.2 and Section 4.9.3. User-defined operators may operate on general, derived datatypes.

    This text is changed to:

    The datatype argument of MPI_REDUCE must be compatible with op. Predefined operators work only with the MPI types listed in Section 4.9.2 and Section 4.9.3. Furthermore, the datatype and op given for predefined operators must be the same on all processes.

    Note that it is possible for users to supply different user-defined operations to MPI_REDUCE in each process. MPI does not define which operations are used on which operands in this case.


    Advice to users.

    Users should make no assumptions about how MPI_REDUCE is implemented. Safest is to ensure that the same function is passed to MPI_REDUCE by each process. (End of advice to users.)

    Overlapping datatypes are permitted in "send" buffers. Overlapping datatypes in "receive" buffers are erroneous and may give unpredictable results.



    3.2.8. Clarification of Error Behavior of Attribute Callback Functions

    If an attribute copy function or attribute delete function returns other than MPI_SUCCESS, then the call that caused it to be invoked (for example, MPI_COMM_FREE) is erroneous.



    3.2.9. Clarification of MPI_PROBE and MPI_IPROBE

    Page 52, lines 1 through 3 (of MPI-1.1, the June 12, 1995 version without changebars) become:

    "A subsequent receive executed with the same communicator, and the source and tag returned in status by MPI_IPROBE will receive the message that was matched by the probe, if no other intervening receive occurs after the probe, and the send is not successfully cancelled before the receive."


    Rationale.

    The following program shows that the MPI-1 definitions of cancel and probe are in conflict:


    Process 0                        Process 1 
    ----------                       ---------- 
    MPI_Init();                      MPI_Init(); 
    MPI_Isend(dest=1);             
                                     MPI_Probe(); 
    MPI_Barrier();                   MPI_Barrier(); 
    MPI_Cancel();                     
    MPI_Wait();                       
    MPI_Test_cancelled();             
    MPI_Barrier();                   MPI_Barrier(); 
                                     MPI_Recv(); 
     
    
    Since the send has been cancelled by process 0, the wait must be local (page 54, line 13) and must return before the matching receive. For the wait to be local, the send must be successfully cancelled, and therefore must not match the receive in process 1 (page 54 line 29).

    However, it is clear that the probe on process 1 must eventually detect an incoming message. Page 52 line 1 makes it clear that the subsequent receive by process 1 must return the probed message.

    The above are clearly contradictory, and therefore the text "... and the send is not successfully cancelled before the receive" must be added to line 3 of page 52.

    An alternative solution (rejected) would be to change the semantics of cancel so that the call is not local if the message has been probed. This adds complexity to implementations, and adds a new concept of "state" to a message (probed or not). It would, however, preserve the feature that a blocking receive after a probe is local.

    (End of rationale.)



    3.2.10. Minor Corrections

    The following corrections are made to MPI-1.1 (all page and line numbers are for the June 12, 1995 version without changebars):




    4. Miscellany

    This chapter contains topics that do not fit conveniently into other chapters.



    4.1. Portable MPI Process Startup

    A number of implementations of MPI-1 provide a startup command for MPI programs that is of the form

        mpirun <mpirun arguments> <program> <program arguments> 
    
    Separating the command to start the program from the program itself provides flexibility, particularly for network and heterogeneous implementations. For example, the startup script need not run on one of the machines that will be executing the MPI program itself.

    Having a standard startup mechanism also extends the portability of MPI programs one step further, to the command lines and scripts that manage them. For example, a validation suite script that runs hundreds of programs can be a portable script if it is written using such a standard startup mechanism. So that the "standard" command is not confused with existing practice, which is not standard and not portable among implementations, MPI specifies mpiexec instead of mpirun.

    While a standardized startup mechanism improves the usability of MPI, the range of environments is so diverse (e.g., there may not even be a command line interface) that MPI cannot mandate such a mechanism. Instead, MPI specifies an mpiexec startup command and recommends but does not require it, as advice to implementors. However, if an implementation does provide a command called mpiexec, it must be of the form described below.

    It is suggested that

        mpiexec -n <numprocs> <program> 
    
    be at least one way to start <program> with an initial MPI_COMM_WORLD whose group contains <numprocs> processes. Other arguments to mpiexec may be implementation-dependent.

    This is advice to implementors, rather than a required part of MPI-2. It is not suggested that this be the only way to start MPI programs. If an implementation does provide a command called mpiexec, however, it must be of the form described here.


    Advice to implementors.

    Implementors, if they do provide a special startup command for MPI programs, are advised to give it the following form. The syntax is chosen so that mpiexec can be viewed as a command-line version of MPI_COMM_SPAWN (see Section Reserved Keys).

    Analogous to MPI_COMM_SPAWN, we have


        mpiexec -n    <maxprocs> 
               -soft  <        > 
               -host  <        > 
               -arch  <        > 
               -wdir  <        > 
               -path  <        > 
               -file  <        > 
                ... 
               <command line> 
    
    for the case where a single command line for the application program and its arguments will suffice. See Section Reserved Keys for the meanings of these arguments. For the case corresponding to MPI_COMM_SPAWN_MULTIPLE there are two possible formats:

    Form A:


        mpiexec { <above arguments> } : { ... } : { ... } : ... : { ... } 
    
    As with MPI_COMM_SPAWN, all the arguments are optional. (Even the -n x argument is optional; the default is implementation dependent. It might be 1, it might be taken from an environment variable, or it might be specified at compile time.) The names and meanings of the arguments are taken from the keys in the info argument to MPI_COMM_SPAWN. There may be other, implementation-dependent arguments as well.

    Note that Form A, though convenient to type, prevents colons from being program arguments. Therefore an alternate, file-based form is allowed:

    Form B:


        mpiexec -configfile <filename> 
    
    where the lines of <filename> are of the form separated by the colons in Form A. Lines beginning with '#' are comments, and lines may be continued by terminating the partial line with '\'.


    Example Start 16 instances of myprog on the current or default machine:

        mpiexec -n 16 myprog 
    

    Example Start 10 processes on the machine called ferrari:
        mpiexec -n 10 -host ferrari myprog 
    

    Example Start three copies of the same program with different command-line arguments:
        mpiexec myprog infile1 : myprog infile2 : myprog infile3 
    

    Example Start the ocean program on five Suns and the atmos program on 10 RS/6000's:
        mpiexec -n 5 -arch sun ocean : -n 10 -arch rs6000 atmos 
    
    It is assumed that the implementation in this case has a method for choosing hosts of the appropriate type. Their ranks are in the order specified.

    Example Start the ocean program on five Suns and the atmos program on 10 RS/6000's (Form B):
        mpiexec -configfile myfile 
    
    where myfile contains
        -n 5  -arch sun    ocean  
        -n 10 -arch rs6000 atmos 
    

    (End of advice to implementors.)
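
    To make the Form A syntax concrete, the following is a minimal sketch in plain C, with no MPI dependency, of how a launcher might divide its arguments into the colon-separated segments described above. The function name split_form_a and its interface are hypothetical and not part of MPI; a real mpiexec would go on to interpret options such as -n and -host within each segment, which this sketch does not attempt.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of MPI: scan the arguments that follow
 * "mpiexec" and record, for each colon-separated segment, the index of
 * its first argument and the number of arguments it contains.
 * Returns the number of segments, or -1 if there are more than
 * max_segments. An empty segment (for example between two adjacent
 * colons, or for an empty argument list) is recorded with length 0. */
int split_form_a(int argc, char *argv[],
                 int starts[], int lengths[], int max_segments)
{
    int nseg = 0;   /* segments recorded so far */
    int begin = 0;  /* index of the first argument of the current segment */

    assert(argc >= 0 && max_segments >= 0);
    for (int i = 0; i <= argc; i++) {
        /* A segment ends at a lone ":" argument or at the end of the list. */
        if (i == argc || strcmp(argv[i], ":") == 0) {
            if (nseg == max_segments)
                return -1;
            starts[nseg] = begin;
            lengths[nseg] = i - begin;
            nseg++;
            begin = i + 1;
        }
    }
    return nseg;
}
```

    Because a lone ":" argument always ends a segment, this sketch shows directly why Form A prevents a colon from being passed as a program argument, which is the motivation given for Form B.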



    4.2. Passing NULL to MPI_Init

    In MPI-1.1, it is explicitly stated that an implementation is allowed to require that the arguments argc and argv passed by an application to MPI_INIT in C be the same arguments passed into the application as the arguments to main. In MPI-2, implementations are not allowed to impose this requirement. Conforming implementations of MPI are required to allow applications to pass NULL for both the argc and argv arguments of main. In C++, there is an alternative binding for MPI::Init that does not have these arguments at all.


    Rationale.

    In some applications, libraries may be making the call to MPI_Init, and may not have access to argc and argv from main. It is anticipated that applications requiring special information about the environment or information supplied by mpiexec can get that information from environment variables. (End of rationale.)



    4.3. Version Number

    The values for the MPI_VERSION and MPI_SUBVERSION for an MPI-2 implementation are 2 and 0 respectively. This applies both to the values of the above constants and to the values returned by MPI_GET_VERSION.



    4.4. Datatype Constructor MPI_TYPE_CREATE_INDEXED_BLOCK

    This function is the same as MPI_TYPE_INDEXED except that the blocklength is the same for all blocks. There are many codes using indirect addressing arising from unstructured grids where the blocksize is always 1 (gather/scatter). The following convenience function allows for constant blocksize and arbitrary displacements.

    MPI_TYPE_CREATE_INDEXED_BLOCK(count, blocklength, array_of_displacements, oldtype, newtype)
    IN   count                    length of array of displacements (integer)
    IN   blocklength              size of block (integer)
    IN   array_of_displacements   array of displacements (array of integer)
    IN   oldtype                  old datatype (handle)
    OUT  newtype                  new datatype (handle)

    int MPI_Type_create_indexed_block(int count, int blocklength, int array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype)

    MPI_TYPE_CREATE_INDEXED_BLOCK(COUNT, BLOCKLENGTH, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR

    MPI::Datatype MPI::Datatype::Create_indexed_block( int count, int blocklength, const int array_of_displacements[]) const
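
    To illustrate which elements such a datatype selects, the following is a minimal sketch in plain C, with no MPI dependency, that enumerates the element offsets described by a count, a constant blocklength, and an array of displacements. It assumes, as for MPI_TYPE_INDEXED, that displacements are counted in multiples of the extent of oldtype. The helper name indexed_block_offsets is hypothetical and not part of MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI: write to out[] the element
 * offsets (in units of the old datatype's extent) covered by an
 * indexed-block layout, block by block in the order given.
 * out must have room for count * blocklength entries.
 * Returns the number of offsets written. */
size_t indexed_block_offsets(int count, int blocklength,
                             const int array_of_displacements[],
                             int out[])
{
    size_t n = 0;

    assert(count >= 0 && blocklength >= 0);
    for (int i = 0; i < count; i++)            /* one block per displacement */
        for (int j = 0; j < blocklength; j++)  /* consecutive elements in it */
            out[n++] = array_of_displacements[i] + j;
    return n;
}
```

    With blocklength equal to 1 the offsets are just the displacements themselves, which corresponds to the gather/scatter case mentioned above.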



    4.5. Treatment of MPI_Status

    The following features add to, but do not change, the functionality associated with MPI_STATUS.



    4.5.1. Passing MPI_STATUS_IGNORE for Status

    Every call to MPI_RECV includes a status argument, wherein the system can return details about the message received. There are also a number of other MPI calls, particularly in MPI-2, where status is returned. An object of type MPI_STATUS is not an MPI opaque object; its structure is declared in mpi.h and mpif.h, and it exists in the user's program. In many cases, application programs are constructed so that it is unnecessary for them to examine the status fields. In these cases, it is a waste for the user to allocate a status object, and it is particularly wasteful for the MPI implementation to fill in fields in this object.

    To cope with this problem, there are two predefined constants, MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE, which when passed to a receive, wait, or test function, inform the implementation that the status fields are not to be filled in. Note that MPI_STATUS_IGNORE is not a special type of MPI_STATUS object; rather, it is a special value for the argument. In C one would expect it to be NULL, not the address of a special MPI_STATUS.

    MPI_STATUS_IGNORE, and the array version MPI_STATUSES_IGNORE, can be used everywhere a status argument is passed to a receive, wait, or test function. MPI_STATUS_IGNORE cannot be used when status is an IN argument. Note that in Fortran MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE are objects like MPI_BOTTOM (not usable for initialization or assignment). See Section Named Constants.

    In general, this optimization can apply to all functions for which status or an array of statuses is an OUT argument. Note that this converts status into an INOUT argument. The functions that can be passed MPI_STATUS_IGNORE are all the various forms of MPI_RECV, MPI_TEST, and MPI_WAIT, as well as MPI_REQUEST_GET_STATUS. When an array is passed, as in the ANY and ALL functions, a separate constant, MPI_STATUSES_IGNORE, is passed for the array argument. It is possible for an MPI function to return MPI_ERR_IN_STATUS even when MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE has been passed to that function.

    MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE are not required to have the same values in C and Fortran.

    It is not allowed to have some of the statuses in an array of statuses for _ANY and _ALL functions set to MPI_STATUS_IGNORE; one either specifies ignoring all of the statuses in such a call with MPI_STATUSES_IGNORE, or none of them by passing normal statuses in all positions in the array of statuses.

    There are no C++ bindings for MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE. To allow an OUT or INOUT MPI::Status argument to be ignored, all MPI C++ bindings that have OUT or INOUT MPI::Status parameters are overloaded with a second version that omits the OUT or INOUT MPI::Status parameter.


    Example The C++ bindings for MPI_PROBE are:

    void MPI::Comm::Probe(int source, int tag, MPI::Status& status) const

    void MPI::Comm::Probe(int source, int tag) const



    4.5.2. Non-destructive Test of status

    This call is useful for accessing the information associated with a request, without freeing the request (in case the user is expected to access it later). It allows one to layer libraries more conveniently, since multiple layers of software may access the same completed request and extract from it the status information.

    MPI_REQUEST_GET_STATUS( request, flag, status )
    IN   request   request (handle)
    OUT  flag      boolean flag, same as from MPI_TEST (logical)
    OUT  status    MPI_STATUS object if flag is true (Status)

    int MPI_Request_get_status(MPI_Request request, int *flag, MPI_Status *status)

    MPI_REQUEST_GET_STATUS( REQUEST, FLAG, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
    LOGICAL FLAG

    bool MPI::Request::Get_status(MPI::Status& status) const
    bool MPI::Request::Get_status() const

    Sets flag=true if the operation is complete, and, if so, returns in status the request status. However, unlike test or wait, it does not deallocate or inactivate the request; a subsequent call to test, wait or free should be executed with that request. It sets flag=false if the operation is not complete.



    4.6. Error Class for Invalid Keyval

    Key values for attributes are system-allocated, by MPI_{TYPE,COMM,WIN}_CREATE_KEYVAL. Only such values can be passed to the functions that use key values as input arguments. In order to signal that an erroneous key value has been passed to one of these functions, there is a new MPI error class: MPI_ERR_KEYVAL. It can be returned by MPI_ATTR_PUT, MPI_ATTR_GET, MPI_ATTR_DELETE, MPI_KEYVAL_FREE, MPI_{TYPE,COMM,WIN}_DELETE_ATTR, MPI_{TYPE,COMM,WIN}_SET_ATTR, MPI_{TYPE,COMM,WIN}_GET_ATTR, MPI_{TYPE,COMM,WIN}_FREE_KEYVAL, MPI_COMM_DUP, MPI_COMM_DISCONNECT, and MPI_COMM_FREE. The last three are included because keyval is an argument to the copy and delete functions for attributes.





    4.7. Committing a Committed Datatype



    In MPI-1.2, the effect of calling MPI_TYPE_COMMIT with a datatype that is already committed is not specified. For MPI-2, it is specified that MPI_TYPE_COMMIT will accept a committed datatype; in this case, it is equivalent to a no-op.





    4.8. Allowing User Functions at Process Termination



    There are times when it would be convenient to have actions happen when an MPI process finishes. For example, a routine may do initializations that are useful until the MPI job (or that part of the job that is being terminated, in the case of dynamically created processes) is finished. This can be accomplished in MPI-2 by attaching an attribute to MPI_COMM_SELF with a callback function. When MPI_FINALIZE is called, it will first execute the equivalent of an MPI_COMM_FREE on MPI_COMM_SELF. This will cause the delete callback function to be executed on all keys associated with MPI_COMM_SELF, in an arbitrary order. If no key has been attached to MPI_COMM_SELF, then no callback is invoked. The ``freeing'' of MPI_COMM_SELF occurs before any other parts of MPI are affected. Thus, for example, calling MPI_FINALIZED will return false in any of these callback functions. Once done with MPI_COMM_SELF, the order and rest of the actions taken by MPI_FINALIZE are not specified.


    Advice to implementors.

    Since attributes can be added from any supported language, the MPI implementation needs to remember the creating language so the correct callback is made. ( End of advice to implementors.)





    4.9. Determining Whether MPI Has Finished



    One of the goals of MPI was to allow for layered libraries. In order for a library to do this cleanly, it needs to know if MPI is active. In MPI-1 the function MPI_INITIALIZED was provided to tell if MPI had been initialized. The problem arises in knowing if MPI has been finalized. Once MPI has been finalized it is no longer active and cannot be restarted. A library needs to be able to determine this to act accordingly. To achieve this the following function is needed:

    MPI_FINALIZED(flag)
    OUT  flag   true if MPI was finalized (logical)

    int MPI_Finalized(int *flag)

    MPI_FINALIZED(FLAG, IERROR)
    LOGICAL FLAG
    INTEGER IERROR

    bool MPI::Is_finalized()

    This routine returns true if MPI_FINALIZE has completed. It is legal to call MPI_FINALIZED before MPI_INIT and after MPI_FINALIZE.


    Advice to users.

    MPI is ``active'' and it is thus safe to call MPI functions if MPI_INIT has completed and MPI_FINALIZE has not completed. If a library has no other way of knowing whether MPI is active or not, then it can use MPI_INITIALIZED and MPI_FINALIZED to determine this. For example, MPI is ``active'' in callback functions that are invoked during MPI_FINALIZE. ( End of advice to users.)





    4.10. The Info Object



    Many of the routines in MPI-2 take an argument info. info is an opaque object with a handle of type MPI_Info in C, MPI::Info in C++, and INTEGER in Fortran. It consists of (key, value) pairs (both key and value are strings). A key may have only one value. MPI reserves several keys and requires that if an implementation uses a reserved key, it must provide the specified functionality. An implementation is not required to support these keys and may support any others not reserved by MPI.

    If a function does not recognize a key, it will ignore it, unless otherwise specified. If an implementation recognizes a key but does not recognize the format of the corresponding value, the result is undefined.

    Keys have an implementation-defined maximum length of MPI_MAX_INFO_KEY, which is at least 32 and at most 255. Values have an implementation-defined maximum length of MPI_MAX_INFO_VAL. In Fortran, leading and trailing spaces are stripped from both. Returned values will never be larger than these maximum lengths. Both key and value are case sensitive.


    Rationale.

    Keys have a maximum length because the set of known keys will always be finite and known to the implementation and because there is no reason for keys to be complex. The small maximum size allows applications to declare keys of size MPI_MAX_INFO_KEY. The limitation on value sizes is so that an implementation is not forced to deal with arbitrarily long strings. ( End of rationale.)

    Advice to users.

    MPI_MAX_INFO_VAL might be very large, so it might not be wise to declare a string of that size. ( End of advice to users.)

    When it is an argument to a non-blocking routine, info is parsed before that routine returns, so that it may be modified or freed immediately after return.

    When the descriptions refer to a key or value as being a boolean, an integer, or a list, they mean the string representation of these types. An implementation may define its own rules for how info value strings are converted to other types, but to ensure portability, every implementation must support the following representations. Legal values for a boolean must include the strings ``true'' and ``false'' (all lowercase). For integers, legal values must include string representations of decimal values of integers that are within the range of a standard integer type in the program. (However it is possible that not every legal integer is a legal value for a given key.) On positive numbers, + signs are optional. No space may appear between a + or - sign and the leading digit of a number. For comma separated lists, the string must contain legal elements separated by commas. Leading and trailing spaces are stripped automatically from the types of info values described above and for each element of a comma separated list. These rules apply to all info values of these types. Implementations are free to specify a different interpretation for values of other info keys.
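
    The portable string representations above can be illustrated with a small parser in plain C. This is only a sketch of the minimum rules (lowercase ``true''/``false'', signed decimal integers, comma separated lists, spaces stripped around each element); the function names are hypothetical, and a real implementation is free to accept additional forms.

```c
#include <limits.h>
#include <string.h>

/* Skips leading and trailing spaces of the n-character span at s. */
static const char *trim(const char *s, size_t n, size_t *len)
{
    while (n > 0 && s[0] == ' ') { s++; n--; }
    while (n > 0 && s[n - 1] == ' ') n--;
    *len = n;
    return s;
}

/* Accepts "true" or "false" (all lowercase); returns 0 on success. */
int info_parse_bool(const char *value, int *out)
{
    size_t len;
    const char *p = trim(value, strlen(value), &len);

    if (len == 4 && memcmp(p, "true", 4) == 0)  { *out = 1; return 0; }
    if (len == 5 && memcmp(p, "false", 5) == 0) { *out = 0; return 0; }
    return -1;
}

/* Parses one decimal integer from an n-character span. An optional
 * sign must be followed immediately by a digit. */
static int parse_int_n(const char *s, size_t n, int *out)
{
    size_t len, i = 0;
    const char *p = trim(s, n, &len);
    long long sign = 1, acc = 0;

    if (len > 0 && (p[0] == '+' || p[0] == '-')) {
        sign = (p[0] == '-') ? -1 : 1;
        i = 1;
    }
    if (i == len)
        return -1;                       /* no digits at all */
    for (; i < len; i++) {
        if (p[i] < '0' || p[i] > '9')
            return -1;                   /* includes a space after the sign */
        acc = acc * 10 + (p[i] - '0');
        if (acc > (long long)INT_MAX + 1)
            return -1;                   /* outside the range of int */
    }
    acc *= sign;
    if (acc > INT_MAX || acc < INT_MIN)
        return -1;
    *out = (int)acc;
    return 0;
}

int info_parse_int(const char *value, int *out)
{
    return parse_int_n(value, strlen(value), out);
}

/* Parses a comma separated list of integers into out[0..max-1].
 * Returns the number of elements, or -1 on any malformed element. */
int info_parse_int_list(const char *value, int *out, int max)
{
    int count = 0;
    const char *start = value;

    for (;;) {
        const char *comma = strchr(start, ',');
        size_t n = comma ? (size_t)(comma - start) : strlen(start);

        if (count == max || parse_int_n(start, n, &out[count]) != 0)
            return -1;
        count++;
        if (comma == NULL)
            return count;
        start = comma + 1;
    }
}
```

    With these helpers, the value " +42 " parses as 42, while "+ 3" is rejected because a space separates the sign from the first digit.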

    MPI_INFO_CREATE(info)
    OUT  info   info object created (handle)
    int MPI_Info_create(MPI_Info *info)
    MPI_INFO_CREATE(INFO, IERROR)
    INTEGER INFO, IERROR
    static MPI::Info MPI::Info::Create()

    MPI_INFO_CREATE creates a new info object. The newly created object contains no key/value pairs.

    MPI_INFO_SET(info, key, value)
    INOUT  info    info object (handle)
    IN     key     key (string)
    IN     value   value (string)
    int MPI_Info_set(MPI_Info info, char *key, char *value)
    MPI_INFO_SET(INFO, KEY, VALUE, IERROR)
    INTEGER INFO, IERROR
    CHARACTER*(*) KEY, VALUE
    void MPI::Info::Set(const char* key, const char* value)

    MPI_INFO_SET adds the (key,value) pair to info, and overrides the value if a value for the same key was previously set. key and value are null-terminated strings in C. In Fortran, leading and trailing spaces in key and value are stripped. If key or value is larger than the allowed maximum, the error MPI_ERR_INFO_KEY or MPI_ERR_INFO_VALUE is raised, respectively.

    MPI_INFO_DELETE(info, key)
    INOUT  info   info object (handle)
    IN     key    key (string)
    int MPI_Info_delete(MPI_Info info, char *key)
    MPI_INFO_DELETE(INFO, KEY, IERROR)
    INTEGER INFO, IERROR
    CHARACTER*(*) KEY
    void MPI::Info::Delete(const char* key)

    MPI_INFO_DELETE deletes a (key,value) pair from info. If key is not defined in info, the call raises an error of class MPI_ERR_INFO_NOKEY.

    MPI_INFO_GET(info, key, valuelen, value, flag)
    IN   info       info object (handle)
    IN   key        key (string)
    IN   valuelen   length of value arg (integer)
    OUT  value      value (string)
    OUT  flag       true if key defined, false if not (boolean)
    int MPI_Info_get(MPI_Info info, char *key, int valuelen, char *value, int *flag)
    MPI_INFO_GET(INFO, KEY, VALUELEN, VALUE, FLAG, IERROR)
    INTEGER INFO, VALUELEN, IERROR
    CHARACTER*(*) KEY, VALUE
    LOGICAL FLAG
    bool MPI::Info::Get(const char* key, int valuelen, char* value) const

    This function retrieves the value associated with key in a previous call to MPI_INFO_SET. If such a key exists, it sets flag to true and returns the value in value, otherwise it sets flag to false and leaves value unchanged. valuelen is the number of characters available in value. If it is less than the actual size of the value, the value is truncated. In C, valuelen should be one less than the amount of allocated space to allow for the null terminator.

    If key is larger than MPI_MAX_INFO_KEY, the call is erroneous.

    MPI_INFO_GET_VALUELEN(info, key, valuelen, flag)
    IN   info       info object (handle)
    IN   key        key (string)
    OUT  valuelen   length of value arg (integer)
    OUT  flag       true if key defined, false if not (boolean)
    int MPI_Info_get_valuelen(MPI_Info info, char *key, int *valuelen, int *flag)
    MPI_INFO_GET_VALUELEN(INFO, KEY, VALUELEN, FLAG, IERROR)
    INTEGER INFO, VALUELEN, IERROR
    LOGICAL FLAG
    CHARACTER*(*) KEY
    bool MPI::Info::Get_valuelen(const char* key, int& valuelen) const

    Retrieves the length of the value associated with key. If key is defined, valuelen is set to the length of its associated value and flag is set to true. If key is not defined, valuelen is not touched and flag is set to false. The length returned in C or C++ does not include the end-of-string character.

    If key is larger than MPI_MAX_INFO_KEY, the call is erroneous.

    MPI_INFO_GET_NKEYS(info, nkeys)
    IN   info    info object (handle)
    OUT  nkeys   number of defined keys (integer)
    int MPI_Info_get_nkeys(MPI_Info info, int *nkeys)
    MPI_INFO_GET_NKEYS(INFO, NKEYS, IERROR)
    INTEGER INFO, NKEYS, IERROR
    int MPI::Info::Get_nkeys() const

    MPI_INFO_GET_NKEYS returns the number of currently defined keys in info.

    MPI_INFO_GET_NTHKEY(info, n, key)
    IN   info   info object (handle)
    IN   n      key number (integer)
    OUT  key    key (string)
    int MPI_Info_get_nthkey(MPI_Info info, int n, char *key)
    MPI_INFO_GET_NTHKEY(INFO, N, KEY, IERROR)
    INTEGER INFO, N, IERROR
    CHARACTER*(*) KEY
    void MPI::Info::Get_nthkey(int n, char* key) const

    This function returns the nth defined key in info. Keys are numbered 0 ... N-1 where N is the value returned by MPI_INFO_GET_NKEYS. All keys between 0 and N-1 are guaranteed to be defined. The number of a given key does not change as long as info is not modified with MPI_INFO_SET or MPI_INFO_DELETE.

    MPI_INFO_DUP(info, newinfo)
    IN   info      info object (handle)
    OUT  newinfo   info object (handle)
    int MPI_Info_dup(MPI_Info info, MPI_Info *newinfo)
    MPI_INFO_DUP(INFO, NEWINFO, IERROR)
    INTEGER INFO, NEWINFO, IERROR
    MPI::Info MPI::Info::Dup() const

    MPI_INFO_DUP duplicates an existing info object, creating a new object, with the same (key,value) pairs and the same ordering of keys.

    MPI_INFO_FREE(info)
    INOUT  info   info object (handle)
    int MPI_Info_free(MPI_Info *info)
    MPI_INFO_FREE(INFO, IERROR)
    INTEGER INFO, IERROR
    void MPI::Info::Free()

    This function frees info and sets it to MPI_INFO_NULL. The value of an info argument is interpreted each time the info is passed to a routine. Changes to an info after return from a routine do not affect that interpretation.





    4.11. Memory Allocation



    In some systems, message-passing and remote-memory-access (RMA) operations run faster when accessing specially allocated memory (e.g., memory that is shared by the other processes in the communicating group on an SMP). MPI provides a mechanism for allocating and freeing such special memory. The use of such memory for message passing or RMA is not mandatory, and this memory can be used without restrictions as any other dynamically allocated memory. However, implementations may restrict the use of the MPI_WIN_LOCK and MPI_WIN_UNLOCK functions to windows allocated in such memory (see Section Lock).

    MPI_ALLOC_MEM(size, info, baseptr)
    IN   size      size of memory segment in bytes (nonnegative integer)
    IN   info      info argument (handle)
    OUT  baseptr   pointer to beginning of memory segment allocated

    int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)

    MPI_ALLOC_MEM(SIZE, INFO, BASEPTR, IERROR)
    INTEGER INFO, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR

    void* MPI::Alloc_mem(MPI::Aint size, const MPI::Info& info)

    The info argument can be used to provide directives that control the desired location of the allocated memory. Such a directive does not affect the semantics of the call. Valid info values are implementation-dependent; a null directive value of info = MPI_INFO_NULL is always valid.

    The function MPI_ALLOC_MEM may return an error code of class MPI_ERR_NO_MEM to indicate it failed because memory is exhausted.

    MPI_FREE_MEM(base)
    IN   base   initial address of memory segment allocated by MPI_ALLOC_MEM (choice)

    int MPI_Free_mem(void *base)

    MPI_FREE_MEM(BASE, IERROR)
    <type> BASE(*)
    INTEGER IERROR

    void MPI::Free_mem(void *base)

    The function MPI_FREE_MEM may return an error code of class MPI_ERR_BASE to indicate an invalid base argument.


    Rationale.

    The C and C++ bindings of MPI_ALLOC_MEM and MPI_FREE_MEM are similar to the bindings for the malloc and free C library calls: a call to MPI_Alloc_mem(..., &base) should be paired with a call to MPI_Free_mem(base) (one less level of indirection). Both arguments are declared to be of same type void* so as to facilitate type casting. The Fortran binding is consistent with the C and C++ bindings: the Fortran MPI_ALLOC_MEM call returns in baseptr the (integer valued) address of the allocated memory. The base argument of MPI_FREE_MEM is a choice argument, which passes (a reference to) the variable stored at that location. ( End of rationale.)

    Advice to implementors.

    If MPI_ALLOC_MEM allocates special memory, then a design similar to the design of C malloc and free functions has to be used, in order to find out the size of a memory segment, when the segment is freed. If no special memory is used, MPI_ALLOC_MEM simply invokes malloc, and MPI_FREE_MEM invokes free.

    A call to MPI_ALLOC_MEM can be used in shared memory systems to allocate memory in a shared memory segment. ( End of advice to implementors.)

    Example

    Example of use of MPI_ALLOC_MEM, in Fortran with pointer support. We assume 4-byte REALs, and assume that pointers are address-sized.

    REAL A 
    POINTER (P, A(100,100))   ! no memory is allocated 
    CALL MPI_ALLOC_MEM(4*100*100, MPI_INFO_NULL, P, IERR) 
    ! memory is allocated 
    ... 
    A(3,5) = 2.71
    ... 
    CALL MPI_FREE_MEM(A, IERR) ! memory is freed 
    
    Since standard Fortran does not support (C-like) pointers, this code is not Fortran 77 or Fortran 90 code. Some compilers (in particular, at the time of writing, g77 and Fortran compilers for Intel) do not support this code.


    Example Same example, in C

    float (*f)[100][100];
    MPI_Alloc_mem(sizeof(float)*100*100, MPI_INFO_NULL, &f);
    ...
    (*f)[5][3] = 2.71;
    ...
    MPI_Free_mem(f);
    





    4.12. Language Interoperability




    4.12.1. Introduction



    It is not uncommon for library developers to use one language to develop an applications library that may be called by an application program written in a different language. MPI currently supports ISO (previously ANSI) C, C++, and Fortran bindings. It should be possible for applications in any of the supported languages to call MPI-related functions in another language.

    Moreover, MPI allows the development of client-server code, with MPI communication used between a parallel client and a parallel server. It should be possible to code the server in one language and the clients in another language. To do so, communications should be possible between applications written in different languages.

    There are several issues that need to be addressed in order to achieve interoperability.

    Initialization
    We need to specify how the MPI environment is initialized for all languages.
    Interlanguage passing of MPI opaque objects
    We need to specify how MPI object handles are passed between languages. We also need to specify what happens when an MPI object is accessed in one language, to retrieve information (e.g., attributes) set in another language.
    Interlanguage communication
    We need to specify how messages sent in one language can be received in another language.

    It is highly desirable that the solution for interlanguage interoperability be extendable to new languages, should MPI bindings be defined for such languages.





    4.12.2. Assumptions



    We assume that conventions exist for programs written in one language to call functions written in another language. These conventions specify how to link routines in different languages into one program, how to call functions in a different language, how to pass arguments between languages, and the correspondence between basic data types in different languages. In general, these conventions will be implementation dependent. Furthermore, not every basic datatype may have a matching type in other languages. For example, C/C++ character strings may not be compatible with Fortran CHARACTER variables. However, we assume that a Fortran INTEGER, as well as a (sequence associated) Fortran array of INTEGERs, can be passed to a C or C++ program. We also assume that Fortran, C, and C++ have address-sized integers. This does not mean that the default-size integers are the same size as default-sized pointers, but only that there is some way to hold (and pass) a C address in a Fortran integer. It is also assumed that INTEGER(KIND=MPI_OFFSET_KIND) can be passed from Fortran to C as MPI_Offset.





    4.12.3. Initialization



    A call to MPI_INIT or MPI_INIT_THREAD, from any language, initializes MPI for execution in all languages.


    Advice to users.

    Certain implementations use the (inout) argc, argv arguments of the C/C++ version of MPI_INIT in order to propagate values for argc and argv to all executing processes. Use of the Fortran version of MPI_INIT to initialize MPI may result in a loss of this ability. ( End of advice to users.)

    The function MPI_INITIALIZED returns the same answer in all languages.

    The function MPI_FINALIZE finalizes the MPI environments for all languages.

    The function MPI_FINALIZED returns the same answer in all languages.

    The function MPI_ABORT kills processes, irrespective of the language used by the caller or by the processes killed.

    The MPI environment is initialized in the same manner for all languages by MPI_INIT. E.g., MPI_COMM_WORLD carries the same information regardless of language: same processes, same environmental attributes, same error handlers.

    Information can be added to info objects in one language and retrieved in another.


    Advice to users.

    The use of several languages in one MPI program may require the use of special options at compile and/or link time. ( End of advice to users.)

    Advice to implementors.

    Implementations may selectively link language specific MPI libraries only to codes that need them, so as not to increase the size of binaries for codes that use only one language. The MPI initialization code need perform initialization for a language only if that language library is loaded. ( End of advice to implementors.)





    4.12.4. Transfer of Handles



    Handles are passed between Fortran and C or C++ by using an explicit C wrapper to convert Fortran handles to C handles. There is no direct access to C or C++ handles in Fortran. Handles are passed between C and C++ using overloaded C++ operators called from C++ code. There is no direct access to C++ objects from C.

    The type definition MPI_Fint is provided in C/C++ for an integer of the size that matches a Fortran INTEGER; usually, MPI_Fint will be equivalent to int.

    The following functions are provided in C to convert from a Fortran communicator handle (which is an integer) to a C communicator handle, and vice versa.

    MPI_Comm MPI_Comm_f2c(MPI_Fint comm)

    If comm is a valid Fortran handle to a communicator, then MPI_Comm_f2c returns a valid C handle to that same communicator; if comm = MPI_COMM_NULL (Fortran value), then MPI_Comm_f2c returns a null C handle; if comm is an invalid Fortran handle, then MPI_Comm_f2c returns an invalid C handle.

    MPI_Fint MPI_Comm_c2f(MPI_Comm comm)

    The function MPI_Comm_c2f translates a C communicator handle into a Fortran handle to the same communicator; it maps a null handle into a null handle and an invalid handle into an invalid handle.

    Similar functions are provided for the other types of opaque objects.

    MPI_Datatype MPI_Type_f2c(MPI_Fint datatype)

    MPI_Fint MPI_Type_c2f(MPI_Datatype datatype)

    MPI_Group MPI_Group_f2c(MPI_Fint group)

    MPI_Fint MPI_Group_c2f(MPI_Group group)

    MPI_Request MPI_Request_f2c(MPI_Fint request)

    MPI_Fint MPI_Request_c2f(MPI_Request request)

    MPI_File MPI_File_f2c(MPI_Fint file)

    MPI_Fint MPI_File_c2f(MPI_File file)

    MPI_Win MPI_Win_f2c(MPI_Fint win)

    MPI_Fint MPI_Win_c2f(MPI_Win win)

    MPI_Op MPI_Op_f2c(MPI_Fint op)

    MPI_Fint MPI_Op_c2f(MPI_Op op)

    MPI_Info MPI_Info_f2c(MPI_Fint info)

    MPI_Fint MPI_Info_c2f(MPI_Info info)


    Example The example below illustrates how the Fortran MPI function MPI_TYPE_COMMIT can be implemented by wrapping the C MPI function MPI_Type_commit with a C wrapper to do handle conversions. In this example a Fortran-C interface is assumed where a Fortran function is all upper case when referred to from C and arguments are passed by addresses.


    ! FORTRAN PROCEDURE 
    SUBROUTINE MPI_TYPE_COMMIT( DATATYPE, IERR) 
    INTEGER DATATYPE, IERR 
    CALL MPI_X_TYPE_COMMIT(DATATYPE, IERR) 
    RETURN 
    END 
    

    /* C wrapper */

    void MPI_X_TYPE_COMMIT(MPI_Fint *f_handle, MPI_Fint *ierr)
    {
        MPI_Datatype datatype;

        datatype = MPI_Type_f2c(*f_handle);
        *ierr = (MPI_Fint)MPI_Type_commit(&datatype);
        *f_handle = MPI_Type_c2f(datatype);
        return;
    }
    
    The same approach can be used for all other MPI functions. The call to MPI_xxx_f2c (resp. MPI_xxx_c2f) can be omitted when the handle is an OUT (resp. IN) argument, rather than INOUT.


    Rationale.

    The design here provides a convenient solution for the prevalent case, where a C wrapper is used to allow Fortran code to call a C library, or C code to call a Fortran library. The use of C wrappers is much more likely than the use of Fortran wrappers, because it is much more likely that a variable of type INTEGER can be passed to C, than a C handle can be passed to Fortran.

    Returning the converted value as a function value rather than through the argument list allows the generation of efficient inlined code when these functions are simple (e.g., the identity). The conversion function in the wrapper does not catch an invalid handle argument. Instead, an invalid handle is passed below to the library function, which, presumably, checks its input arguments. ( End of rationale.)

    C and C++

    The C++ language interface provides the functions listed below for mixed-language interoperability. The token <CLASS> is used below to indicate any valid MPI opaque handle name (e.g., Group), except where noted. For the case where the C++ class corresponding to <CLASS> has derived classes, functions are also provided for converting between the derived classes and the C MPI_<CLASS>.

    The following function allows assignment from a C MPI handle to a C++ MPI handle.

    MPI::<CLASS>& MPI::<CLASS>::operator=(const MPI_<CLASS>& data)

    The constructor below creates a C++ MPI object from a C MPI handle. This allows the automatic promotion of a C MPI handle to a C++ MPI handle.

    MPI::<CLASS>::<CLASS>(const MPI_<CLASS>& data)


    Example In order for a C program to use a C++ library, the C++ library must export a C interface that provides appropriate conversions before invoking the underlying C++ library call. This example shows a C interface function that invokes a C++ library call with a C communicator; the communicator is automatically promoted to a C++ handle when the underlying C++ function is invoked.


    // C++ library function prototype
    void cpp_lib_call(MPI::Comm& cpp_comm);

    // Exported C function prototype
    extern "C" {
        void c_interface(MPI_Comm c_comm);
    }

    void c_interface(MPI_Comm c_comm)
    {
        // the MPI_Comm (c_comm) is automatically promoted to MPI::Comm
        cpp_lib_call(c_comm);
    }
    

    The following function allows conversion from C++ objects to C MPI handles. In this case, the casting operator is overloaded to provide the functionality.

    MPI::<CLASS>::operator MPI_<CLASS>() const


    Example A C library routine is called from a C++ program. The C library routine is prototyped to take an MPI_Comm as an argument.


    // C function prototype
    extern "C" {
        void c_lib_call(MPI_Comm c_comm);
    }

    void cpp_function()
    {
        // Create a C++ communicator, and initialize it with a dup of
        //   MPI::COMM_WORLD
        MPI::Intracomm cpp_comm(MPI::COMM_WORLD.Dup());
        c_lib_call(cpp_comm);
    }
    


    Rationale.

    Providing conversion from C to C++ via constructors and from C++ to C via casting allows the compiler to make automatic conversions. Calling C from C++ becomes trivial, as does the provision of a C or Fortran interface to a C++ library. ( End of rationale.)

    Advice to users.

    Note that the casting and promotion operators return new handles by value. Using these new handles as INOUT parameters will affect the internal MPI object, but will not affect the original handle from which it was cast. ( End of advice to users.)

    It is important to note that all C++ objects and their corresponding C handles can be used interchangeably by an application. For example, an application can cache an attribute on MPI_COMM_WORLD and later retrieve it from MPI::COMM_WORLD.





    4.12.5. Status



    The following two procedures are provided in C to convert from a Fortran status (which is an array of integers) to a C status (which is a structure), and vice versa. The conversion occurs on all the information in status, including that which is hidden. That is, no status information is lost in the conversion.

    int MPI_Status_f2c(MPI_Fint *f_status, MPI_Status *c_status)

    If f_status is a valid Fortran status, but not the Fortran value of MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE, then MPI_Status_f2c returns in c_status a valid C status with the same content. If f_status is the Fortran value of MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE, or if f_status is not a valid Fortran status, then the call is erroneous.

    The C status has the same source, tag and error code values as the Fortran status, and returns the same answers when queried for count, elements, and cancellation. The conversion function may be called with a Fortran status argument that has an undefined error field, in which case the value of the error field in the C status argument is undefined.

    Two global variables of type MPI_Fint*, MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE, are declared in mpi.h. They can be used to test, in C, whether f_status is the Fortran value of MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE, respectively. These are global variables, not C constant expressions, and cannot be used in places where C requires constant expressions. Their value is defined only between the calls to MPI_INIT and MPI_FINALIZE and should not be changed by user code.

    To do the conversion in the other direction, we have the following:

    int MPI_Status_c2f(MPI_Status *c_status, MPI_Fint *f_status)

    This call converts a C status into a Fortran status, and has a behavior similar to MPI_Status_f2c. That is, the value of c_status must not be either MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE.


    Advice to users.

    There is not a separate conversion function for arrays of statuses, since one can simply loop through the array, converting each status. ( End of advice to users.)

    Rationale.

    The handling of MPI_STATUS_IGNORE is required in order to layer libraries with only a C wrapper: if the Fortran call has passed MPI_STATUS_IGNORE, then the C wrapper must handle this correctly. Note that this constant need not have the same value in Fortran and C. If MPI_Status_f2c were to handle MPI_STATUS_IGNORE, then the type of its result would have to be MPI_Status**, which was considered an inferior solution. ( End of rationale.)





    4.12.6. MPI Opaque Objects



    Unless said otherwise, opaque objects are ``the same'' in all languages: they carry the same information, and have the same meaning in every language. The mechanism described in the previous section can be used to pass references to MPI objects from language to language. An object created in one language can be accessed, modified or freed in another language.

    We examine below, in more detail, the issues that arise for each type of MPI object.





    4.12.6.1. Datatypes



    Datatypes encode the same information in all languages. E.g., a datatype accessor like MPI_TYPE_GET_EXTENT will return the same information in all languages. If a datatype defined in one language is used for a communication call in another language, then the message sent will be identical to the message that would be sent from the first language: the same communication buffer is accessed, and the same representation conversion is performed, if needed. All predefined datatypes can be used in datatype constructors in any language. If a datatype is committed, it can be used for communication in any language.

    The function MPI_GET_ADDRESS returns the same value in all languages. Note that we do not require that the constant MPI_BOTTOM have the same value in all languages (see Constants ).


    Example

    ! FORTRAN CODE 
    REAL R(5) 
    INTEGER TYPE, IERR 
    INTEGER (KIND=MPI_ADDRESS_KIND) ADDR 
     
    ! create an absolute datatype for array R 
    CALL MPI_GET_ADDRESS( R, ADDR, IERR) 
    CALL MPI_TYPE_CREATE_STRUCT(1, 5, ADDR, MPI_REAL, TYPE, IERR) 
    CALL C_ROUTINE(TYPE) 
    

    /* C code */

    #include <mpi.h>

    void C_ROUTINE(MPI_Fint *ftype)
    {
        int count = 5;
        int lens[2] = {1, 1};
        MPI_Aint displs[2];
        MPI_Datatype types[2], newtype;

        /* create an absolute datatype for a buffer that consists */
        /* of count, followed by R(5)                             */

        MPI_Get_address(&count, &displs[0]);
        displs[1] = 0;   /* ftype already carries the absolute address of R */
        types[0] = MPI_INT;
        types[1] = MPI_Type_f2c(*ftype);
        MPI_Type_create_struct(2, lens, displs, types, &newtype);
        MPI_Type_commit(&newtype);

        MPI_Send(MPI_BOTTOM, 1, newtype, 1, 0, MPI_COMM_WORLD);
        /* the message sent contains an int count of 5, followed  */
        /* by the 5 REAL entries of the Fortran array R.          */
    }
    


    Advice to implementors.

    The following implementation can be used: MPI addresses, as returned by MPI_GET_ADDRESS, will have the same value in all languages. One obvious choice is that MPI addresses be identical to regular addresses. The address is stored in the datatype, when datatypes with absolute addresses are constructed. When a send or receive operation is performed, then addresses stored in a datatype are interpreted as displacements that are all augmented by a base address. This base address is (the address of) buf, or zero, if buf = MPI_BOTTOM. Thus, if MPI_BOTTOM is zero then a send or receive call with buf = MPI_BOTTOM is implemented exactly as a call with a regular buffer argument: in both cases the base address is buf. On the other hand, if MPI_BOTTOM is not zero, then the implementation has to be slightly different. A test is performed to check whether buf = MPI_BOTTOM. If true, then the base address is zero, otherwise it is buf. In particular, if MPI_BOTTOM does not have the same value in Fortran and C/C++, then an additional test for buf = MPI_BOTTOM is needed in at least one of the languages.

    It may be desirable to use a value other than zero for MPI_BOTTOM even in C/C++, so as to distinguish it from a NULL pointer. If MPI_BOTTOM = c then one can still avoid the test buf = MPI_BOTTOM, by using the displacement from MPI_BOTTOM, i.e., the regular address - c, as the MPI address returned by MPI_GET_ADDRESS and stored in absolute datatypes. ( End of advice to implementors.)
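
    The two schemes in this advice can be compared with a small model. The sketch below is plain C arithmetic on integers, not implementation code: MODEL_BOTTOM and all function names are hypothetical, and addresses are modeled as uintptr_t values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical value of MPI_BOTTOM in this model; nonzero on purpose. */
#define MODEL_BOTTOM ((uintptr_t)1)

/* Scheme 1: MPI addresses equal regular addresses.
   The base is zero when buf is MPI_BOTTOM, otherwise it is buf. */
static uintptr_t effective_address_tested(uintptr_t buf, uintptr_t stored)
{
    uintptr_t base;
    assert(buf != 0);   /* a NULL buffer stays distinguishable from MPI_BOTTOM */
    base = (buf == MODEL_BOTTOM) ? 0 : buf;
    return base + stored;
}

/* Scheme 2: the MPI address is the displacement from MPI_BOTTOM. */
static uintptr_t model_get_address(uintptr_t regular_address)
{
    return regular_address - MODEL_BOTTOM;
}

/* With scheme 2, buf is always the base and no test is needed. */
static uintptr_t effective_address_untested(uintptr_t buf, uintptr_t stored)
{
    assert(buf != 0);
    return buf + stored;
}
```

    In both schemes a datatype built from the absolute address of a variable, and used with buf = MPI_BOTTOM, resolves to the regular address of that variable; only the second scheme avoids the comparison on every call.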





    4.12.6.2. Callback Functions



    MPI calls may associate callback functions with MPI objects: error handlers are associated with communicators and files, attribute copy and delete functions are associated with attribute keys, reduce operations are associated with operation objects, etc. In a multilanguage environment, a function passed in an MPI call in one language may be invoked by an MPI call in another language. MPI implementations must make sure that such invocation will use the calling convention of the language the function is bound to.


    Advice to implementors.

    Callback functions need to have a language tag. This tag is set when the callback function is passed in by the library function (which is presumably different for each language), and is used to generate the right calling sequence when the callback function is invoked. ( End of advice to implementors.)
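
    A language tag of this kind can be sketched as follows. The model is plain C and every name in it is hypothetical; it assumes, for illustration only, that a C callback takes its argument by value and returns a result, while a Fortran callback takes all arguments by reference and reports its result through a trailing IERROR argument.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LANG_C, LANG_FORTRAN } lang_tag;

typedef int  (*c_callback)(int code);                /* by value, result returned */
typedef void (*f_callback)(int *code, int *ierror);  /* by reference, result in IERROR */

typedef struct {
    lang_tag lang;          /* set once, when the callback is registered */
    union {
        c_callback c;
        f_callback f;
    } fn;
} callback_record;

/* Entry point of the C binding: tags the callback as C. */
static callback_record register_from_c(c_callback fn)
{
    callback_record r;
    r.lang = LANG_C;
    r.fn.c = fn;
    return r;
}

/* Entry point of the Fortran binding: tags the callback as Fortran. */
static callback_record register_from_fortran(f_callback fn)
{
    callback_record r;
    r.lang = LANG_FORTRAN;
    r.fn.f = fn;
    return r;
}

/* Invocation site: the tag selects the calling sequence. */
static int invoke(const callback_record *r, int code)
{
    int ierror = 0;
    assert(r != NULL);
    if (r->lang == LANG_C)
        return r->fn.c(code);
    r->fn.f(&code, &ierror);
    return ierror;
}

/* Two sample callbacks, one per convention. */
static int  sample_c(int code) { return code + 1; }
static void sample_f(int *code, int *ierror) { *ierror = *code + 2; }
```

    The caller of invoke does not need to know which language registered the callback; the record carries that information.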





    4.12.6.3. Error Handlers




    Advice to implementors.

    Error handlers have, in C and C++, a ``stdargs'' argument list. It might be useful to provide to the handler information on the language environment where the error occurred. ( End of advice to implementors.)





    4.12.6.4. Reduce Operations




    Advice to users.

    Reduce operations receive as one of their arguments the datatype of the operands. Thus, one can define ``polymorphic'' reduce operations that work for C, C++, and Fortran datatypes. ( End of advice to users.)
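
    A ``polymorphic'' reduce operation might look like the sketch below. It is a plain C model, not MPI code: the argument list imitates the shape of a user-defined reduction function, but model_datatype and its three values are hypothetical stand-ins for datatype handles, and the Fortran REAL type is assumed to correspond to C float.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for datatype handles such as MPI_INT or MPI_REAL. */
typedef enum { MODEL_INT, MODEL_DOUBLE, MODEL_REAL } model_datatype;

/* Elementwise sum that selects the element type from the datatype argument. */
static void poly_sum(void *invec, void *inoutvec, int *len,
                     model_datatype *datatype)
{
    int i, n;
    assert(invec != NULL && inoutvec != NULL && len != NULL && datatype != NULL);
    n = *len;
    switch (*datatype) {
    case MODEL_INT: {
        const int *in = invec;
        int *inout = inoutvec;
        for (i = 0; i < n; i++) inout[i] += in[i];
        break;
    }
    case MODEL_DOUBLE: {
        const double *in = invec;
        double *inout = inoutvec;
        for (i = 0; i < n; i++) inout[i] += in[i];
        break;
    }
    case MODEL_REAL: {          /* Fortran REAL, assumed to be C float */
        const float *in = invec;
        float *inout = inoutvec;
        for (i = 0; i < n; i++) inout[i] += in[i];
        break;
    }
    }
}
```

    The same function body serves operands that originate in C or in Fortran, because the element type is taken from the datatype argument rather than from the language of the caller.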





    4.12.6.5. Addresses



    Some of the datatype accessors and constructors have arguments of type MPI_Aint in C, or MPI::Aint in C++, to hold addresses. The corresponding arguments, in Fortran, have type INTEGER. This causes Fortran and C/C++ to be incompatible in an environment where addresses have 64 bits, but Fortran INTEGERs have 32 bits.

    This is a problem, irrespective of interlanguage issues. Suppose that a Fortran process has an address space of 4 GB or more. What should be the value returned in Fortran by MPI_ADDRESS, for a variable with an address above 2^32? The design described here addresses this issue, while maintaining compatibility with current Fortran codes.

    The constant MPI_ADDRESS_KIND is defined so that, in Fortran 90, INTEGER(KIND=MPI_ADDRESS_KIND) is an address sized integer type (typically, but not necessarily, the size of an INTEGER(KIND=MPI_ADDRESS_KIND) is 4 on 32 bit address machines and 8 on 64 bit address machines). Similarly, the constant MPI_INTEGER_KIND is defined so that INTEGER(KIND=MPI_INTEGER_KIND) is a default size INTEGER.

    There are seven functions that have address arguments: MPI_TYPE_HVECTOR, MPI_TYPE_HINDEXED, MPI_TYPE_STRUCT, MPI_ADDRESS, MPI_TYPE_EXTENT, MPI_TYPE_LB and MPI_TYPE_UB.

    Four new functions are provided to supplement the first four functions in this list. These functions are described in Section New Datatype Manipulation Functions . The remaining three functions are supplemented by the new function MPI_TYPE_GET_EXTENT, described in that same section. The new functions have the same functionality as the old functions in C/C++, or on Fortran systems where default INTEGERs are address sized. In Fortran, they accept arguments of type INTEGER(KIND=MPI_ADDRESS_KIND), wherever arguments of type MPI_Aint are used in C. On Fortran 77 systems that do not support the Fortran 90 KIND notation, and where addresses are 64 bits whereas default INTEGERs are 32 bits, these arguments will be of an appropriate integer type. The old functions will continue to be provided, for backward compatibility. However, users are encouraged to switch to the new functions, in Fortran, so as to avoid problems on systems with an address range > 2^32, and to provide compatibility across languages.
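
    The loss of information that motivates the new functions can be illustrated with a small model. The sketch is plain C and does not call MPI: model_address_old and model_get_address are hypothetical stand-ins for an old-style binding that returns an address through a 32 bit INTEGER and a new-style binding that returns it through an address sized integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old-style binding: the address is returned through a 32 bit INTEGER. */
static void model_address_old(uint64_t location, uint32_t *address)
{
    assert(address != NULL);
    *address = (uint32_t)location;   /* only the low 32 bits survive */
}

/* New-style binding: the address is returned through an address sized integer. */
static void model_get_address(uint64_t location, uint64_t *address)
{
    assert(address != NULL);
    *address = location;             /* nothing is lost */
}
```

    For a location at 2^32 + 16 the first function yields 16, which no longer identifies the variable, while the second returns the full value.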





    4.12.7. Attributes



    Attribute keys can be allocated in one language and freed in another. Similarly, attribute values can be set in one language and accessed in another. To achieve this, attribute keys will be allocated in an integer range that is valid in all languages. The same holds true for system-defined attribute values (such as MPI_TAG_UB, MPI_WTIME_IS_GLOBAL, etc.).

    Attribute keys declared in one language are associated with copy and delete functions in that language (the functions provided by the MPI_{TYPE,COMM,WIN}_KEYVAL_CREATE call). When a communicator is duplicated, for each attribute, the corresponding copy function is called, using the right calling convention for the language of that function; and similarly, for the delete callback function.


    Advice to implementors.

    This requires that attributes be tagged either as ``C,'' ``C++'' or ``Fortran,'' and that the language tag be checked in order to use the right calling convention for the callback function. ( End of advice to implementors.)

    The attribute manipulation functions described in Section 5.7 of the MPI-1 standard define attribute arguments to be of type void* in C, and of type INTEGER in Fortran. On some systems, INTEGERs will have 32 bits, while C/C++ pointers will have 64 bits. This is a problem if communicator attributes are used to move information from a Fortran caller to a C/C++ callee, or vice-versa.

    MPI will store, internally, address sized attributes. If Fortran INTEGERs are smaller, then the Fortran function MPI_ATTR_GET will return the least significant part of the attribute word; the Fortran function MPI_ATTR_PUT will set the least significant part of the attribute word, which will be sign extended to the entire word. (These two functions may be invoked explicitly by user code, or implicitly, by attribute copying callback functions.)
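
    The put and get rules for the smaller Fortran INTEGER can be modeled as below. This is a sketch in plain C under the assumption of a 64 bit attribute word and a 32 bit Fortran INTEGER; attr_word, model_attr_put and model_attr_get are hypothetical names, not MPI functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int64_t attr_word;   /* address sized internal storage (64 bit model) */

/* Fortran-style put: the 32 bit value is sign extended to the whole word. */
static void model_attr_put(attr_word *word, int32_t value)
{
    assert(word != NULL);
    *word = (attr_word)value;
}

/* Fortran-style get: returns the least significant 32 bits of the word. */
static int32_t model_attr_get(const attr_word *word)
{
    uint32_t low;
    int32_t value;
    assert(word != NULL);
    low = (uint32_t)(uint64_t)*word;      /* keep the low 32 bits */
    memcpy(&value, &low, sizeof value);   /* reinterpret them as signed */
    return value;
}
```

    A value put and then retrieved through the 32 bit functions is unchanged, including negative values. A word that was set to a 64 bit quantity by other means is truncated on retrieval.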

    As for addresses, new functions are provided that manipulate Fortran address sized attributes, and have the same functionality as the old functions in C/C++. These functions are described in Section New Attribute Caching Functions . Users are encouraged to use these new functions.

    MPI supports two types of attributes: address-valued (pointer) attributes, and integer valued attributes. C and C++ attribute functions put and get address valued attributes. Fortran attribute functions put and get integer valued attributes. When an integer valued attribute is accessed from C or C++, then MPI_xxx_get_attr will return the address of (a pointer to) the integer valued attribute. When an address valued attribute is accessed from Fortran, then MPI_xxx_GET_ATTR will convert the address into an integer and return the result of this conversion. This conversion is lossless if new style (MPI-2) attribute functions are used, and an integer of kind MPI_ADDRESS_KIND is returned. The conversion may cause truncation if old style (MPI-1) attribute functions are used.


    Example

    A. C to Fortran


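    /* C code */

    static int i = 5;
    void *p;
    p = &i;
    MPI_Comm_set_attr(..., p);
    ....

    ! Fortran code

    INTEGER(KIND=MPI_ADDRESS_KIND) val
    CALL MPI_COMM_GET_ATTR(..., val, ...)
    ! val holds the address of i converted to an integer, not the value 5;
    ! address_of_i stands for that integer
    IF (val .NE. address_of_i) CALL ERROR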
    
    B. Fortran to C


    ! Fortran code

    INTEGER(KIND=MPI_ADDRESS_KIND) val
    val = 55555
    CALL MPI_COMM_SET_ATTR(..., val, ierr)

    /* C code */

    int *p;
    MPI_Comm_get_attr(..., &p, ...);
    if (*p != 55555) error();
    

    The predefined MPI attributes can be integer valued or address valued. Predefined integer valued attributes, such as MPI_TAG_UB, behave as if they were put by a Fortran call. I.e., in Fortran, MPI_COMM_GET_ATTR(MPI_COMM_WORLD, MPI_TAG_UB, val, flag, ierr) will return in val the upper bound for tag value; in C, MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &p, &flag) will return in p a pointer to an int containing the upper bound for tag value.

    Address valued predefined attributes, such as MPI_WIN_BASE, behave as if they were put by a C call. I.e., in Fortran, MPI_WIN_GET_ATTR(win, MPI_WIN_BASE, val, flag, ierror) will return in val the base address of the window, converted to an integer. In C, MPI_Win_get_attr(win, MPI_WIN_BASE, &p, &flag) will return in p a pointer to the window base, cast to (void *).


    Rationale.

    The design is consistent with the behavior specified in MPI-1 for predefined attributes, and ensures that no information is lost when attributes are passed from language to language. ( End of rationale.)

    Advice to implementors.

    Implementations should tag attributes either as address attributes or as integer attributes, according to whether they were set in C or in Fortran. Thus, the right choice can be made when the attribute is retrieved. ( End of advice to implementors.)
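
    Such tagging might be organized as in the following sketch. It is a plain C model with hypothetical names throughout (model_attr, set_from_c, set_from_fortran, get_in_c, get_in_fortran); integer valued attributes are modeled as intptr_t, which assumes the new style, address sized Fortran functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ATTR_ADDRESS, ATTR_INTEGER } attr_kind;

typedef struct {
    attr_kind kind;          /* recorded when the attribute is set */
    union {
        void    *address;    /* set from C or C++ */
        intptr_t integer;    /* set from Fortran */
    } value;
} model_attr;

static void set_from_c(model_attr *a, void *p)
{
    assert(a != NULL);
    a->kind = ATTR_ADDRESS;
    a->value.address = p;
}

static void set_from_fortran(model_attr *a, intptr_t v)
{
    assert(a != NULL);
    a->kind = ATTR_INTEGER;
    a->value.integer = v;
}

/* C get: an address valued attribute yields the stored pointer;
   an integer valued one yields a pointer to the stored integer. */
static void *get_in_c(model_attr *a)
{
    assert(a != NULL);
    if (a->kind == ATTR_ADDRESS)
        return a->value.address;
    return &a->value.integer;
}

/* Fortran get: an integer valued attribute yields the integer;
   an address valued one yields the address converted to an integer. */
static intptr_t get_in_fortran(const model_attr *a)
{
    assert(a != NULL);
    if (a->kind == ATTR_INTEGER)
        return a->value.integer;
    return (intptr_t)a->value.address;
}
```

    In this model, an attribute set from C to the address of a variable is seen from Fortran as that address, not as the value of the variable; an attribute set from Fortran to 55555 is seen from C as a pointer to an integer holding 55555.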





    4.12.8. Extra State



    Extra-state should not be modified by the copy or delete callback functions. (This is obvious from the C binding, but not obvious from the Fortran binding.) However, these functions may update state that is indirectly accessed via extra-state. E.g., in C, extra-state can be a pointer to a data structure that is modified by the copy or delete callback functions; in Fortran, extra-state can be an index into an entry in a COMMON array that is modified by the copy or delete callback functions. In a multithreaded environment, users should be aware that distinct threads may invoke the same callback function concurrently: if this function modifies state associated with extra-state, then mutual exclusion code must be used to protect updates and accesses to the shared state.
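
    The distinction between extra-state and the state reached through it can be sketched in C. The callbacks below are hypothetical models whose argument lists are simplified for illustration; they are not the MPI callback prototypes. The sketch is single threaded: a multithreaded program would, in addition, have to guard the updates of the counters with a lock.

```c
#include <assert.h>
#include <stddef.h>

/* State reached through extra-state; the callbacks may update it. */
typedef struct {
    int copies;
    int deletes;
} model_stats;

/* Model of a copy callback: extra_state itself is only read. */
static int model_copy_fn(void *extra_state, void *attr_in,
                         void **attr_out, int *flag)
{
    model_stats *stats = extra_state;
    assert(stats != NULL && attr_out != NULL && flag != NULL);
    stats->copies++;        /* updates the pointed-to structure */
    *attr_out = attr_in;    /* propagate the attribute to the copy */
    *flag = 1;
    return 0;
}

/* Model of a delete callback. */
static int model_delete_fn(void *extra_state, void *attr)
{
    model_stats *stats = extra_state;
    assert(stats != NULL);
    (void)attr;             /* the attribute value is not needed here */
    stats->deletes++;
    return 0;
}
```

    The pointer passed as extra-state has the same value before and after each call; only the structure it designates changes.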





    4.12.9. Constants



    MPI constants have the same value in all languages, unless specified otherwise. This does not apply to constant handles (MPI_INT, MPI_COMM_WORLD, MPI_ERRORS_RETURN, MPI_SUM, etc.). These handles need to be converted, as explained in Section Transfer of Handles . Constants that specify maximum lengths of strings (see Section for a listing) have a value one less in Fortran than in C/C++, since in C/C++ the length includes the null terminating character. Thus, these constants represent the amount of space which must be allocated to hold the largest possible such string, rather than the maximum number of printable characters the string could contain.


    Advice to users.

    This definition means that it is safe in C/C++ to allocate a buffer to receive a string using a declaration like

            char name[MPI_MAX_OBJECT_NAME];
    
    ( End of advice to users.)
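
    The relation between the C and Fortran values can be illustrated as follows. The sketch is plain C and uses hypothetical constants (MODEL_MAX_NAME_C, MODEL_MAX_NAME_FORTRAN) and a hypothetical function (model_get_name); the value 16 is chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_NAME_C       16                      /* counts the terminating '\0' */
#define MODEL_MAX_NAME_FORTRAN (MODEL_MAX_NAME_C - 1)  /* one less: no terminator */

/* Copies a name into a buffer of MODEL_MAX_NAME_C characters,
   truncating if needed, and returns the number of printable characters. */
static int model_get_name(const char *src, char buf[MODEL_MAX_NAME_C])
{
    size_t n;
    assert(src != NULL && buf != NULL);
    n = strlen(src);
    if (n > (size_t)MODEL_MAX_NAME_FORTRAN)
        n = (size_t)MODEL_MAX_NAME_FORTRAN;
    memcpy(buf, src, n);
    buf[n] = '\0';
    return (int)n;
}
```

    A buffer declared with the C constant always has room for the longest possible string plus its terminator, so no ``+ 1'' is needed in the declaration.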

    Also constant ``addresses,'' i.e., special values for reference arguments that are not handles, such as MPI_BOTTOM or MPI_STATUS_IGNORE may have different values in different languages.


    Rationale.

    The current MPI standard specifies that MPI_BOTTOM can be used in initialization expressions in C, but not in Fortran. Since Fortran does not normally support call by value, MPI_BOTTOM must be, in Fortran, the name of a predefined static variable, e.g., a variable in an MPI declared COMMON block. On the other hand, in C, it is natural to take MPI_BOTTOM = 0. (Caveat: Defining MPI_BOTTOM = 0 implies that a NULL pointer cannot be distinguished from MPI_BOTTOM; it may be that MPI_BOTTOM = 1 is better.) Requiring that the Fortran and C values be the same would complicate the initialization process. ( End of rationale.)





    4.12.10. Interlanguage Communication



    The type matching rules for communications in MPI are not changed: the datatype specification for each item sent should match, in type signature, the datatype specification used to receive this item (unless one of the types is MPI_PACKED). Also, the type of a message item should match the type declaration for the corresponding communication buffer location, unless the type is MPI_BYTE or MPI_PACKED. Interlanguage communication is allowed if it complies with these rules.


    Example In the example below, a Fortran array is sent from Fortran and received in C.


    ! FORTRAN CODE 
    REAL R(5) 
    INTEGER TYPE, IERR, MYRANK 
    INTEGER(KIND=MPI_ADDRESS_KIND) ADDR 
     
    ! create an absolute datatype for array R 
    CALL MPI_GET_ADDRESS( R, ADDR, IERR) 
    CALL MPI_TYPE_CREATE_STRUCT(1, 5, ADDR, MPI_REAL, TYPE, IERR) 
    CALL MPI_TYPE_COMMIT(TYPE, IERR) 
     
    CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYRANK, IERR) 
    IF (MYRANK.EQ.0) THEN 
       CALL MPI_SEND( MPI_BOTTOM, 1, TYPE, 1, 0, MPI_COMM_WORLD, IERR) 
    ELSE 
       CALL C_ROUTINE(TYPE) 
    END IF 
    

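    /* C code */

    #include <mpi.h>

    void C_ROUTINE(MPI_Fint *fhandle)
    {
        MPI_Datatype type;
        MPI_Status status;

        /* convert the Fortran datatype handle to a C handle */
        type = MPI_Type_f2c(*fhandle);

        /* receive into the Fortran array R through its absolute datatype */
        MPI_Recv(MPI_BOTTOM, 1, type, 0, 0, MPI_COMM_WORLD, &status);
    }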
    

    MPI implementors may weaken these type matching rules, and allow messages to be sent with Fortran types and received with C types, and vice versa, when those types match. I.e., if the Fortran type INTEGER is identical to the C type int, then an MPI implementation may allow data to be sent with datatype MPI_INTEGER and be received with datatype MPI_INT. However, such code is not portable.





    4.13. Error Handlers



    MPI-1 attached error handlers only to communicators. MPI-2 attaches them to three types of objects: communicators, windows, and files. The extension was done while maintaining only one type of error handler opaque object. On the other hand, there are, in C and C++, distinct typedefs for user defined error handling callback functions that accept, respectively, communicator, window, and file arguments. In Fortran there are three user routines.

    An error handler object is created by a call to MPI_XXX_CREATE_ERRHANDLER(function, errhandler), where XXX is, respectively, COMM, WIN, or FILE.

    An error handler is attached to a communicator, window, or file by a call to MPI_XXX_SET_ERRHANDLER. The error handler must be either a predefined error handler, or an error handler that was created by a call to MPI_XXX_CREATE_ERRHANDLER, with matching XXX. The predefined error handlers MPI_ERRORS_RETURN and MPI_ERRORS_ARE_FATAL can be attached to communicators, windows, and files. In C++, the predefined error handler MPI::ERRORS_THROW_EXCEPTIONS can also be attached to communicators, windows, and files.

    The error handler currently associated with a communicator, window, or file can be retrieved by a call to MPI_XXX_GET_ERRHANDLER.

    The MPI-1 function MPI_ERRHANDLER_FREE can be used to free an error handler that was created by a call to MPI_XXX_CREATE_ERRHANDLER.


    Advice to implementors.

    A high quality implementation should raise an error when an error handler that was created by a call to MPI_XXX_CREATE_ERRHANDLER is attached to an object of the wrong type with a call to MPI_YYY_SET_ERRHANDLER. To do so, it is necessary to maintain, with each error handler, information on the typedef of the associated user function. ( End of advice to implementors.)
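
    The bookkeeping suggested in this advice can be sketched as follows. The model is plain C with hypothetical names and error codes (model_errhandler, model_object, model_set_errhandler, MODEL_SUCCESS, MODEL_ERR_ARG); it only illustrates the check, not how any implementation stores its handlers.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { KIND_COMM, KIND_WIN, KIND_FILE } object_kind;

typedef struct {
    object_kind created_for;   /* which MPI_XXX_CREATE_ERRHANDLER made it */
    int predefined;            /* nonzero for handlers usable with any object */
} model_errhandler;

typedef struct {
    object_kind kind;
    const model_errhandler *handler;
} model_object;

#define MODEL_SUCCESS 0
#define MODEL_ERR_ARG 1

/* Attaches a handler, rejecting user handlers created for another object type. */
static int model_set_errhandler(model_object *obj, const model_errhandler *eh)
{
    assert(obj != NULL && eh != NULL);
    if (!eh->predefined && eh->created_for != obj->kind)
        return MODEL_ERR_ARG;   /* wrong type: leave the object unchanged */
    obj->handler = eh;
    return MODEL_SUCCESS;
}
```

    Predefined handlers pass the check for every object type, which matches the rule that they can be attached to communicators, windows, and files.
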

    The syntax for these calls is given below.





    4.13.1. Error Handlers for Communicators



    MPI_COMM_CREATE_ERRHANDLER(function, errhandler)
    IN   function    user defined error handling procedure (function)
    OUT  errhandler  MPI error handler (handle)

    int MPI_Comm_create_errhandler(MPI_Comm_errhandler_fn *function, MPI_Errhandler *errhandler)

    MPI_COMM_CREATE_ERRHANDLER(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR

    static MPI::Errhandler MPI::Comm::Create_errhandler(MPI::Comm::Errhandler_fn* function)

    Creates an error handler that can be attached to communicators. This function is identical to MPI_ERRHANDLER_CREATE, whose use is deprecated.

    The user routine should be, in C, a function of type MPI_Comm_errhandler_fn, which is defined as

    typedef void MPI_Comm_errhandler_fn(MPI_Comm *, int *, ...);

    The first argument is the communicator in use, the second is the error code to be returned. This typedef replaces MPI_Handler_function, whose use is deprecated.

    In Fortran, the user routine should be of the form:

    SUBROUTINE COMM_ERRHANDLER_FN(COMM, ERROR_CODE, ... )
    INTEGER COMM, ERROR_CODE

    In C++, the user routine should be of the form:

    typedef void MPI::Comm::Errhandler_fn(MPI::Comm &, int *, ... );

    MPI_COMM_SET_ERRHANDLER(comm, errhandler)
    INOUT  comm        communicator (handle)
    IN     errhandler  new error handler for communicator (handle)

    int MPI_Comm_set_errhandler(MPI_Comm comm, MPI_Errhandler errhandler)

    MPI_COMM_SET_ERRHANDLER(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

    void MPI::Comm::Set_errhandler(const MPI::Errhandler& errhandler)

    Attaches a new error handler to a communicator. The error handler must be either a predefined error handler, or an error handler created by a call to MPI_COMM_CREATE_ERRHANDLER. This call is identical to MPI_ERRHANDLER_SET, whose use is deprecated.

    MPI_COMM_GET_ERRHANDLER(comm, errhandler)
    IN   comm        communicator (handle)
    OUT  errhandler  error handler currently associated with communicator (handle)

    int MPI_Comm_get_errhandler(MPI_Comm comm, MPI_Errhandler *errhandler)

    MPI_COMM_GET_ERRHANDLER(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

    MPI::Errhandler MPI::Comm::Get_errhandler() const

    Retrieves the error handler currently associated with a communicator. This call is identical to MPI_ERRHANDLER_GET, whose use is deprecated.





    4.13.2. Error Handlers for Windows



    MPI_WIN_CREATE_ERRHANDLER(function, errhandler)
    IN   function    user defined error handling procedure (function)
    OUT  errhandler  MPI error handler (handle)

    int MPI_Win_create_errhandler(MPI_Win_errhandler_fn *function, MPI_Errhandler *errhandler)

    MPI_WIN_CREATE_ERRHANDLER(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR

    static MPI::Errhandler MPI::Win::Create_errhandler(MPI::Win::Errhandler_fn* function)

    The user routine should be, in C, a function of type MPI_Win_errhandler_fn, which is defined as

    typedef void MPI_Win_errhandler_fn(MPI_Win *, int *, ...);

    The first argument is the window in use, the second is the error code to be returned.

    In Fortran, the user routine should be of the form:

    SUBROUTINE WIN_ERRHANDLER_FN(WIN, ERROR_CODE, ... )
    INTEGER WIN, ERROR_CODE

    In C++, the user routine should be of the form:

    typedef void MPI::Win::Errhandler_fn(MPI::Win &, int *, ... );

    MPI_WIN_SET_ERRHANDLER(win, errhandler)
    INOUT  win         window (handle)
    IN     errhandler  new error handler for window (handle)

    int MPI_Win_set_errhandler(MPI_Win win, MPI_Errhandler errhandler)

    MPI_WIN_SET_ERRHANDLER(WIN, ERRHANDLER, IERROR)
    INTEGER WIN, ERRHANDLER, IERROR

    void MPI::Win::Set_errhandler(const MPI::Errhandler& errhandler)

    Attaches a new error handler to a window. The error handler must be either a predefined error handler, or an error handler created by a call to MPI_WIN_CREATE_ERRHANDLER.

    MPI_WIN_GET_ERRHANDLER(win, errhandler)
    IN   win         window (handle)
    OUT  errhandler  error handler currently associated with window (handle)

    int MPI_Win_get_errhandler(MPI_Win win, MPI_Errhandler *errhandler)

    MPI_WIN_GET_ERRHANDLER(WIN, ERRHANDLER, IERROR)
    INTEGER WIN, ERRHANDLER, IERROR

    MPI::Errhandler MPI::Win::Get_errhandler() const

    Retrieves the error handler currently associated with a window.





    4.13.3. Error Handlers for Files



    MPI_FILE_CREATE_ERRHANDLER(function, errhandler)
    IN   function    user defined error handling procedure (function)
    OUT  errhandler  MPI error handler (handle)

    int MPI_File_create_errhandler(MPI_File_errhandler_fn *function, MPI_Errhandler *errhandler)

    MPI_FILE_CREATE_ERRHANDLER(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR

    static MPI::Errhandler MPI::File::Create_errhandler(MPI::File::Errhandler_fn* function)

    The user routine should be, in C, a function of type MPI_File_errhandler_fn, which is defined as

    typedef void MPI_File_errhandler_fn(MPI_File *, int *, ...);

    The first argument is the file in use, the second is the error code to be returned.

    In Fortran, the user routine should be of the form:

    SUBROUTINE FILE_ERRHANDLER_FN(FILE, ERROR_CODE, ... )
    INTEGER FILE, ERROR_CODE

    In C++, the user routine should be of the form:

    typedef void MPI::File::Errhandler_fn(MPI::File &, int *, ... );

    MPI_FILE_SET_ERRHANDLER(file, errhandler)
    INOUT  file        file (handle)
    IN     errhandler  new error handler for file (handle)

    int MPI_File_set_errhandler(MPI_File file, MPI_Errhandler errhandler)

    MPI_FILE_SET_ERRHANDLER(FILE, ERRHANDLER, IERROR)
    INTEGER FILE, ERRHANDLER, IERROR

    void MPI::File::Set_errhandler(const MPI::Errhandler& errhandler)

    Attaches a new error handler to a file. The error handler must be either a predefined error handler, or an error handler created by a call to MPI_FILE_CREATE_ERRHANDLER.

    MPI_FILE_GET_ERRHANDLER(file, errhandler)
    IN   file        file (handle)
    OUT  errhandler  error handler currently associated with file (handle)

    int MPI_File_get_errhandler(MPI_File file, MPI_Errhandler *errhandler)

    MPI_FILE_GET_ERRHANDLER(FILE, ERRHANDLER, IERROR)
    INTEGER FILE, ERRHANDLER, IERROR

    MPI::Errhandler MPI::File::Get_errhandler() const

    Retrieves the error handler currently associated with a file.





    4.14. New Datatype Manipulation Functions



    New functions are provided to supplement the type manipulation functions that have address sized integer arguments. The new functions will use, in their Fortran binding, address-sized INTEGERs, thus solving problems currently encountered when the application address range is > 2^32. Also, a new, more convenient type constructor is provided to modify the lower bound and extent of a datatype. The deprecated functions replaced by the new functions here are listed in Section Deprecated Names and Functions .





    4.14.1. Type Constructors with Explicit Addresses



    The four functions below supplement the four corresponding type constructor functions from MPI-1. The new functions are synonymous with the old functions in C/C++, or on Fortran systems where default INTEGERs are address sized. (The old names are not available in C++.) In Fortran, these functions accept arguments of type INTEGER(KIND=MPI_ADDRESS_KIND), wherever arguments of type MPI_Aint are used in C. On Fortran 77 systems that do not support the Fortran 90 KIND notation, and where addresses are 64 bits whereas default INTEGERs are 32 bits, these arguments will be of type INTEGER*8. The old functions will continue to be provided for backward compatibility. However, users are encouraged to switch to the new functions, in both Fortran and C.

    The new functions are listed below. The use of the old functions is deprecated.

    MPI_TYPE_CREATE_HVECTOR( count, blocklength, stride, oldtype, newtype)
    IN   count        number of blocks (nonnegative integer)
    IN   blocklength  number of elements in each block (nonnegative integer)
    IN   stride       number of bytes between start of each block (integer)
    IN   oldtype      old datatype (handle)
    OUT  newtype      new datatype (handle)

    int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

    MPI_TYPE_CREATE_HVECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) STRIDE

    MPI::Datatype MPI::Datatype::Create_hvector(int count, int blocklength, MPI::Aint stride) const
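
    The layout described by count, blocklength and stride can be illustrated with a small calculation. The sketch is plain C and does not call MPI: model_aint is a hypothetical stand-in for MPI_Aint, hvector_offset is a hypothetical helper, and old_extent stands for the extent in bytes of oldtype.

```c
#include <assert.h>
#include <stddef.h>

typedef ptrdiff_t model_aint;   /* stand-in for MPI_Aint */

/* Byte offset, from the start of the buffer, of element `elem`
   of block `block` in a layout with the given stride in bytes. */
static model_aint hvector_offset(int block, int elem, int blocklength,
                                 model_aint stride, model_aint old_extent)
{
    assert(block >= 0 && elem >= 0 && elem < blocklength);
    return (model_aint)block * stride + (model_aint)elem * old_extent;
}
```

    For example, with blocklength = 2, a stride of 32 bytes and an 8 byte oldtype, the second element of the third block starts at byte 72. The stride is a signed byte count, so blocks may also be laid out at decreasing addresses.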

    MPI_TYPE_CREATE_HINDEXED( count, array_of_blocklengths, array_of_displacements, oldtype, newtype)
    IN   count                   number of blocks; also number of entries in array_of_displacements and array_of_blocklengths (integer)
    IN   array_of_blocklengths   number of elements in each block (array of nonnegative integers)
    IN   array_of_displacements  byte displacement of each block (array of integer)
    IN   oldtype                 old datatype (handle)
    OUT  newtype                 new datatype (handle)

    int MPI_Type_create_hindexed(int count, int array_of_blocklengths[], MPI_Aint array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype)

    MPI_TYPE_CREATE_HINDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ARRAY_OF_DISPLACEMENTS(*)

    MPI::Datatype MPI::Datatype::Create_hindexed(int count, const int array_of_blocklengths[], const MPI::Aint array_of_displacements[]) const

    MPI_TYPE_CREATE_STRUCT(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype)
    IN   count                   number of blocks (integer); also number of entries in arrays array_of_types, array_of_displacements and array_of_blocklengths
    IN   array_of_blocklengths   number of elements in each block (array of integer)
    IN   array_of_displacements  byte displacement of each block (array of integer)
    IN   array_of_types          type of elements in each block (array of handles to datatype objects)
    OUT  newtype                 new datatype (handle)

    int MPI_Type_create_struct(int count, int array_of_blocklengths[], MPI_Aint array_of_displacements[], MPI_Datatype array_of_types[], MPI_Datatype *newtype)

    MPI_TYPE_CREATE_STRUCT(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, ARRAY_OF_TYPES, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_TYPES(*), NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ARRAY_OF_DISPLACEMENTS(*)

    static MPI::Datatype MPI::Datatype::Create_struct(int count, const int array_of_blocklengths[], const MPI::Aint array_of_displacements[], const MPI::Datatype array_of_types[])

    MPI_GET_ADDRESS(location, address)
    IN   location  location in caller memory (choice)
    OUT  address   address of location (integer)

    int MPI_Get_address(void *location, MPI_Aint *address)

    MPI_GET_ADDRESS(LOCATION, ADDRESS, IERROR)
    <type> LOCATION(*)
    INTEGER IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ADDRESS

    MPI::Aint MPI::Get_address(void* location)


    Advice to users.

    Current Fortran MPI codes will run unmodified, and will port to any system. However, they may fail if addresses larger than 2^32 - 1 are used in the program. New codes should be written so that they use the new functions. This provides compatibility with C/C++ and avoids errors on 64 bit architectures. However, such newly written codes may need to be (slightly) rewritten to port to old Fortran 77 environments that do not support KIND declarations. ( End of advice to users.)




    4.14.2. Extent and Bounds of Datatypes


    The following function replaces the three functions MPI_TYPE_UB, MPI_TYPE_LB and MPI_TYPE_EXTENT. It also returns address-sized integers in the Fortran binding. The use of MPI_TYPE_UB, MPI_TYPE_LB and MPI_TYPE_EXTENT is deprecated.

    MPI_TYPE_GET_EXTENT(datatype, lb, extent)
    IN   datatype   datatype to get information on (handle)
    OUT  lb         lower bound of datatype (integer)
    OUT  extent     extent of datatype (integer)

    int MPI_Type_get_extent(MPI_Datatype datatype, MPI_Aint *lb, MPI_Aint *extent)

    MPI_TYPE_GET_EXTENT(DATATYPE, LB, EXTENT, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_ADDRESS_KIND) LB, EXTENT

    void MPI::Datatype::Get_extent(MPI::Aint& lb, MPI::Aint& extent) const

    Returns the lower bound and the extent of datatype (as defined by the MPI-1 standard, Section 3.12.2).

    MPI allows one to change the extent of a datatype, using lower bound and upper bound markers (MPI_LB and MPI_UB). This is useful, as it allows one to control the stride of successive datatypes that are replicated by datatype constructors, or are replicated by the count argument in a send or receive call. However, the current mechanism for achieving it is painful; it is also restrictive. MPI_LB and MPI_UB are ``sticky'': once present in a datatype, they cannot be overridden (e.g., the upper bound can be moved up, by adding a new MPI_UB marker, but cannot be moved down below an existing MPI_UB marker). A new type constructor is provided to facilitate these changes. The use of MPI_LB and MPI_UB is deprecated.

    MPI_TYPE_CREATE_RESIZED(oldtype, lb, extent, newtype)
    IN   oldtype   input datatype (handle)
    IN   lb        new lower bound of datatype (integer)
    IN   extent    new extent of datatype (integer)
    OUT  newtype   output datatype (handle)

    int MPI_Type_create_resized(MPI_Datatype oldtype, MPI_Aint lb, MPI_Aint extent, MPI_Datatype *newtype)

    MPI_TYPE_CREATE_RESIZED(OLDTYPE, LB, EXTENT, NEWTYPE, IERROR)
    INTEGER OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) LB, EXTENT

    MPI::Datatype MPI::Datatype::Resized(const MPI::Aint lb, const MPI::Aint extent) const

    Returns in newtype a handle to a new datatype that is identical to oldtype, except that the lower bound of this new datatype is set to be lb, and its upper bound is set to be lb + extent. Any previous lb and ub markers are erased, and a new pair of lower bound and upper bound markers is put in the positions indicated by the lb and extent arguments. This affects the behavior of the datatype when used in communication operations with count > 1, and when used in the construction of new derived datatypes.


    Advice to users.

    It is strongly recommended that users use these two new functions, rather than the old MPI-1 functions to set and access lower bound, upper bound and extent of datatypes. ( End of advice to users.)




    4.14.3. True Extent of Datatypes


    Suppose we implement gather as a spanning tree implemented on top of point-to-point routines. Since the receive buffer is only valid on the root process, one will need to allocate some temporary space for receiving data on intermediate nodes. However, the datatype extent cannot be used as an estimate of the amount of space that needs to be allocated, if the user has modified the extent using the MPI_UB and MPI_LB values. A new function is provided which returns the true extent of the datatype.

    MPI_TYPE_GET_TRUE_EXTENT(datatype, true_lb, true_extent)
    IN   datatype      datatype to get information on (handle)
    OUT  true_lb       true lower bound of datatype (integer)
    OUT  true_extent   true size of datatype (integer)

    int MPI_Type_get_true_extent(MPI_Datatype datatype, MPI_Aint *true_lb, MPI_Aint *true_extent)

    MPI_TYPE_GET_TRUE_EXTENT(DATATYPE, TRUE_LB, TRUE_EXTENT, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_ADDRESS_KIND) TRUE_LB, TRUE_EXTENT

    void MPI::Datatype::Get_true_extent(MPI::Aint& true_lb, MPI::Aint& true_extent) const

    true_lb returns the offset of the lowest unit of store which is addressed by the datatype, i.e., the lower bound of the corresponding typemap, ignoring MPI_LB markers. true_extent returns the true size of the datatype, i.e., the extent of the corresponding typemap, ignoring MPI_LB and MPI_UB markers, and performing no rounding for alignment. If the typemap associated with datatype is

    Typemap = { (type_0, disp_0), ..., (type_{n-1}, disp_{n-1}) }

    then

    true_lb(Typemap) = \min_j { disp_j : type_j \neq lb, ub },

    true_ub(Typemap) = \max_j { disp_j + sizeof(type_j) : type_j \neq lb, ub },

    and

    true_extent(Typemap) = true_ub(Typemap) - true_lb(Typemap).

    (Readers should compare this with the definitions in Section 3.12.3 of the MPI-1 standard, which describes the function MPI_TYPE_EXTENT.)

    The true_extent is the minimum number of bytes of memory necessary to hold a datatype, uncompressed.
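
    As an illustration only, the computation can be modeled in plain C over an explicit list of (displacement, size) pairs. The type typemap_entry and the function typemap_true_extent are hypothetical names for this sketch, not part of MPI. The sketch assumes that lb/ub markers have already been removed from the list and that the true upper bound is the largest displacement plus element size.

```c
#include <stddef.h>

/* One typemap entry: byte displacement and byte size of a basic element.
   Hypothetical model type for illustration; not part of the MPI API. */
typedef struct {
    long disp;   /* displacement in bytes */
    long size;   /* size of the basic type in bytes */
} typemap_entry;

/* Compute true_lb and true_extent over n >= 1 entries.
   Assumes lb/ub markers are already excluded from map[]. */
void typemap_true_extent(const typemap_entry *map, size_t n,
                         long *true_lb, long *true_extent)
{
    long lo = map[0].disp;                 /* smallest displacement so far */
    long hi = map[0].disp + map[0].size;   /* largest end offset so far */
    for (size_t j = 1; j < n; j++) {
        long end = map[j].disp + map[j].size;
        if (map[j].disp < lo)
            lo = map[j].disp;
        if (end > hi)
            hi = end;
    }
    *true_lb = lo;
    *true_extent = hi - lo;
}
```

    For a typemap with a 4-byte element at displacement 8 and an 8-byte element at displacement 16, this sketch gives a true lower bound of 8 and a true extent of 16 bytes, regardless of any extent set through markers or resizing.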




    4.14.4. Subarray Datatype Constructor


    MPI_TYPE_CREATE_SUBARRAY(ndims, array_of_sizes, array_of_subsizes, array_of_starts, order, oldtype, newtype)
    IN   ndims               number of array dimensions (positive integer)
    IN   array_of_sizes      number of elements of type oldtype in each dimension of the full array (array of positive integers)
    IN   array_of_subsizes   number of elements of type oldtype in each dimension of the subarray (array of positive integers)
    IN   array_of_starts     starting coordinates of the subarray in each dimension (array of nonnegative integers)
    IN   order               array storage order flag (state)
    IN   oldtype             array element datatype (handle)
    OUT  newtype             new datatype (handle)

    int MPI_Type_create_subarray(int ndims, int array_of_sizes[], int array_of_subsizes[], int array_of_starts[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

    MPI_TYPE_CREATE_SUBARRAY(NDIMS, ARRAY_OF_SIZES, ARRAY_OF_SUBSIZES, ARRAY_OF_STARTS, ORDER, OLDTYPE, NEWTYPE, IERROR)
    INTEGER NDIMS, ARRAY_OF_SIZES(*), ARRAY_OF_SUBSIZES(*), ARRAY_OF_STARTS(*), ORDER, OLDTYPE, NEWTYPE, IERROR

    MPI::Datatype MPI::Datatype::Create_subarray(int ndims, const int array_of_sizes[], const int array_of_subsizes[], const int array_of_starts[], int order) const

    The subarray type constructor creates an MPI datatype describing an n-dimensional subarray of an n-dimensional array. The subarray may be situated anywhere within the full array, and may be of any nonzero size up to the size of the larger array as long as it is confined within this array. This type constructor facilitates creating filetypes to access arrays distributed in blocks among processes to a single file that contains the global array.

    This type constructor can handle arrays with an arbitrary number of dimensions and works for both C and Fortran ordered matrices (i.e., row-major or column-major). Note that a C program may use Fortran order and a Fortran program may use C order.

    The ndims parameter specifies the number of dimensions in the full data array and gives the number of elements in array_of_sizes, array_of_subsizes, and array_of_starts.

    The number of elements of type oldtype in each dimension of the n-dimensional array and the requested subarray are specified by array_of_sizes and array_of_subsizes, respectively. For any dimension i, it is erroneous to specify array_of_subsizes[i] < 1 or array_of_subsizes[i] > array_of_sizes[i].

    The array_of_starts contains the starting coordinates of each dimension of the subarray. Arrays are assumed to be indexed starting from zero. For any dimension i, it is erroneous to specify array_of_starts[i] < 0 or array_of_starts[i] > ( array_of_sizes[i] - array_of_subsizes[i]).
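
    The constraints on array_of_subsizes and array_of_starts can be checked before the constructor is called. The following is a minimal sketch in plain C; the function name subarray_args_valid is hypothetical and is not an MPI routine.

```c
/* Returns 1 if the arguments satisfy the constraints stated above,
   0 if the call would be erroneous.
   Hypothetical helper for illustration; not an MPI routine. */
int subarray_args_valid(int ndims, const int sizes[], const int subsizes[],
                        const int starts[])
{
    if (ndims < 1)
        return 0;
    for (int i = 0; i < ndims; i++) {
        /* subsize must be at least 1 and no larger than the full size */
        if (subsizes[i] < 1 || subsizes[i] > sizes[i])
            return 0;
        /* zero-based start must keep the subarray inside the full array */
        if (starts[i] < 0 || starts[i] > sizes[i] - subsizes[i])
            return 0;
    }
    return 1;
}
```

    With a 100 x 200 array and a 10 x 20 subarray, a start of (90, 0) passes this check, while (91, 0) does not, because the subarray would extend past the end of the first dimension.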


    Advice to users.

    In a Fortran program with arrays indexed starting from 1, if the starting coordinate of a particular dimension of the subarray is n, then the entry in array_of_starts for that dimension is n-1. ( End of advice to users.)

    The order argument specifies the storage order for the subarray as well as the full array. It must be set to one of the following:

    MPI_ORDER_C
        The ordering used by C arrays (i.e., row-major order)
    MPI_ORDER_FORTRAN
        The ordering used by Fortran arrays (i.e., column-major order)
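
    To make the two storage orders concrete, the sketch below computes the linear position, in units of oldtype elements, of a zero-based coordinate within the full array. It is plain C written for illustration; the constants ORDER_C_SKETCH and ORDER_FORTRAN_SKETCH and the function element_index are hypothetical stand-ins and do not use the actual values of MPI_ORDER_C and MPI_ORDER_FORTRAN.

```c
/* Hypothetical stand-ins for the two storage orders. */
enum { ORDER_C_SKETCH = 0, ORDER_FORTRAN_SKETCH = 1 };

/* Linear index (in oldtype elements) of zero-based coordinates coords[]
   within a full array whose dimensions are given by sizes[]. */
long element_index(int ndims, const int sizes[], const int coords[], int order)
{
    long index = 0;
    if (order == ORDER_C_SKETCH) {
        /* row-major: the last dimension varies fastest */
        for (int i = 0; i < ndims; i++)
            index = index * sizes[i] + coords[i];
    } else {
        /* column-major: the first dimension varies fastest */
        for (int i = ndims - 1; i >= 0; i--)
            index = index * sizes[i] + coords[i];
    }
    return index;
}
```

    For a 3 x 4 array, the element at coordinate (1, 2) is at index 6 in C order and at index 7 in Fortran order. Multiplying the index by the extent of oldtype gives the byte displacement of that element.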

    An ndims-dimensional subarray (newtype) with no extra padding can be defined by the function Subarray() as follows:

    Let the typemap of oldtype have the form: {(type_0, disp_0), (type_1, disp_1), ..., (type_{n-1}, disp_{n-1})} where type_i is a predefined MPI datatype, and let ex be the extent of oldtype. Then we define the Subarray() function recursively using the following three equations. The first equation defines the base step, the second defines the recursion step when order = MPI_ORDER_FORTRAN, and the third defines the recursion step when order = MPI_ORDER_C.

    For an example use of MPI_TYPE_CREATE_SUBARRAY in the context of I/O, see Section Subarray Filetype Constructor.




    4.14.5. Distributed Array Datatype Constructor


    The distributed array type constructor supports HPF-like [12] data distributions. However, unlike in HPF, the storage order may be specified for C arrays as well as for Fortran arrays.


    Advice to users.

    One can create an HPF-like file view using this type constructor as follows. Complementary filetypes are created by having every process of a group call this constructor with identical arguments (with the exception of rank which should be set appropriately). These filetypes (along with identical disp and etype) are then used to define the view (via MPI_FILE_SET_VIEW). Using this view, a collective data access operation (with identical offsets) will yield an HPF-like distribution pattern. ( End of advice to users.)

    MPI_TYPE_CREATE_DARRAY(size, rank, ndims, array_of_gsizes, array_of_distribs, array_of_dargs, array_of_psizes, order, oldtype, newtype)
    IN   size                size of process group (positive integer)
    IN   rank                rank in process group (nonnegative integer)
    IN   ndims               number of array dimensions as well as process grid dimensions (positive integer)
    IN   array_of_gsizes     number of elements of type oldtype in each dimension of global array (array of positive integers)
    IN   array_of_distribs   distribution of array in each dimension (array of state)
    IN   array_of_dargs      distribution argument in each dimension (array of positive integers)
    IN   array_of_psizes     size of process grid in each dimension (array of positive integers)
    IN   order               array storage order flag (state)
    IN   oldtype             old datatype (handle)
    OUT  newtype             new datatype (handle)

    int MPI_Type_create_darray(int size, int rank, int ndims, int array_of_gsizes[], int array_of_distribs[], int array_of_dargs[], int array_of_psizes[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

    MPI_TYPE_CREATE_DARRAY(SIZE, RANK, NDIMS, ARRAY_OF_GSIZES, ARRAY_OF_DISTRIBS, ARRAY_OF_DARGS, ARRAY_OF_PSIZES, ORDER, OLDTYPE, NEWTYPE, IERROR)
    INTEGER SIZE, RANK, NDIMS, ARRAY_OF_GSIZES(*), ARRAY_OF_DISTRIBS(*), ARRAY_OF_DARGS(*), ARRAY_OF_PSIZES(*), ORDER, OLDTYPE, NEWTYPE, IERROR

    MPI::Datatype MPI::Datatype::Create_darray(int size, int rank, int ndims, const int array_of_gsizes[], const int array_of_distribs[], const int array_of_dargs[], const int array_of_psizes[], int order) const

    MPI_TYPE_CREATE_DARRAY can be used to generate the datatypes corresponding to the distribution of an ndims-dimensional array of oldtype elements onto an ndims-dimensional grid of logical processes. Unused dimensions of array_of_psizes should be set to 1. (See Example Distributed Array Datatype Constructor.) For a call to MPI_TYPE_CREATE_DARRAY to be correct, the equation \prod_{i=0}^{ndims-1} array_of_psizes[i] = size must be satisfied. The ordering of processes in the process grid is assumed to be row-major, as in the case of virtual Cartesian process topologies in MPI-1.
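
    As a rough illustration, the consistency condition between array_of_psizes and size can be checked before the constructor is called. The helper below is a plain C sketch; the name psizes_match_size is hypothetical and is not part of MPI.

```c
/* Returns 1 if the product of psizes[0..ndims-1] equals size, else 0.
   Hypothetical helper for illustration; not an MPI routine. */
int psizes_match_size(int ndims, const int psizes[], int size)
{
    long product = 1;
    for (int i = 0; i < ndims; i++)
        product *= psizes[i];
    return product == (long)size;
}
```

    For a process grid of 2 x 1 x 3, the check succeeds only when size is 6.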
    Advice to users.

    For both Fortran and C arrays, the ordering of processes in the process grid is assumed to be row-major. This is consistent with the ordering used in virtual Cartesian process topologies in MPI-1. To create such virtual process topologies, or to find the coordinates of a process in the process grid, etc., users may use the corresponding functions provided in MPI-1. ( End of advice to users.)

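    Each dimension of the array can be distributed in one of three ways:

    MPI_DISTRIBUTE_BLOCK
        Block distribution
    MPI_DISTRIBUTE_CYCLIC
        Cyclic distribution
    MPI_DISTRIBUTE_NONE
        Dimension not distributed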

    The constant MPI_DISTRIBUTE_DFLT_DARG specifies a default distribution argument. The distribution argument for a dimension that is not distributed is ignored. For any dimension i in which the distribution is MPI_DISTRIBUTE_BLOCK, it is erroneous to specify array_of_dargs[i] * array_of_psizes[i] < array_of_gsizes[i].

    For example, the HPF layout ARRAY(CYCLIC(15)) corresponds to MPI_DISTRIBUTE_CYCLIC with a distribution argument of 15, and the HPF layout ARRAY(BLOCK) corresponds to MPI_DISTRIBUTE_BLOCK with a distribution argument of MPI_DISTRIBUTE_DFLT_DARG.

    The order argument is used as in MPI_TYPE_CREATE_SUBARRAY to specify the storage order. Therefore, arrays described by this type constructor may be stored in Fortran (column-major) or C (row-major) order. Valid values for order are MPI_ORDER_FORTRAN and MPI_ORDER_C.

    This routine creates a new MPI datatype with a typemap defined in terms of a function called ``cyclic()'' (see below).

    Without loss of generality, it suffices to define the typemap for the MPI_DISTRIBUTE_CYCLIC case where MPI_DISTRIBUTE_DFLT_DARG is not used.

    MPI_DISTRIBUTE_BLOCK and MPI_DISTRIBUTE_NONE can be reduced to the MPI_DISTRIBUTE_CYCLIC case for dimension i as follows.

    MPI_DISTRIBUTE_BLOCK with array_of_dargs[i] equal to MPI_DISTRIBUTE_DFLT_DARG is equivalent to MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] set to

    (array_of_gsizes[i] + array_of_psizes[i] - 1) / array_of_psizes[i].

    If array_of_dargs[i] is not MPI_DISTRIBUTE_DFLT_DARG, then MPI_DISTRIBUTE_BLOCK and MPI_DISTRIBUTE_CYCLIC are equivalent.

    MPI_DISTRIBUTE_NONE is equivalent to MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] set to array_of_gsizes[i].

    Finally, MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] equal to MPI_DISTRIBUTE_DFLT_DARG is equivalent to MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] set to 1.
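
    The reductions above can be summarized in a small helper that returns the block size of the equivalent cyclic distribution for one dimension. This is a sketch in plain C. The names DIST_BLOCK, DIST_CYCLIC, DIST_NONE, DFLT_DARG and cyclic_darg are hypothetical stand-ins, and the value chosen for DFLT_DARG is an assumption of the sketch, since the values of the MPI constants are implementation-defined.

```c
/* Hypothetical stand-ins for MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_CYCLIC,
   MPI_DISTRIBUTE_NONE and MPI_DISTRIBUTE_DFLT_DARG. */
enum { DIST_BLOCK, DIST_CYCLIC, DIST_NONE };
#define DFLT_DARG (-1)

/* Block size of the equivalent cyclic distribution for one dimension
   with gsize global elements spread over psize processes. */
int cyclic_darg(int distrib, int darg, int gsize, int psize)
{
    switch (distrib) {
    case DIST_NONE:
        /* darg is ignored: a single block holds the whole dimension */
        return gsize;
    case DIST_BLOCK:
        if (darg == DFLT_DARG)
            return (gsize + psize - 1) / psize;   /* ceiling of gsize / psize */
        return darg;
    default:  /* DIST_CYCLIC */
        return (darg == DFLT_DARG) ? 1 : darg;
    }
}
```

    For example, a block distribution of 300 elements over 3 processes with the default argument reduces to a cyclic distribution with a block size of 100.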

    For MPI_ORDER_FORTRAN, an ndims-dimensional distributed array ( newtype) is defined by the following code fragment:


        oldtype[0] = oldtype; 
        for ( i = 0; i < ndims; i++ ) { 
           oldtype[i+1] = cyclic(array_of_dargs[i], 
                                 array_of_gsizes[i], 
                                 r[i],  
                                 array_of_psizes[i], 
                                 oldtype[i]); 
        } 
        newtype = oldtype[ndims]; 
    
    For MPI_ORDER_C, the code is:


        oldtype[0] = oldtype; 
        for ( i = 0; i < ndims; i++ ) { 
           oldtype[i + 1] = cyclic(array_of_dargs[ndims - i - 1],  
                                   array_of_gsizes[ndims - i - 1], 
                                   r[ndims - i - 1],  
                                   array_of_psizes[ndims - i - 1], 
                                   oldtype[i]); 
        } 
        newtype = oldtype[ndims]; 
     
    
    where r[i] is the position of the process (with rank rank) in the process grid at dimension i. The values of r[i] are given by the following code fragment:


            t_rank = rank; 
            t_size = 1; 
            for (i = 0; i < ndims; i++) 
                    t_size *= array_of_psizes[i]; 
            for (i = 0; i < ndims; i++) { 
                t_size = t_size / array_of_psizes[i]; 
                r[i] = t_rank / t_size; 
                t_rank = t_rank % t_size; 
            } 
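
    The fragment above can be wrapped in a function for experimentation. The following plain C sketch does so without changing the arithmetic; the function name grid_coords is hypothetical.

```c
/* Fill r[0..ndims-1] with the row-major grid coordinates of rank
   in a process grid whose dimensions are given by psizes[].
   Hypothetical wrapper for illustration; not an MPI routine. */
void grid_coords(int rank, int ndims, const int psizes[], int r[])
{
    int t_rank = rank;
    int t_size = 1;
    for (int i = 0; i < ndims; i++)
        t_size *= psizes[i];              /* total number of processes */
    for (int i = 0; i < ndims; i++) {
        t_size = t_size / psizes[i];      /* processes per slice of dimension i */
        r[i] = t_rank / t_size;
        t_rank = t_rank % t_size;
    }
}
```

    On a 2 x 1 x 3 process grid, rank 4 maps to position (1, 0, 1) and rank 2 maps to position (0, 0, 2), which matches row-major ordering.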
    
    Let the typemap of oldtype have the form: {(type_0, disp_0), (type_1, disp_1), ..., (type_{n-1}, disp_{n-1})} where type_i is a predefined MPI datatype, and let ex be the extent of oldtype.

    Given the above, the function cyclic() is defined as follows:

    where count is defined by this code fragment:

            nblocks = (gsize + (darg - 1)) / darg; 
            count = nblocks / psize; 
            left_over = nblocks - count * psize; 
            if (r < left_over) 
                count = count + 1; 
    
    Here, nblocks is the number of blocks that must be distributed among the processors. Finally, darg_last is defined by this code fragment:

            if ((num_in_last_cyclic = gsize % (psize * darg)) == 0) {
                 darg_last = darg;
            } else {
                 darg_last = num_in_last_cyclic - darg * r;
                 if (darg_last > darg)
                        darg_last = darg;
                 if (darg_last <= 0)
                        darg_last = darg;
            }
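
    The two fragments for count and darg_last can be exercised as ordinary functions. The sketch below is plain C that follows the same arithmetic; the function names cyclic_count and cyclic_darg_last are hypothetical, and r is assumed to be the position of the process along the dimension, with 0 <= r < psize.

```c
/* Number of blocks of size darg that the process at grid position r
   receives along a dimension of gsize elements. */
int cyclic_count(int gsize, int darg, int psize, int r)
{
    int nblocks = (gsize + (darg - 1)) / darg;   /* ceiling of gsize / darg */
    int count = nblocks / psize;
    int left_over = nblocks - count * psize;
    if (r < left_over)
        count = count + 1;
    return count;
}

/* Size of the last block owned by the process at grid position r. */
int cyclic_darg_last(int gsize, int darg, int psize, int r)
{
    int num_in_last_cyclic = gsize % (psize * darg);
    int darg_last;
    if (num_in_last_cyclic == 0) {
        darg_last = darg;
    } else {
        darg_last = num_in_last_cyclic - darg * r;
        if (darg_last > darg)
            darg_last = darg;
        if (darg_last <= 0)
            darg_last = darg;
    }
    return darg_last;
}
```

    With gsize = 95, darg = 10 and psize = 2, each process holds 5 blocks; the last block has 10 elements at position 0 and 5 elements at position 1, for 50 and 45 elements respectively.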
    


    Example Consider generating the filetypes corresponding to the HPF distribution:

          <oldtype> FILEARRAY(100, 200, 300) 
    !HPF$ PROCESSORS PROCESSES(2, 3) 
    !HPF$ DISTRIBUTE FILEARRAY(CYCLIC(10), *, BLOCK) ONTO PROCESSES 
    
    This can be achieved by the following Fortran code, assuming there will be six processes attached to the run:
        ndims = 3 
        array_of_gsizes(1) = 100 
        array_of_distribs(1) = MPI_DISTRIBUTE_CYCLIC 
        array_of_dargs(1) = 10 
        array_of_gsizes(2) = 200 
        array_of_distribs(2) = MPI_DISTRIBUTE_NONE 
        array_of_dargs(2) = 0 
        array_of_gsizes(3) = 300 
        array_of_distribs(3) = MPI_DISTRIBUTE_BLOCK 
        array_of_dargs(3) = MPI_DISTRIBUTE_DFLT_DARG 
        array_of_psizes(1) = 2 
        array_of_psizes(2) = 1 
        array_of_psizes(3) = 3 
        call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr) 
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr) 
        call MPI_TYPE_CREATE_DARRAY(size, rank, ndims, array_of_gsizes, & 
             array_of_distribs, array_of_dargs, array_of_psizes,        & 
             MPI_ORDER_FORTRAN, oldtype, newtype, ierr) 
    




    4.15. New Predefined Datatypes




    4.15.1. Wide Characters


    A new datatype, MPI_WCHAR, is added, for the purpose of dealing with international character sets such as Unicode.

    MPI_WCHAR is a C type that corresponds to the type wchar_t defined in <stddef.h>. There are no predefined reduction operations for MPI_WCHAR.


    Rationale.

    The fact that MPI_CHAR is associated with the C datatype char, which in turn is often used as a substitute for the ``missing'' byte datatype in C, makes it most natural to define this as a new datatype specifically for multi-byte characters. ( End of rationale.)




    4.15.2. Signed Characters and Reductions


    MPI-1 doesn't allow reductions on signed or unsigned chars. Since this restriction (formally) prevents a C programmer from performing reduction operations on such types (which could be useful, particularly in an image processing application where pixel values are often represented as ``unsigned char''), we now specify a way for such reductions to be carried out.

    MPI-1.2 already has the C types MPI_CHAR and MPI_UNSIGNED_CHAR. However, MPI_CHAR is intended to represent a character, not a small integer, and therefore will be translated between machines with different character representations.

    To overcome this, a new MPI predefined datatype, MPI_SIGNED_CHAR, is added to the predefined datatypes of MPI-2, which corresponds to the ANSI C and ANSI C++ datatype signed char.


    Advice to users.

    The types MPI_CHAR and MPI_CHARACTER are intended for characters, and so will be translated to preserve the printable representation, rather than the bit value, if sent between machines with different character codes. The types MPI_SIGNED_CHAR and MPI_UNSIGNED_CHAR should be used in C if the integer value should be preserved.

    ( End of advice to users.)

    The types MPI_SIGNED_CHAR and MPI_UNSIGNED_CHAR can be used in reduction operations. MPI_CHAR (which represents printable characters) cannot be used in reduction operations. This is an extension to MPI-1.2, since MPI-1.2 does not allow the use of MPI_UNSIGNED_CHAR in reduction operations (and does not have the MPI_SIGNED_CHAR type).

    In a heterogeneous environment, MPI_CHAR and MPI_WCHAR will be translated so as to preserve the printable character, whereas MPI_SIGNED_CHAR and MPI_UNSIGNED_CHAR will be translated so as to preserve the integer value.




    4.15.3. Unsigned long long Type


    A new type, MPI_UNSIGNED_LONG_LONG in C and MPI::UNSIGNED_LONG_LONG in C++, is added as an optional datatype.


    Rationale. The ISO C9X committee has voted to include long long and unsigned long long as standard C types. ( End of rationale.)




    4.16. Canonical MPI_PACK and MPI_UNPACK


    These functions read/write data to/from the buffer in the ``external32'' data format specified in Section External Data Representation: ``external32'', and calculate the size needed for packing. Their first arguments specify the data format, for future extensibility, but for MPI-2 the only valid value of the datarep argument is ``external32.''

    Advice to users.

    These functions could be used, for example, to send typed data in a portable format from one MPI implementation to another. ( End of advice to users.)

    The buffer will contain exactly the packed data, without headers.

    MPI_PACK_EXTERNAL(datarep, inbuf, incount, datatype, outbuf, outsize, position)
    IN     datarep    data representation (string)
    IN     inbuf      input buffer start (choice)
    IN     incount    number of input data items (integer)
    IN     datatype   datatype of each input data item (handle)
    OUT    outbuf     output buffer start (choice)
    IN     outsize    output buffer size, in bytes (integer)
    INOUT  position   current position in buffer, in bytes (integer)

    int MPI_Pack_external(char *datarep, void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, MPI_Aint outsize, MPI_Aint *position)

    MPI_PACK_EXTERNAL(DATAREP, INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, IERROR)
    INTEGER INCOUNT, DATATYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) OUTSIZE, POSITION
    CHARACTER*(*) DATAREP
    <type> INBUF(*), OUTBUF(*)

    void MPI::Datatype::Pack_external(const char* datarep, const void* inbuf, int incount, void* outbuf, MPI::Aint outsize, MPI::Aint& position) const

    MPI_UNPACK_EXTERNAL(datarep, inbuf, insize, position, outbuf, outcount, datatype)
    IN     datarep    data representation (string)
    IN     inbuf      input buffer start (choice)
    IN     insize     input buffer size, in bytes (integer)
    INOUT  position   current position in buffer, in bytes (integer)
    OUT    outbuf     output buffer start (choice)
    IN     outcount   number of output data items (integer)
    IN     datatype   datatype of output data item (handle)

    int MPI_Unpack_external(char *datarep, void *inbuf, MPI_Aint insize, MPI_Aint *position, void *outbuf, int outcount, MPI_Datatype datatype)

    MPI_UNPACK_EXTERNAL(DATAREP, INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, IERROR)
    INTEGER OUTCOUNT, DATATYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) INSIZE, POSITION
    CHARACTER*(*) DATAREP
    <type> INBUF(*), OUTBUF(*)

    void MPI::Datatype::Unpack_external(const char* datarep, const void* inbuf, MPI::Aint insize, MPI::Aint& position, void* outbuf, int outcount) const

    MPI_PACK_EXTERNAL_SIZE(datarep, incount, datatype, size)
    IN   datarep    data representation (string)
    IN   incount    number of input data items (integer)
    IN   datatype   datatype of each input data item (handle)
    OUT  size       output buffer size, in bytes (integer)

    int MPI_Pack_external_size(char *datarep, int incount, MPI_Datatype datatype, MPI_Aint *size)

    MPI_PACK_EXTERNAL_SIZE(DATAREP, INCOUNT, DATATYPE, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) SIZE
    CHARACTER*(*) DATAREP

    MPI::Aint MPI::Datatype::Pack_external_size(const char* datarep, int incount) const
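
    To illustrate the buffer layout and the role of the position argument, the following plain C sketch models packing 32-bit integers into a headerless buffer. It is not an implementation of MPI_PACK_EXTERNAL. It assumes, in line with the ``external32'' representation, that a 32-bit integer occupies 4 bytes in big-endian order; the function name pack_int32_be and its error convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of packing incount 32-bit integers into outbuf, starting at
   *position, in big-endian byte order and with no header.
   Returns 0 on success, or -1 if the data would not fit in outsize bytes.
   Hypothetical sketch; not an MPI routine. incount must be nonnegative. */
int pack_int32_be(const int32_t *inbuf, int incount,
                  unsigned char *outbuf, size_t outsize, size_t *position)
{
    size_t needed = (size_t)incount * 4;
    if (*position > outsize || needed > outsize - *position)
        return -1;
    for (int i = 0; i < incount; i++) {
        uint32_t v = (uint32_t)inbuf[i];
        outbuf[(*position)++] = (unsigned char)(v >> 24);  /* most significant byte first */
        outbuf[(*position)++] = (unsigned char)(v >> 16);
        outbuf[(*position)++] = (unsigned char)(v >> 8);
        outbuf[(*position)++] = (unsigned char)v;
    }
    return 0;
}
```

    As with the position argument of the routines above, the caller initializes position to the offset where packing should begin, and successive calls append to the same buffer. Packing two integers into an empty 8-byte buffer leaves position at 8, so a further call reports that the data does not fit.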




    4.17. Functions and Macros


    An implementation is allowed to implement MPI_WTIME, MPI_WTICK, PMPI_WTIME, PMPI_WTICK, and the handle-conversion functions (MPI_Group_f2c, etc.) in Section Transfer of Handles, and no others, as macros in C.


    Advice to implementors.

    Implementors should document which routines are implemented as macros. ( End of advice to implementors.)

    Advice to users.

    If these routines are implemented as macros, they will not work with the MPI profiling interface. ( End of advice to users.)




    4.18. Profiling Interface


    The profiling interface, as described in Chapter 8 of MPI-1.1, must be supported for all MPI-2 functions, except those allowed as macros (see Section Functions and Macros). This requires, in C and Fortran, an alternate entry point name, with the prefix PMPI_ for each MPI function. The profiling interface in C++ is described in Section Profiling.

    For routines implemented as macros, it is still required that the PMPI_ version be supplied and work as expected, but it is not possible to replace at link time the MPI_ version with a user-defined version. This is a change from MPI-1.2.




    5. Process Creation and Management




    5.1. Introduction


    MPI-1 provides an interface that allows processes in a parallel program to communicate with one another. MPI-1 specifies neither how the processes are created, nor how they establish communication. Moreover, an MPI-1 application is static; that is, no processes can be added to or deleted from an application after it has been started.

    MPI users have asked that the MPI-1 model be extended to allow process creation and management after an MPI application has been started. A major impetus comes from the PVM [7] research effort, which has provided a wealth of experience with process management and resource control that illustrates their benefits and potential pitfalls.

    The MPI Forum decided not to address resource control in MPI-2 because it was not able to design a portable interface that would be appropriate for the broad spectrum of existing and potential resource and process controllers. Resource control can encompass a wide range of abilities, including adding and deleting nodes from a virtual parallel machine, reserving and scheduling resources, managing compute partitions of an MPP, and returning information about available resources. MPI-2 assumes that resource control is provided externally --- probably by computer vendors, in the case of tightly coupled systems, or by a third party software package when the environment is a cluster of workstations.

    The reasons for adding process management to MPI are both technical and practical. Important classes of message passing applications require process control. These include task farms, serial applications with parallel modules, and problems that require a run-time assessment of the number and type of processes that should be started. On the practical side, users of workstation clusters who are migrating from PVM to MPI may be accustomed to using PVM's capabilities for process and resource management. The lack of these features is a practical stumbling block to migration.

    While process management is essential, adding it to MPI should not compromise the portability or performance of MPI applications. In particular:


    The MPI-2 process management model addresses these issues in two ways. First, MPI remains primarily a communication library. It does not manage the parallel environment in which a parallel program executes, though it provides a minimal interface between an application and external resource and process managers.

    Second, MPI-2 does not change the concept of communicator. Once a communicator is built, it behaves as specified in MPI-1. A communicator is never changed once created, and it is always created using deterministic collective operations.




    5.2. The MPI-2 Process Model


    The MPI-2 process model allows for the creation and cooperative termination of processes after an MPI application has started. It provides a mechanism to establish communication between the newly created processes and the existing MPI application. It also provides a mechanism to establish communication between two existing MPI applications, even when one did not ``start'' the other.




    5.2.1. Starting Processes


    MPI applications may start new processes through an interface to an external process manager, which can range from a parallel operating system (CMOST) to layered software (POE) to an rsh command (p4).

    MPI_COMM_SPAWN starts MPI processes and establishes communication with them, returning an intercommunicator. MPI_COMM_SPAWN_MULTIPLE starts several different binaries (or the same binary with different arguments), placing them in the same MPI_COMM_WORLD and returning an intercommunicator.

    MPI uses the existing group abstraction to represent processes. A process is identified by a (group, rank) pair.




    5.2.2. The Runtime Environment


    The MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE routines provide an interface between MPI and the runtime environment of an MPI application. The difficulty is that there is an enormous range of runtime environments and application requirements, and MPI must not be tailored to any particular one. Examples of such environments are:


    MPI implicitly assumes the existence of an environment in which an application runs. It does not provide ``operating system'' services, such as a general ability to query what processes are running, to kill arbitrary processes, or to find out properties of the runtime environment (how many processors, how much memory, etc.).

    Complex interaction of an MPI application with its runtime environment should be done through an environment-specific API. An example of such an API would be the PVM task and machine management routines (pvm_addhosts, pvm_config, pvm_tasks, etc.), possibly modified to return an MPI (group, rank) pair when possible. A Condor or PBS API would be another possibility.

    At some low level, obviously, MPI must be able to interact with the runtime system, but the interaction is not visible at the application level and the details of the interaction are not specified by the MPI standard.

    In many cases, it is impossible to keep environment-specific information out of the MPI interface without seriously compromising MPI functionality. To permit applications to take advantage of environment-specific functionality, many MPI routines take an info argument that allows an application to specify environment-specific information. There is a tradeoff between functionality and portability: applications that make use of info are not portable.

    MPI does not require the existence of an underlying ``virtual machine'' model, in which there is a consistent global view of an MPI application and an implicit ``operating system'' managing resources and processes. For instance, processes spawned by one task may not be visible to another; additional hosts added to the runtime environment by one process may not be visible in another process; tasks spawned by different processes may not be automatically distributed over available resources.

    Interaction between MPI and the runtime environment is limited to the following areas:






    5.3. Process Manager Interface




    5.3.1. Processes in MPI



    A process is represented in MPI by a (group, rank) pair. A (group, rank) pair specifies a unique process but a process does not determine a unique (group, rank) pair, since a process may belong to several groups.





    5.3.2. Starting Processes and Establishing Communication



    The following routine starts a number of MPI processes and establishes communication with them, returning an intercommunicator.


    Advice to users.

    It is possible in MPI to start a static SPMD or MPMD application by starting first one process and having that process start its siblings with MPI_COMM_SPAWN. This practice is discouraged primarily for reasons of performance. If possible, it is preferable to start all processes at once, as a single MPI-1 application. ( End of advice to users.)

    MPI_COMM_SPAWN(command, argv, maxprocs, info, root, comm, intercomm, array_of_errcodes)

      IN   command            name of program to be spawned (string, significant only at root)
      IN   argv               arguments to command (array of strings, significant only at root)
      IN   maxprocs           maximum number of processes to start (integer, significant only at root)
      IN   info               a set of key-value pairs telling the runtime system where and how to start the processes (handle, significant only at root)
      IN   root               rank of process in which previous arguments are examined (integer)
      IN   comm               intracommunicator containing group of spawning processes (handle)
      OUT  intercomm          intercommunicator between original group and the newly spawned group (handle)
      OUT  array_of_errcodes  one code per process (array of integer)

    int MPI_Comm_spawn(char *command, char *argv[], int maxprocs, MPI_Info info, int root, MPI_Comm comm, MPI_Comm *intercomm, int array_of_errcodes[])

    MPI_COMM_SPAWN(COMMAND, ARGV, MAXPROCS, INFO, ROOT, COMM, INTERCOMM, ARRAY_OF_ERRCODES, IERROR)
        CHARACTER*(*) COMMAND, ARGV(*)
        INTEGER INFO, MAXPROCS, ROOT, COMM, INTERCOMM, ARRAY_OF_ERRCODES(*), IERROR

    MPI::Intercomm MPI::Intracomm::Spawn(const char* command, const char* argv[], int maxprocs, const MPI::Info& info, int root, int array_of_errcodes[]) const

    MPI::Intercomm MPI::Intracomm::Spawn(const char* command, const char* argv[], int maxprocs, const MPI::Info& info, int root) const

    MPI_COMM_SPAWN tries to start maxprocs identical copies of the MPI program specified by command, establishing communication with them and returning an intercommunicator. The spawned processes are referred to as children. The children have their own MPI_COMM_WORLD, which is separate from that of the parents. MPI_COMM_SPAWN is collective over comm, and also may not return until MPI_INIT has been called in the children. Similarly, MPI_INIT in the children may not return until all parents have called MPI_COMM_SPAWN. In this sense, MPI_COMM_SPAWN in the parents and MPI_INIT in the children form a collective operation over the union of parent and child processes. The intercommunicator returned by MPI_COMM_SPAWN contains the parent processes in the local group and the child processes in the remote group. The ordering of processes in the local and remote groups is the same as the ordering of the group of the comm in the parents and of MPI_COMM_WORLD of the children, respectively. This intercommunicator can be obtained in the children through the function MPI_COMM_GET_PARENT.


    Advice to users.

    An implementation may automatically establish communication before MPI_INIT is called by the children. Thus, completion of MPI_COMM_SPAWN in the parent does not necessarily mean that MPI_INIT has been called in the children (although the returned intercommunicator can be used immediately). ( End of advice to users.)

    The command argument.

    The command argument is a string containing the name of a program to be spawned. The string is null-terminated in C. In Fortran, leading and trailing spaces are stripped. MPI does not specify how to find the executable or how the working directory is determined. These rules are implementation-dependent and should be appropriate for the runtime environment.


    Advice to implementors.

    The implementation should use a natural rule for finding executables and determining working directories. For instance, a homogeneous system with a global file system might look first in the working directory of the spawning process, or might search the directories in a PATH environment variable as do Unix shells. An implementation on top of PVM would use PVM's rules for finding executables (usually in $HOME/pvm3/bin/$PVM_ARCH). An MPI implementation running under POE on an IBM SP would use POE's method of finding executables. An implementation should document its rules for finding executables and determining working directories, and a high-quality implementation should give the user some control over these rules. ( End of advice to implementors.)
    If the program named in command does not call MPI_INIT, but instead forks a process that calls MPI_INIT, the results are undefined. Implementations may allow this case to work but are not required to.


    Advice to users.

    MPI does not say what happens if the program you start is a shell script and that shell script starts a program that calls MPI_INIT. Though some implementations may allow you to do this, they may also have restrictions, such as requiring that arguments supplied to the shell script be supplied to the program, or requiring that certain parts of the environment not be changed. ( End of advice to users.)

    The argv argument.

    argv is an array of strings containing arguments that are passed to the program. The first element of argv is the first argument passed to command, not, as is conventional in some contexts, the command itself. The argument list is terminated by NULL in C and C++ and an empty string in Fortran. In Fortran, leading and trailing spaces are always stripped, so that a string consisting of all spaces is considered an empty string. The constant MPI_ARGV_NULL may be used in C, C++ and Fortran to indicate an empty argument list. In C and C++, this constant is the same as NULL.
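
    The Fortran stripping rule can be sketched in plain C, roughly as a Fortran binding might apply it to one fixed-width, blank-padded CHARACTER element before handing it on. The function name strip_fortran_string and its interface are hypothetical and not part of MPI; an actual implementation may do this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of MPI: copy a blank-padded Fortran
 * CHARACTER value of length len into out as a null-terminated C string
 * with leading and trailing spaces removed. out must hold len + 1 bytes.
 * Returns the stripped length; 0 means the value was empty or all
 * spaces, which is how the end of a Fortran argv list is marked. */
size_t strip_fortran_string(const char *src, size_t len, char *out)
{
    size_t first = 0, last = len;

    assert(src != NULL && out != NULL);
    while (first < len && src[first] == ' ')
        first++;                          /* skip leading blanks */
    while (last > first && src[last - 1] == ' ')
        last--;                           /* drop trailing blanks */
    memcpy(out, src + first, last - first);
    out[last - first] = '\0';
    return last - first;
}
```

    A binding built this way would walk the Fortran argv array element by element and stop at the first element for which the function returns 0.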


    Example. Examples of argv in C and Fortran

    To run the program ``ocean'' with arguments ``-gridfile'' and ``ocean1.grd'' in C:

           char command[] = "ocean"; 
           char *argv[] = {"-gridfile", "ocean1.grd", NULL}; 
           MPI_Comm_spawn(command, argv, ...); 
    
    or, if not everything is known at compile time:
           char *command; 
           char **argv; 
           command = "ocean"; 
           argv=(char **)malloc(3 * sizeof(char *)); 
           argv[0] = "-gridfile"; 
           argv[1] = "ocean1.grd"; 
           argv[2] = NULL; 
           MPI_Comm_spawn(command, argv, ...); 
    
    In Fortran:
           CHARACTER*25 command, argv(3) 
           command = ' ocean ' 
           argv(1) = ' -gridfile ' 
           argv(2) = ' ocean1.grd' 
           argv(3) = ' ' 
           call MPI_COMM_SPAWN(command, argv, ...) 
    

    Arguments are supplied to the program if this is allowed by the operating system. In C, the MPI_COMM_SPAWN argument argv differs from the argv argument of main in two respects. First, it is shifted by one element. Specifically, argv[0] of main is provided by the implementation and conventionally contains the name of the program (given by command). argv[1] of main corresponds to argv[0] in MPI_COMM_SPAWN, argv[2] of main to argv[1] of MPI_COMM_SPAWN, etc. Second, argv of MPI_COMM_SPAWN must be null-terminated, so that its length can be determined. Passing an argv of MPI_ARGV_NULL to MPI_COMM_SPAWN results in main receiving argc of 1 and an argv whose element 0 is (conventionally) the name of the program.
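
    The one-element shift between the two argv arrays can be made concrete with a small plain-C sketch. It builds the argv that main in a child would conventionally receive from the command and argv given to MPI_COMM_SPAWN. The helper build_main_argv is hypothetical and not part of MPI; how an implementation actually delivers arguments is not specified by the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of MPI: build the argv that main() in a
 * spawned process would conventionally see, given the command and argv
 * passed to MPI_COMM_SPAWN. spawn_argv may be NULL (the value of
 * MPI_ARGV_NULL in C); otherwise it must be NULL-terminated. Stores the
 * resulting argc in *argc_out and returns a NULL-terminated array that
 * the caller releases with free(). Strings are shared, not copied. */
char **build_main_argv(char *command, char **spawn_argv, int *argc_out)
{
    int n = 0, i;
    char **main_argv;

    assert(command != NULL && argc_out != NULL);
    if (spawn_argv != NULL)
        while (spawn_argv[n] != NULL)
            n++;                              /* count user arguments */

    main_argv = malloc((size_t)(n + 2) * sizeof *main_argv);
    if (main_argv == NULL)
        return NULL;

    main_argv[0] = command;                   /* conventionally the program name */
    for (i = 0; i < n; i++)
        main_argv[i + 1] = spawn_argv[i];     /* shifted by one element */
    main_argv[n + 1] = NULL;
    *argc_out = n + 1;
    return main_argv;
}
```

    For the ``ocean'' example, this yields argc equal to 3 with the program name in element 0; with a NULL spawn argv it yields argc equal to 1.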

    If a Fortran implementation supplies routines that allow a program to obtain its arguments, the arguments may be available through that mechanism. In C, if the operating system does not support arguments appearing in argv of main(), the MPI implementation may add the arguments to the argv that is passed to MPI_INIT.

    The maxprocs argument.

    MPI tries to spawn maxprocs processes. If it is unable to spawn maxprocs processes, it raises an error of class MPI_ERR_SPAWN.

    An implementation may allow the info argument to change the default behavior, such that if the implementation is unable to spawn all maxprocs processes, it may spawn a smaller number of processes instead of raising an error. In principle, the info argument may specify an arbitrary set $\{m_i : 0 \le m_i \le \text{maxprocs}\}$ of allowed values for the number of processes spawned. The set $\{m_i\}$ does not necessarily include the value maxprocs. If an implementation is able to spawn one of these allowed numbers of processes, MPI_COMM_SPAWN returns successfully and the number of spawned processes, m, is given by the size of the remote group of intercomm. If m is less than maxprocs, reasons why the other processes were not spawned are given in array_of_errcodes as described below. If it is not possible to spawn one of the allowed numbers of processes, MPI_COMM_SPAWN raises an error of class MPI_ERR_SPAWN.

    A spawn call with the default behavior is called hard. A spawn call for which fewer than maxprocs processes may be returned is called soft. See Section Reserved Keys for more information on the soft key for info.


    Advice to users.

    By default, requests are hard and MPI errors are fatal. This means that by default there will be a fatal error if MPI cannot spawn all the requested processes. If you want the behavior ``spawn as many processes as possible, up to N,'' you should do a soft spawn, where the set of allowed values $\{m_i\}$ is $\{0, \ldots, N\}$. However, this is not completely portable, as implementations are not required to support soft spawning. ( End of advice to users.)

    The info argument.

    The info argument to all of the routines in this chapter is an opaque handle of type MPI_Info in C, MPI::Info in C++ and INTEGER in Fortran. It is a container for a number of user-specified (key, value) pairs. key and value are strings (null-terminated char* in C, character*(*) in Fortran). Routines to create and manipulate the info argument are described in Section The Info Object.

    For the SPAWN calls, info provides additional (and possibly implementation-dependent) instructions to MPI and the runtime system on how to start processes. An application may pass MPI_INFO_NULL in C or Fortran. Portable programs not requiring detailed control over process locations should use MPI_INFO_NULL.

    MPI does not specify the content of the info argument, except to reserve a number of special key values (see Section Reserved Keys). The info argument is quite flexible and could even be used, for example, to specify the executable and its command-line arguments. In this case the command argument to MPI_COMM_SPAWN could be empty. The ability to do this follows from the fact that MPI does not specify how an executable is found, and the info argument can tell the runtime system where to ``find'' the executable ``'' (empty string). Of course a program that does this will not be portable across MPI implementations.

    The root argument.

    All arguments before the root argument are examined only on the process whose rank in comm is equal to root. The value of these arguments on other processes is ignored.

    The array_of_errcodes argument.

    The array_of_errcodes is an array of length maxprocs in which MPI reports the status of each process that MPI was requested to start. If all maxprocs processes were spawned, array_of_errcodes is filled in with the value MPI_SUCCESS. If only m ($0 \le m < \text{maxprocs}$) processes are spawned, m of the entries will contain MPI_SUCCESS and the rest will contain an implementation-specific error code indicating the reason MPI could not start the process. MPI does not specify which entries correspond to failed processes. An implementation may, for instance, fill in error codes in one-to-one correspondence with a detailed specification in the info argument. These error codes all belong to the error class MPI_ERR_SPAWN if there was no error in the argument list. In C or Fortran, an application may pass MPI_ERRCODES_IGNORE if it is not interested in the error codes. In C++ this constant does not exist, and the array_of_errcodes argument may be omitted from the argument list.

    Advice to implementors.

    MPI_ERRCODES_IGNORE in Fortran is a special type of constant, like MPI_BOTTOM. See the discussion in Section Named Constants. ( End of advice to implementors.)

    MPI_COMM_GET_PARENT(parent)

      OUT  parent             the parent communicator (handle)

    int MPI_Comm_get_parent(MPI_Comm *parent)

    MPI_COMM_GET_PARENT(PARENT, IERROR)
        INTEGER PARENT, IERROR

    static MPI::Intercomm MPI::Comm::Get_parent()

    If a process was started with MPI_COMM_SPAWN or MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_GET_PARENT returns the ``parent'' intercommunicator of the current process. This parent intercommunicator is created implicitly inside of MPI_INIT and is the same intercommunicator returned by SPAWN in the parents.

    If the process was not spawned, MPI_COMM_GET_PARENT returns MPI_COMM_NULL.

    After the parent communicator is freed or disconnected, MPI_COMM_GET_PARENT returns MPI_COMM_NULL.


    Advice to users.

    MPI_COMM_GET_PARENT returns a handle to a single intercommunicator. Calling MPI_COMM_GET_PARENT a second time returns a handle to the same intercommunicator. Freeing the handle with MPI_COMM_DISCONNECT or MPI_COMM_FREE will cause other references to the intercommunicator to become invalid (dangling). Note that calling MPI_COMM_FREE on the parent communicator is not useful. ( End of advice to users.)


    Rationale.

    The desire of the Forum was to create a constant MPI_COMM_PARENT similar to MPI_COMM_WORLD. Unfortunately such a constant cannot be used (syntactically) as an argument to MPI_COMM_DISCONNECT, which is explicitly allowed. ( End of rationale.)





    5.3.3. Starting Multiple Executables and Establishing Communication



    While MPI_COMM_SPAWN is sufficient for most cases, it does not allow the spawning of multiple binaries, or of the same binary with multiple sets of arguments. The following routine spawns multiple binaries or the same binary with multiple sets of arguments, establishing communication with them and placing them in the same MPI_COMM_WORLD.

    MPI_COMM_SPAWN_MULTIPLE(count, array_of_commands, array_of_argv, array_of_maxprocs, array_of_info, root, comm, intercomm, array_of_errcodes)

      IN   count              number of commands (positive integer, significant to MPI only at root; see advice to users)
      IN   array_of_commands  programs to be executed (array of strings, significant only at root)
      IN   array_of_argv      arguments for commands (array of array of strings, significant only at root)
      IN   array_of_maxprocs  maximum number of processes to start for each command (array of integer, significant only at root)
      IN   array_of_info      info objects telling the runtime system where and how to start processes (array of handles, significant only at root)
      IN   root               rank of process in which previous arguments are examined (integer)
      IN   comm               intracommunicator containing group of spawning processes (handle)
      OUT  intercomm          intercommunicator between original group and newly spawned group (handle)
      OUT  array_of_errcodes  one error code per process (array of integer)

    int MPI_Comm_spawn_multiple(int count, char *array_of_commands[], char **array_of_argv[], int array_of_maxprocs[], MPI_Info array_of_info[], int root, MPI_Comm comm, MPI_Comm *intercomm, int array_of_errcodes[])

    MPI_COMM_SPAWN_MULTIPLE(COUNT, ARRAY_OF_COMMANDS, ARRAY_OF_ARGV, ARRAY_OF_MAXPROCS, ARRAY_OF_INFO, ROOT, COMM, INTERCOMM, ARRAY_OF_ERRCODES, IERROR)
        INTEGER COUNT, ARRAY_OF_INFO(*), ARRAY_OF_MAXPROCS(*), ROOT, COMM, INTERCOMM, ARRAY_OF_ERRCODES(*), IERROR
        CHARACTER*(*) ARRAY_OF_COMMANDS(*), ARRAY_OF_ARGV(COUNT, *)

    MPI::Intercomm MPI::Intracomm::Spawn_multiple(int count, const char* array_of_commands[], const char** array_of_argv[], const int array_of_maxprocs[], const MPI::Info array_of_info[], int root, int array_of_errcodes[])

    MPI::Intercomm MPI::Intracomm::Spawn_multiple(int count, const char* array_of_commands[], const char** array_of_argv[], const int array_of_maxprocs[], const MPI::Info array_of_info[], int root)

    MPI_COMM_SPAWN_MULTIPLE is identical to MPI_COMM_SPAWN except that there are multiple executable specifications. The first argument, count, gives the number of specifications. Each of the next four arguments is simply an array of the corresponding argument in MPI_COMM_SPAWN. For the Fortran version of array_of_argv, the element array_of_argv(i,j) is the j-th argument to command number i.

    Rationale.

    This may seem backwards to Fortran programmers who are familiar with Fortran's column-major ordering. However, it is necessary to do it this way to allow MPI_COMM_SPAWN to sort out arguments. Note that the leading dimension of array_of_argv must be the same as count. ( End of rationale.)
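
    The role of the leading dimension can be illustrated with a short plain-C sketch of where element array_of_argv(i,j) sits in memory. It assumes the usual contiguous column-major layout of a CHARACTER*(len) array, which is typical but ultimately compiler-dependent; the function name argv_element_offset is hypothetical and not part of MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI: byte offset of the Fortran
 * element array_of_argv(i, j) inside a CHARACTER*(len) array declared
 * with leading dimension count, assuming the usual contiguous
 * column-major layout. i and j are 1-based, as in Fortran. */
size_t argv_element_offset(size_t count, size_t len, size_t i, size_t j)
{
    assert(i >= 1 && i <= count && j >= 1);
    /* Column j holds argument j of every command; rows are commands. */
    return ((j - 1) * count + (i - 1)) * len;
}
```

    Under this layout, consecutive arguments of one command are count*len characters apart, so a callee that did not know count could not step from one argument of a command to the next.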

    Advice to users.

    The argument count is interpreted by MPI only at the root, as is array_of_argv. Since the leading dimension of array_of_argv is count, a non-positive value of count at a non-root node could theoretically cause a runtime bounds check error, even though array_of_argv should be ignored by the subroutine. If this happens, you should explicitly supply a reasonable value of count on the non-root nodes. ( End of advice to users.)

    In any language, an application may use the constant MPI_ARGVS_NULL (which is likely to be (char ***)0 in C) to specify that no arguments should be passed to any commands. The effect of setting individual elements of array_of_argv to MPI_ARGV_NULL is not defined. To specify arguments for some commands but not others, the commands without arguments should have a corresponding argv whose first element is null ((char *)0 in C and empty string in Fortran).

    All of the spawned processes have the same MPI_COMM_WORLD. Their ranks in MPI_COMM_WORLD correspond directly to the order in which the commands are specified in MPI_COMM_SPAWN_MULTIPLE. Assume that $m_1$ processes are generated by the first command, $m_2$ by the second, etc. The processes corresponding to the first command have ranks $0, 1, \ldots, m_1 - 1$. The processes in the second command have ranks $m_1, m_1 + 1, \ldots, m_1 + m_2 - 1$. The processes in the third have ranks $m_1 + m_2, m_1 + m_2 + 1, \ldots, m_1 + m_2 + m_3 - 1$, etc.
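
    The rank assignment just described is a running sum over the per-command process counts. A minimal plain-C sketch follows; the helper child_first_ranks is hypothetical and not part of MPI, and the counts passed in are the numbers of processes actually started, which for a soft spawn may be smaller than the requested maxima.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI: given the number of processes
 * actually started for each of count commands, store in first_rank[i]
 * the MPI_COMM_WORLD rank of the first child running command i
 * (i is 0-based here). Returns the total number of children. */
int child_first_ranks(int count, const int started[], int first_rank[])
{
    int i, next = 0;

    assert(count > 0 && started != NULL && first_rank != NULL);
    for (i = 0; i < count; i++) {
        assert(started[i] >= 0);
        first_rank[i] = next;     /* ranks next .. next + started[i] - 1 */
        next += started[i];
    }
    return next;
}
```

    With counts 3, 2 and 4, the three commands start at ranks 0, 3 and 5, and the children's MPI_COMM_WORLD has 9 processes.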


    Advice to users.

    Calling MPI_COMM_SPAWN multiple times would create many sets of children with different MPI_COMM_WORLDs whereas MPI_COMM_SPAWN_MULTIPLE creates children with a single MPI_COMM_WORLD, so the two methods are not completely equivalent. There are also two performance-related reasons why, if you need to spawn multiple executables, you may want to use MPI_COMM_SPAWN_MULTIPLE instead of calling MPI_COMM_SPAWN several times. First, spawning several things at once may be faster than spawning them sequentially. Second, in some implementations, communication between processes spawned at the same time may be faster than communication between processes spawned separately. ( End of advice to users.)

    The array_of_errcodes argument is a one-dimensional array of size $\sum_{i=1}^{count} n_i$, where $n_i$ is the $i$-th element of array_of_maxprocs. Command number $i$ corresponds to the $n_i$ contiguous slots in this array from element $\sum_{j=1}^{i-1} n_j$ to $\left[\sum_{j=1}^{i} n_j\right] - 1$. Error codes are treated as for MPI_COMM_SPAWN.
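
    As an illustration of this slot layout, the following plain-C sketch tallies, for each command, how many of its slots hold something other than the success code. The helper failures_per_command is hypothetical and not part of MPI; the success code is passed in as a parameter (an application would pass MPI_SUCCESS) so that the sketch does not depend on mpi.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI: count the entries of
 * array_of_errcodes that differ from success_code for each command.
 * Command i (0-based) owns array_of_maxprocs[i] contiguous slots, laid
 * out in command order. Results go to failed[i]; the return value is
 * the total length of array_of_errcodes. */
int failures_per_command(int count, const int array_of_maxprocs[],
                         const int array_of_errcodes[], int success_code,
                         int failed[])
{
    int i, k, start = 0;

    assert(count > 0 && array_of_maxprocs != NULL);
    assert(array_of_errcodes != NULL && failed != NULL);
    for (i = 0; i < count; i++) {
        failed[i] = 0;
        for (k = start; k < start + array_of_maxprocs[i]; k++)
            if (array_of_errcodes[k] != success_code)
                failed[i]++;
        start += array_of_maxprocs[i];    /* first slot of the next command */
    }
    return start;
}
```

    Only the count per command is meaningful here: MPI does not specify which entries within a command's slots correspond to the failed processes.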


    Example. Examples of array_of_argv in C and Fortran

    To run the program ``ocean'' with arguments ``-gridfile'' and ``ocean1.grd'' and the program ``atmos'' with argument ``atmos.grd'' in C:

           char *array_of_commands[2] = {"ocean", "atmos"}; 
           char **array_of_argv[2]; 
           char *argv0[] = {"-gridfile", "ocean1.grd", (char *)0}; 
           char *argv1[] = {"atmos.grd", (char *)0}; 
           array_of_argv[0] = argv0; 
           array_of_argv[1] = argv1; 
           MPI_Comm_spawn_multiple(2, array_of_commands, array_of_argv, ...); 
    
    Here's how you do it in Fortran:
           CHARACTER*25 commands(2), array_of_argv(2, 3) 
           commands(1) = ' ocean ' 
           array_of_argv(1, 1) = ' -gridfile ' 
           array_of_argv(1, 2) = ' ocean1.grd' 
           array_of_argv(1, 3) = ' ' 
     
           commands(2) = ' atmos ' 
           array_of_argv(2, 1) = ' atmos.grd ' 
           array_of_argv(2, 2) = ' ' 
     
           call MPI_COMM_SPAWN_MULTIPLE(2, commands, array_of_argv, ...) 
    





    5.3.4. Reserved Keys



    The following keys are reserved. An implementation is not required to interpret these keys, but if it does interpret the key, it must provide the functionality described.

    host
        Value is a hostname. The format of the hostname is determined by the implementation.

    arch
        Value is an architecture name. Valid architecture names and what they mean are determined by the implementation.

    wdir
        Value is the name of a directory on a machine on which the spawned process(es) execute(s). This directory is made the working directory of the executing process(es). The format of the directory name is determined by the implementation.

    path
        Value is a directory or set of directories where the implementation should look for the executable. The format of path is determined by the implementation.

    file
        Value is the name of a file in which additional information is specified. The format of the filename and internal format of the file are determined by the implementation.

    soft
        Value specifies a set of numbers which are allowed values for the number of processes that MPI_COMM_SPAWN (et al.) may create. The format of the value is a comma-separated list of Fortran-90 triplets each of which specifies a set of integers and which together specify the set formed by the union of these sets. Negative values in this set and values greater than maxprocs are ignored. MPI will spawn the largest number of processes it can, consistent with some number in the set. The order in which triplets are given is not significant.

    By Fortran-90 triplets, we mean:

      1. a means a
      2. a:b means a, a+1, a+2, ..., b
      3. a:b:c means a, a+c, a+2c, ..., a+ck, where for c > 0, k is the largest integer for which $a + ck \le b$ and for c < 0, k is the largest integer for which $a + ck \ge b$. If b > a then c must be positive. If b < a then c must be negative.
    Examples:
      1. a:b gives a range between a and b
      2. 0:N gives full ``soft'' functionality
      3. 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 allows a power-of-two number of processes.
      4. 2:10000:2 allows an even number of processes.
      5. 2:10:2,7 allows 2, 4, 6, 7, 8, or 10 processes.
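
    The triplet rules and the ``largest allowed number'' selection can be sketched together in plain C. The function below is hypothetical and not part of MPI: it parses a soft value and returns the largest member of the set that does not exceed a given limit, where the limit stands for the smaller of maxprocs and the number of processes the runtime can start. How a real implementation parses the value or decides how many processes it can start is not specified here, and the sketch does no overflow checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of MPI: return the largest value in the
 * set described by soft (comma-separated triplets a, a:b, or a:b:c)
 * that lies between 0 and limit inclusive, or -1 if there is none or
 * the string is malformed. No overflow checking is done. */
long largest_allowed(const char *soft, long limit)
{
    long best = -1;
    const char *p = soft;

    assert(soft != NULL);
    while (*p != '\0') {
        char *end;
        long a, b, c = 1, v;
        int has_stride = 0;

        a = strtol(p, &end, 10);
        if (end == p) return -1;              /* no number found */
        p = end;
        b = a;                                /* plain "a" is the set {a} */
        if (*p == ':') {
            b = strtol(p + 1, &end, 10);
            if (end == p + 1) return -1;
            p = end;
            if (*p == ':') {
                c = strtol(p + 1, &end, 10);
                if (end == p + 1) return -1;
                p = end;
                has_stride = 1;
            }
        }
        /* An explicit stride must be nonzero and point from a toward b. */
        if (has_stride && (c == 0 || (b > a && c < 0) || (b < a && c > 0)))
            return -1;

        /* Walk a, a+c, a+2c, ... without passing b; negative values and
         * values above limit are ignored. */
        for (v = a; c > 0 ? (v <= b && v <= limit) : v >= b; v += c)
            if (v >= 0 && v <= limit && v > best)
                best = v;

        if (*p == ',') p++;                   /* next triplet */
        else if (*p != '\0') return -1;       /* unexpected character */
    }
    return best;
}
```

    For the value 2:10:2,7 and a limit of 9, the allowed set is 2, 4, 6, 7, 8, 10 and the function selects 8; with a limit of 1 no member qualifies, which corresponds to the case where the spawn raises an error of class MPI_ERR_SPAWN.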





    5.3.5. Spawn Example




    5.3.5.1. Manager-worker Example, Using MPI_SPAWN.




    /* manager */
    #include <stdio.h>
    #include "mpi.h"

    /* Application-supplied helpers, not part of MPI. error() is assumed to
     * report the message and terminate; choose_worker_program() is assumed
     * to store the name of the worker executable in its argument. */
    void error(const char *msg);
    void choose_worker_program(char *name);

    int main(int argc, char *argv[])
    {
       int world_size, universe_size, *universe_sizep, flag;
       MPI_Comm everyone;           /* intercommunicator */
       char worker_program[100];

       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &world_size);

       if (world_size != 1)    error("Top heavy with management");

       /* The attribute value is returned as a pointer to the integer. */
       MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                    &universe_sizep, &flag);
       if (!flag) {
            printf("This MPI does not support UNIVERSE_SIZE. How many\n"
                   "processes total?");
            scanf("%d", &universe_size);
       } else universe_size = *universe_sizep;
       if (universe_size == 1) error("No room to start workers");

       /*
        * Now spawn the workers. Note that there is a run-time determination
        * of what type of worker to spawn, and presumably this calculation must
        * be done at run time and cannot be calculated before starting
        * the program. If everything is known when the application is
        * first started, it is generally better to start them all at once
        * in a single MPI_COMM_WORLD.
        */

       choose_worker_program(worker_program);
       MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, universe_size-1,
                 MPI_INFO_NULL, 0, MPI_COMM_SELF, &everyone,
                 MPI_ERRCODES_IGNORE);
       /*
        * Parallel code here. The communicator "everyone" can be used
        * to communicate with the spawned processes, which have ranks
        * 0, ..., universe_size-2 in the remote group of the
        * intercommunicator "everyone".
        */

       MPI_Finalize();
       return 0;
    }
    

    /* worker */

    #include "mpi.h"

    /* Application-supplied helper, not part of MPI; assumed to report the
     * message and terminate. */
    void error(const char *msg);

    int main(int argc, char *argv[])
    {
       int size;
       MPI_Comm parent;
       MPI_Init(&argc, &argv);
       MPI_Comm_get_parent(&parent);
       if (parent == MPI_COMM_NULL) error("No parent!");
       MPI_Comm_remote_size(parent, &size);
       if (size != 1) error("Something's wrong with the parent");

       /*
        * Parallel code here.
        * The manager is represented as the process with rank 0 in (the remote
        * group of) the parent intercommunicator.  If the workers need to
        * communicate among themselves, they can use MPI_COMM_WORLD.
        */

       MPI_Finalize();
       return 0;
    }
     
     
    





    5.4. Establishing Communication



    This section provides functions that establish communication between two sets of MPI processes that do not share a communicator.

    Some situations in which these functions are useful are:

      1. Two parts of an application that are started independently need to communicate.
      2. A visualization tool wants to attach to a running process.
      3. A server wants to accept connections from multiple clients. Both clients and server may be parallel programs.

    In each of these situations, MPI must establish communication channels where none existed before, and there is no parent/child relationship. The routines described in this section establish communication between the two sets of processes by creating an MPI intercommunicator, where the two groups of the intercommunicator are the original sets of processes.

    Establishing contact between two groups of processes that do not share an existing communicator is a collective but asymmetric process. One group of processes indicates its willingness to accept connections from other groups of processes. We will call this group the (parallel) server, even if this is not a client/server type of application. The other group connects to the server; we will call it the client.


    Advice to users.

    While the names client and server are used throughout this section, MPI does not guarantee the traditional robustness of client/server systems. The functionality described in this section is intended to allow two cooperating parts of the same application to communicate with one another. For instance, a client that gets a segmentation fault and dies, or one that doesn't participate in a collective operation, may cause a server to crash or hang. ( End of advice to users.)





    5.4.1. Names, Addresses, Ports, and All That



    Almost all of the complexity in MPI client/server routines addresses the question ``how does the client find out how to contact the server?'' The difficulty, of course, is that there is no existing communication channel between them, yet they must somehow agree on a rendezvous point where they will establish communication: a Catch-22.

    Agreeing on a rendezvous point always involves a third party. The third party may itself provide the rendezvous point or may communicate rendezvous information from server to client. Complicating matters might be the fact that a client doesn't really care what server it contacts, only that it be able to get in touch with one that can handle its request.

    Ideally, MPI can accommodate a wide variety of run-time systems while retaining the ability to write simple portable code. The following should be compatible with MPI:


    MPI does not require a nameserver, so not all implementations will be able to support all of the above scenarios. However, MPI provides an optional nameserver interface, and is compatible with external name servers.

    A port_name is a system-supplied string that encodes a low-level network address at which a server can be contacted. Typically this is an IP address and a port number, but an implementation is free to use any protocol. The server establishes a port_name with the MPI_OPEN_PORT routine. It accepts a connection to a given port with MPI_COMM_ACCEPT. A client uses port_name to connect to the server.

    By itself, the port_name mechanism is completely portable, but it may be clumsy to use because of the necessity to communicate port_name to the client. It would be more convenient if a server could specify that it be known by an application-supplied service_name so that the client could connect to that service_name without knowing the port_name.

    An MPI implementation may allow the server to publish a (port_name, service_name) pair with MPI_PUBLISH_NAME and the client to retrieve the port name from the service name with MPI_LOOKUP_NAME. This allows three levels of portability, with increasing levels of functionality.

      1. Applications that do not rely on the ability to publish names are the most portable. Typically the port_name must be transferred "by hand" from server to client.
      2. Applications that use the MPI_PUBLISH_NAME mechanism are completely portable among implementations that provide this service. To be portable among all implementations, these applications should have a fall-back mechanism that can be used when names are not published.
      3. Applications may ignore MPI's name publishing functionality and use their own mechanism (possibly system-supplied) to publish names. This allows arbitrary flexibility but is not portable.
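    The fall-back mechanism mentioned in item 2 can be sketched as a small selection routine. This is only an illustration: choose_port_name and the port_source enumeration are hypothetical names, not part of MPI, and the sketch is kept free of MPI calls so that it compiles without an MPI installation. It assumes the caller has arranged for MPI_LOOKUP_NAME to return an error code rather than abort (for example by installing MPI_ERRORS_RETURN) and has obtained a fall-back port name by hand, such as from the command line.

```c
#include <stddef.h>
#include <string.h>

/* Where the port name used for the connection came from (hypothetical). */
enum port_source { PORT_NONE = 0, PORT_FROM_LOOKUP, PORT_FROM_FALLBACK };

/*
 * Pick the port name a client should connect to.
 *
 * lookup_ok  nonzero if name lookup succeeded, e.g. MPI_Lookup_name
 *            returned MPI_SUCCESS (assumes errors are returned, not fatal)
 * looked_up  port name produced by the lookup (ignored if !lookup_ok)
 * fallback   port name transferred "by hand", or NULL if there is none
 * out, cap   destination buffer, e.g. char out[MPI_MAX_PORT_NAME]
 *
 * Returns the source used, or PORT_NONE if no usable name fits in out.
 */
enum port_source choose_port_name(int lookup_ok, const char *looked_up,
                                  const char *fallback,
                                  char *out, size_t cap)
{
    const char *src = NULL;
    enum port_source which = PORT_NONE;

    if (lookup_ok && looked_up != NULL && looked_up[0] != '\0') {
        src = looked_up;
        which = PORT_FROM_LOOKUP;
    } else if (fallback != NULL && fallback[0] != '\0') {
        src = fallback;
        which = PORT_FROM_FALLBACK;
    }
    if (src == NULL || cap == 0 || strlen(src) >= cap)
        return PORT_NONE;

    memcpy(out, src, strlen(src) + 1);   /* copy including the terminator */
    return which;
}
```

    A client would typically pass the chosen name to MPI_COMM_CONNECT and report an error when PORT_NONE is returned.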



    5.4.2. Server Routines

    A server makes itself available with two routines. First, it must call MPI_OPEN_PORT to establish a port at which it may be contacted. Second, it must call MPI_COMM_ACCEPT to accept connections from clients.

    MPI_OPEN_PORT(info, port_name)
    IN info implementation-specific information on how to establish an address (handle)
    OUT port_name newly established port (string)
    int MPI_Open_port(MPI_Info info, char *port_name)
    MPI_OPEN_PORT(INFO, PORT_NAME, IERROR)
    CHARACTER*(*) PORT_NAME
    INTEGER INFO, IERROR
    void MPI::Open_port(const MPI::Info& info, char* port_name)

    This function establishes a network address, encoded in the port_name string, at which the server will be able to accept connections from clients. port_name is supplied by the system, possibly using information in the info argument.

    MPI copies a system-supplied port name into port_name. port_name identifies the newly opened port and can be used by a client to contact the server. The maximum length of a system-supplied port name is given by MPI_MAX_PORT_NAME.
    Advice to users.

    The system copies the port name into port_name. The application must pass a buffer of sufficient size to hold this value. ( End of advice to users.)

    port_name is essentially a network address. It is unique within the communication universe to which it belongs (determined by the implementation), and may be used by any client within that communication universe. For instance, if it is an internet (host:port) address, it will be unique on the internet. If it is a low level switch address on an IBM SP, it will be unique to that SP.
    Advice to implementors.

    These examples are not meant to constrain implementations. A port_name could, for instance, contain a user name or the name of a batch job, as long as it is unique within some well-defined communication domain. The larger the communication domain, the more useful MPI's client/server functionality will be. ( End of advice to implementors.)
    The precise form of the address is implementation-defined. For instance, an internet address may be a host name or IP address, or anything that the implementation can decode into an IP address. A port name may be reused after it is freed with MPI_CLOSE_PORT and released by the system.


    Advice to implementors.

    Since the user may type in port_name by hand, it is useful to choose a form that is easily readable and does not have embedded spaces. ( End of advice to implementors.)
    info may be used to tell the implementation how to establish the address. It may, and usually will, be MPI_INFO_NULL in order to get the implementation defaults.

    MPI_CLOSE_PORT(port_name)
    IN port_name a port (string)
    int MPI_Close_port(char *port_name)
    MPI_CLOSE_PORT(PORT_NAME, IERROR)
    CHARACTER*(*) PORT_NAME
    INTEGER IERROR
    void MPI::Close_port(const char* port_name)
    This function releases the network address represented by port_name.

    MPI_COMM_ACCEPT(port_name, info, root, comm, newcomm)
    IN port_name port name (string, used only on root)
    IN info implementation-dependent information (handle, used only on root)
    IN root rank in comm of root node (integer)
    IN comm intracommunicator over which call is collective (handle)
    OUT newcomm intercommunicator with client as remote group (handle)
    int MPI_Comm_accept(char *port_name, MPI_Info info, int root, MPI_Comm comm, MPI_Comm *newcomm)
    MPI_COMM_ACCEPT(PORT_NAME, INFO, ROOT, COMM, NEWCOMM, IERROR)
    CHARACTER*(*) PORT_NAME
    INTEGER INFO, ROOT, COMM, NEWCOMM, IERROR
    MPI::Intercomm MPI::Intracomm::Accept(const char* port_name, const MPI::Info& info, int root) const

    MPI_COMM_ACCEPT establishes communication with a client. It is collective over the calling communicator. It returns an intercommunicator that allows communication with the client.

    The port_name must have been established through a call to MPI_OPEN_PORT.

    info carries implementation-defined information that may allow fine control over the ACCEPT call.



    5.4.3. Client Routines

    There is only one routine on the client side.

    MPI_COMM_CONNECT(port_name, info, root, comm, newcomm)
    IN port_name network address (string, used only on root)
    IN info implementation-dependent information (handle, used only on root)
    IN root rank in comm of root node (integer)
    IN comm intracommunicator over which call is collective (handle)
    OUT newcomm intercommunicator with server as remote group (handle)
    int MPI_Comm_connect(char *port_name, MPI_Info info, int root, MPI_Comm comm, MPI_Comm *newcomm)
    MPI_COMM_CONNECT(PORT_NAME, INFO, ROOT, COMM, NEWCOMM, IERROR)
    CHARACTER*(*) PORT_NAME
    INTEGER INFO, ROOT, COMM, NEWCOMM, IERROR
    MPI::Intercomm MPI::Intracomm::Connect(const char* port_name, const MPI::Info& info, int root) const

    This routine establishes communication with a server specified by port_name. It is collective over the calling communicator and returns an intercommunicator in which the remote group participated in an MPI_COMM_ACCEPT.

    If the named port does not exist (or has been closed), MPI_COMM_CONNECT raises an error of class MPI_ERR_PORT.

    If the port exists, but does not have a pending MPI_COMM_ACCEPT, the connection attempt will eventually time out after an implementation-defined time, or succeed when the server calls MPI_COMM_ACCEPT. In the case of a time out, MPI_COMM_CONNECT raises an error of class MPI_ERR_PORT.


    Advice to implementors.

    The time out period may be arbitrarily short or long. However, a high quality implementation will try to queue connection attempts so that a server can handle simultaneous requests from several clients. A high quality implementation may also provide a mechanism, through the info arguments to MPI_OPEN_PORT, MPI_COMM_ACCEPT and/or MPI_COMM_CONNECT, for the user to control timeout and queuing behavior. ( End of advice to implementors.)
    MPI provides no guarantee of fairness in servicing connection attempts. That is, connection attempts are not necessarily satisfied in the order they were initiated and competition from other connection attempts may prevent a particular connection attempt from being satisfied.

    port_name is the address of the server. It must be the same as the name returned by MPI_OPEN_PORT on the server. Some freedom is allowed here. If there are equivalent forms of port_name, an implementation may accept them as well. For instance, if port_name is (hostname:port), an implementation may accept (ip_address:port) as well.



    5.4.4. Name Publishing

    The routines in this section provide a mechanism for publishing names. A (service_name, port_name) pair is published by the server, and may be retrieved by a client using the service_name only. An MPI implementation defines the scope of the service_name, that is, the domain over which the service_name can be retrieved. If the domain is the empty set, that is, if no client can retrieve the information, then we say that name publishing is not supported. Implementations should document how the scope is determined. High quality implementations will give some control to users through the info arguments to name publishing functions. Examples are given in the descriptions of individual functions.

    MPI_PUBLISH_NAME(service_name, info, port_name)
    IN service_name a service name to associate with the port (string)
    IN info implementation-specific information (handle)
    IN port_name a port name (string)
    int MPI_Publish_name(char *service_name, MPI_Info info, char *port_name)
    MPI_PUBLISH_NAME(SERVICE_NAME, INFO, PORT_NAME, IERROR)
    INTEGER INFO, IERROR
    CHARACTER*(*) SERVICE_NAME, PORT_NAME
    void MPI::Publish_name(const char* service_name, const MPI::Info& info, const char* port_name)

    This routine publishes the pair (port_name, service_name) so that an application may retrieve a system-supplied port_name using a well-known service_name.

    The implementation must define the scope of a published service name, that is, the domain over which the service name is unique, and conversely, the domain over which the (port name, service name) pair may be retrieved. For instance, a service name may be unique to a job (where job is defined by a distributed operating system or batch scheduler), unique to a machine, or unique to a Kerberos realm. The scope may depend on the info argument to MPI_PUBLISH_NAME.

    MPI permits publishing more than one service_name for a single port_name. On the other hand, if service_name has already been published within the scope determined by info, the behavior of MPI_PUBLISH_NAME is undefined. An MPI implementation may, through a mechanism in the info argument to MPI_PUBLISH_NAME, provide a way to allow multiple servers with the same service in the same scope. In this case, an implementation-defined policy will determine which of several port names is returned by MPI_LOOKUP_NAME.

    Note that while service_name has a limited scope, determined by the implementation, port_name always has global scope within the communication universe used by the implementation (i.e., it is globally unique).

    port_name should be the name of a port established by MPI_OPEN_PORT and not yet deleted by MPI_CLOSE_PORT. If it is not, the result is undefined.


    Advice to implementors.

    In some cases, an MPI implementation may use a name service that a user can also access directly. In this case, a name published by MPI could easily conflict with a name published by a user. In order to avoid such conflicts, MPI implementations should mangle service names so that they are unlikely to conflict with user code that makes use of the same service. Such name mangling will of course be completely transparent to the user.

    The following situation is problematic but unavoidable, if we want to allow implementations to use nameservers. Suppose there are multiple instances of "ocean" running on a machine. If the scope of a service name is confined to a job, then multiple oceans can coexist. If an implementation provides site-wide scope, however, multiple instances are not possible as all calls to MPI_PUBLISH_NAME after the first may fail. There is no universal solution to this.

    To handle these situations, a high quality implementation should make it possible to limit the domain over which names are published. ( End of advice to implementors.)
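    The name mangling suggested in this advice could look roughly like the following. It is a sketch under stated assumptions, not how any particular implementation works: the prefix "mpi-svc/" and both function names are hypothetical, and a real implementation would choose whatever scheme suits its name service.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical prefix reserved for names published through MPI. */
#define MANGLE_PREFIX "mpi-svc/"

/*
 * Build the name actually stored in the external name service.
 * Returns 0 on success, -1 if the mangled name does not fit in out.
 */
int mangle_service_name(const char *service_name, char *out, size_t cap)
{
    int n = snprintf(out, cap, "%s%s", MANGLE_PREFIX, service_name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/*
 * Recover the user-visible service name from a stored entry.
 * Returns NULL if the entry was not published through MPI.
 */
const char *unmangle_service_name(const char *stored)
{
    size_t plen = strlen(MANGLE_PREFIX);
    return strncmp(stored, MANGLE_PREFIX, plen) == 0 ? stored + plen : NULL;
}
```

    Because the prefix is added on publish and removed on lookup inside the implementation, the user only ever sees the plain service name.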
    MPI_UNPUBLISH_NAME(service_name, info, port_name)
    IN service_name a service name (string)
    IN info implementation-specific information (handle)
    IN port_name a port name (string)
    int MPI_Unpublish_name(char *service_name, MPI_Info info, char *port_name)
    MPI_UNPUBLISH_NAME(SERVICE_NAME, INFO, PORT_NAME, IERROR)
    INTEGER INFO, IERROR
    CHARACTER*(*) SERVICE_NAME, PORT_NAME
    void MPI::Unpublish_name(const char* service_name, const MPI::Info& info, const char* port_name)

    This routine unpublishes a service name that has been previously published. Attempting to unpublish a name that has not been published or has already been unpublished is erroneous and is indicated by the error class MPI_ERR_SERVICE.

    All published names must be unpublished before the corresponding port is closed and before the publishing process exits. The behavior of MPI_UNPUBLISH_NAME is implementation dependent when a process tries to unpublish a name that it did not publish.

    If the info argument was used with MPI_PUBLISH_NAME to tell the implementation how to publish names, the implementation may require that info passed to MPI_UNPUBLISH_NAME contain information to tell the implementation how to unpublish a name.

    MPI_LOOKUP_NAME(service_name, info, port_name)
    IN service_name a service name (string)
    IN info implementation-specific information (handle)
    OUT port_name a port name (string)
    int MPI_Lookup_name(char *service_name, MPI_Info info, char *port_name)
    MPI_LOOKUP_NAME(SERVICE_NAME, INFO, PORT_NAME, IERROR)
    CHARACTER*(*) SERVICE_NAME, PORT_NAME
    INTEGER INFO, IERROR
    void MPI::Lookup_name(const char* service_name, const MPI::Info& info, char* port_name)

    This function retrieves a port_name published by MPI_PUBLISH_NAME with service_name. If service_name has not been published, it raises an error in the error class MPI_ERR_NAME. The application must supply a port_name buffer large enough to hold the largest possible port name (see discussion above under MPI_OPEN_PORT).

    If an implementation allows multiple entries with the same service_name within the same scope, a particular port_name is chosen in a way determined by the implementation.

    If the info argument was used with MPI_PUBLISH_NAME to tell the implementation how to publish names, a similar info argument may be required for MPI_LOOKUP_NAME.



    5.4.5. Reserved Key Values

    The following key values are reserved. An implementation is not required to interpret these key values, but if it does interpret the key value, it must provide the functionality described.

    ip_port
        Value contains IP port number at which to establish a port. (Reserved for MPI_OPEN_PORT only.)
    ip_address
        Value contains IP address at which to establish a port. If the address is not a valid IP address of the host on which the MPI_OPEN_PORT call is made, the results are undefined. (Reserved for MPI_OPEN_PORT only.)
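    Info values are strings, so a port number has to be converted before it can be stored under the ip_port key. The helper below is a minimal sketch of that step. The function name is hypothetical, the decimal text form of the value is an assumption (this section does not spell out the encoding), and the port number 25000 in the comment is only an example.

```c
#include <stdio.h>

/*
 * Format an IP port number as a string value for the reserved
 * "ip_port" info key. Returns 0 on success, -1 if port is outside
 * 1..65535 or out is too small.
 *
 * Intended use with MPI (shown as a comment so that this sketch
 * compiles without an MPI installation):
 *
 *     MPI_Info info;
 *     char value[8], port_name[MPI_MAX_PORT_NAME];
 *     MPI_Info_create(&info);
 *     if (format_ip_port_value(25000, value, sizeof value) == 0)
 *         MPI_Info_set(info, "ip_port", value);
 *     MPI_Open_port(info, port_name);
 *     MPI_Info_free(&info);
 */
int format_ip_port_value(long port, char *out, size_t cap)
{
    int n;

    if (port < 1 || port > 65535)
        return -1;
    n = snprintf(out, cap, "%ld", port);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

    An implementation that does not interpret ip_port is free to ignore the key, so the application should not rely on the requested port actually being used.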



    5.4.6. Client/Server Examples


    5.4.6.1. Simplest Example --- Completely Portable.

    The following example shows the simplest way to use the client/server interface. It does not use service names at all.

    On the server side:

        char myport[MPI_MAX_PORT_NAME]; 
        MPI_Comm intercomm; 
        /* ... */ 
        MPI_Open_port(MPI_INFO_NULL, myport); 
        printf("port name is: %s\n", myport); 
     
        MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm); 
        /* do something with intercomm */ 
    
    The server prints out the port name to the terminal and the user must type it in when starting up the client (assuming the MPI implementation supports stdin such that this works). On the client side:
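        MPI_Comm intercomm;
        char name[MPI_MAX_PORT_NAME];
        printf("enter port name: ");
        fflush(stdout);
        /* fgets bounds the read; gets() cannot and was removed in C11 */
        if (fgets(name, MPI_MAX_PORT_NAME, stdin) == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        name[strcspn(name, "\n")] = '\0';  /* strip newline; needs <string.h> */
        MPI_Comm_connect(name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);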
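    Since the port name is typed in by hand, it is worth validating the input before handing it to MPI_COMM_CONNECT. The following is a sketch of one way to do that. The function name read_port_name is hypothetical, and rejecting embedded whitespace assumes that the implementation follows the advice to choose port names without embedded spaces; an application targeting an implementation that does not would need to relax that check.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one line from in into buf (capacity cap, e.g. MPI_MAX_PORT_NAME)
 * and strip the line terminator.
 * Returns 0 on success, -1 on end of file or read error, empty input,
 * a line too long for buf, or embedded whitespace.
 */
int read_port_name(FILE *in, char *buf, size_t cap)
{
    size_t len, i;

    if (cap < 2 || fgets(buf, (int)cap, in) == NULL)
        return -1;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        buf[--len] = '\0';
    else if (len == cap - 1)
        return -1;                      /* line did not fit in buf */
    if (len > 0 && buf[len - 1] == '\r')
        buf[--len] = '\0';              /* tolerate CR LF line endings */

    if (len == 0)
        return -1;
    for (i = 0; i < len; i++)
        if (isspace((unsigned char)buf[i]))
            return -1;
    return 0;
}
```

    A client could call read_port_name(stdin, name, MPI_MAX_PORT_NAME) and prompt again whenever it returns -1.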
    



    5.4.6.2. Ocean/Atmosphere - Relies on Name Publishing

    In this example, the "ocean" application is the "server" side of a coupled ocean-atmosphere climate model. It assumes that the MPI implementation publishes names.


        MPI_Open_port(MPI_INFO_NULL, port_name); 
        MPI_Publish_name("ocean", MPI_INFO_NULL, port_name); 
     
        MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm); 
        /* do something with intercomm */ 
        MPI_Unpublish_name("ocean", MPI_INFO_NULL, port_name); 
     
    
    On the client side:
        MPI_Lookup_name("ocean", MPI_INFO_NULL, port_name); 
        MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,  
                          &intercomm); 
    



    5.4.6.3. Simple Client-Server Example.

    This is a simple example; the server accepts only a single connection at a time and serves that connection until the client requests to be disconnected. The server is a single process.

    Here is the server. It accepts a single connection and then processes data until it receives a message with tag 1. A message with tag 0 tells the server to exit.

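    #include <stdio.h>
    #include "mpi.h"
    #define MAX_DATA 1024  /* assumed size; the original leaves MAX_DATA undefined */
    int main( int argc, char **argv )
    {
        MPI_Comm client;
        MPI_Status status;
        char port_name[MPI_MAX_PORT_NAME];
        double buf[MAX_DATA];
        int    size, again;

        MPI_Init( &argc, &argv );
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 1) {            /* the server must be a single process */
            fprintf(stderr, "Server too big\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }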
        MPI_Open_port(MPI_INFO_NULL, port_name); 
        printf("server available at %s\n",port_name); 
        while (1) { 
            MPI_Comm_accept( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,  
                             &client ); 
            again = 1; 
            while (again) { 
                MPI_Recv( buf, MAX_DATA, MPI_DOUBLE,  
                          MPI_ANY_SOURCE, MPI_ANY_TAG, client, &status ); 
                switch (status.MPI_TAG) { 
                    case 0: MPI_Comm_free( &client ); 
                            MPI_Close_port(port_name); 
                            MPI_Finalize(); 
                            return 0; 
                    case 1: MPI_Comm_disconnect( &client ); 
                            again = 0; 
                            break; 
                    case 2: /* do something */ 
                    ... 
                    default: 
                            /* Unexpected message type */ 
                            MPI_Abort( MPI_COMM_WORLD, 1 ); 
                }
            }
        }
    } 
    
    Here is the client.


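    #include <string.h>
    #include "mpi.h"
    #define MAX_DATA 1024  /* assumed size; the original leaves MAX_DATA undefined */
    int main( int argc, char **argv )
    {
        MPI_Comm server;
        double buf[MAX_DATA];
        char port_name[MPI_MAX_PORT_NAME];
        int done = 0, tag, n = 0;  /* assumed declarations; the original omits them */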
     
        MPI_Init( &argc, &argv ); 
        strcpy(port_name, argv[1] );/* assume server's name is cmd-line arg */ 
     
        MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,  
                          &server ); 
     
        while (!done) { 
            tag = 2; /* Action to perform */ 
            MPI_Send( buf, n, MPI_DOUBLE, 0, tag, server ); 
            /* etc */ 
        }
        MPI_Send( buf, 0, MPI_DOUBLE, 0, 1, server ); 
        MPI_Comm_disconnect( &server ); 
        MPI_Finalize(); 
        return 0; 
    } 
    



    5.5. Other Functionality


    5.5.1. Universe Size

    Many "dynamic" MPI applications are expected to exist in a static runtime environment, in which resources have been allocated before the application is run. When a user (or possibly a batch system) runs one of these quasi-static applications, she will usually specify a number of processes to start and a total number of processes that are expected. An application simply needs to know how many slots there are, i.e., how many processes it should spawn.

    MPI provides an attribute on MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, that allows the application to obtain this information in a portable manner. This attribute indicates the total number of processes that are expected. In Fortran, the attribute is the integer value. In C, the attribute is a pointer to the integer value. An application typically subtracts the size of MPI_COMM_WORLD from MPI_UNIVERSE_SIZE to find out how many processes it should spawn. MPI_UNIVERSE_SIZE is initialized in MPI_INIT and is not changed by MPI. If defined, it has the same value on all processes of MPI_COMM_WORLD. MPI_UNIVERSE_SIZE is determined by the application startup mechanism in a way not specified by MPI. (The size of MPI_COMM_WORLD is another example of such a parameter.)
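    The subtraction described above can be sketched as follows. The helper processes_to_spawn is a hypothetical name, and it is kept free of MPI calls so that it compiles without an MPI installation; the comment shows where the attribute query would sit. Returning 0 when the attribute is not set is a choice made for this sketch, since the text leaves that case to the application.

```c
#include <stddef.h>

/*
 * universe_attr  attribute value as C code receives it: a pointer to the
 *                integer (NULL is tolerated defensively)
 * flag           nonzero if MPI_UNIVERSE_SIZE is set
 * world_size     size of MPI_COMM_WORLD
 *
 * Returns the number of additional processes to spawn, or 0 when the
 * attribute is unset or the universe is already full.
 *
 * Intended use with MPI (not compiled here):
 *
 *     int *usize, flag, wsize;
 *     MPI_Comm_size(MPI_COMM_WORLD, &wsize);
 *     MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &usize, &flag);
 *     int n = processes_to_spawn(usize, flag, wsize);
 */
int processes_to_spawn(const int *universe_attr, int flag, int world_size)
{
    if (!flag || universe_attr == NULL)
        return 0;
    if (*universe_attr <= world_size)
        return 0;
    return *universe_attr - world_size;
}
```

    With a universe size of 8 and three processes already in MPI_COMM_WORLD, the helper reports that five more processes may be spawned.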

    Possibilities for how MPI_UNIVERSE_SIZE might be set include


    An implementation must document how MPI_UNIVERSE_SIZE is set. An implementation may not support the ability to set MPI_UNIVERSE_SIZE, in which case the attribute MPI_UNIVERSE_SIZE is not set.

    MPI_UNIVERSE_SIZE is a recommendation, not necessarily a hard limit. For instance, some implementations may allow an application to spawn 50 processes per processor, if they are requested. However, it is likely that the user only wants to spawn one process per processor.

    MPI_UNIVERSE_SIZE is assumed to have been specified when an application was started, and is in essence a portable mechanism to allow the user to pass to the application (through the MPI process startup mechanism, such as mpiexec) a piece of critical runtime information. Note that no interaction with the runtime environment is required. If the runtime environment changes size while an application is running, MPI_UNIVERSE_SIZE is not updated, and the application must find out about the change through direct communication with the runtime system.



    5.5.2. Singleton MPI_INIT

    A high-quality implementation will allow any process (including those not started with a "parallel application" mechanism) to become an MPI process by calling MPI_INIT. Such a process can then connect to other MPI processes using the MPI_COMM_ACCEPT and MPI_COMM_CONNECT routines, or spawn other MPI processes. MPI does not mandate this behavior, but strongly encourages it where technically feasible.


    Advice to implementors.

    To start an MPI-1 application with more than one process requires some special coordination. The processes must be started at the "same" time, they must have a mechanism to establish communication, etc. Either the user or the operating system must take special steps beyond simply starting processes.

    When an application enters MPI_INIT, clearly it must be able to determine if these special steps were taken. MPI-1 does not say what happens if these special steps were not taken; presumably this is treated as an error in starting the MPI application. MPI-2 recommends the following behavior.

    If a process enters MPI_INIT and determines that no special steps were taken (i.e., it has not been given the information to form an MPI_COMM_WORLD with other processes) it succeeds and forms a singleton MPI program, that is, one in which MPI_COMM_WORLD has size 1.

    In some implementations, MPI may not be able to function without an "MPI environment." For example, MPI may require that daemons be running or MPI may not be able to work at all on the front-end of an MPP. In this case, an MPI implementation may either

      1. Create the environment (e.g., start a daemon) or
      2. Raise an error if it cannot create the environment and the environment has not been started independently.
    A high quality implementation will try to create a singleton MPI process and not raise an error.

    ( End of advice to implementors.)



    5.5.3. MPI_APPNUM

    There is a predefined attribute MPI_APPNUM of MPI_COMM_WORLD. In Fortran, the attribute is an integer value. In C, the attribute is a pointer to an integer value. If a process was spawned with MPI_COMM_SPAWN_MULTIPLE, MPI_APPNUM is the command number that generated the current process. Numbering starts from zero. If a process was spawned with MPI_COMM_SPAWN, it will have MPI_APPNUM equal to zero.

    Additionally, if the process was not started by a spawn call, but by an implementation-specific startup mechanism that can handle multiple process specifications, MPI_APPNUM should be set to the number of the corresponding process specification. In particular, if it is started with

        mpiexec spec0 [: spec1 : spec2 : ...] 
    
    MPI_APPNUM should be set to the number of the corresponding specification.

    If an application was not spawned with MPI_COMM_SPAWN or MPI_COMM_SPAWN_MULTIPLE, and MPI_APPNUM doesn't make sense in the context of the implementation-specific startup mechanism, MPI_APPNUM is not set.
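    A process can use MPI_APPNUM to decide which part of a multi-component application it belongs to. The sketch below shows the decoding step only. Both function names are hypothetical, the value -1 for "not set" is a convention of this sketch, and the mapping of specification 0 to "ocean" and 1 to "atmosphere" is an assumed launch such as mpiexec ocean : atmosphere.

```c
#include <stddef.h>

/*
 * appnum_attr  attribute value as C code receives it: a pointer to the
 *              integer, e.g. from
 *              MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &p, &flag)
 * flag         nonzero if MPI_APPNUM is set
 *
 * Returns the application number, or -1 if the attribute is not set.
 */
int appnum_or_unset(const int *appnum_attr, int flag)
{
    return (flag && appnum_attr != NULL) ? *appnum_attr : -1;
}

/* Hypothetical mapping for a launch such as "mpiexec ocean : atmosphere". */
const char *component_name(int appnum)
{
    switch (appnum) {
    case 0:  return "ocean";
    case 1:  return "atmosphere";
    default: return "unknown";
    }
}
```

    Checking the flag first matters because the attribute is simply absent when the startup mechanism has no notion of process specifications.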

    MPI implementations may optionally provide a mechanism to override the value of MPI_APPNUM through the info argument. MPI reserves the following key for all SPAWN calls.



    Rationale.

    When a single application is started, it is able to figure out how many processes there are by looking at the size of MPI_COMM_WORLD. An application consisting of multiple SPMD sub-applications has no way to find out how many sub-applications there are and to which sub-application the process belongs. While there are ways to figure it out in special cases, there is no general mechanism. MPI_APPNUM provides such a general mechanism. ( End of rationale.)



    5.5.4. Releasing Connections

    Before a client and server connect, they are independent MPI applications. An error in one does not affect the other. After establishing a connection with MPI_COMM_CONNECT and MPI_COMM_ACCEPT, an error in one may affect the other. It is desirable for a client and server to be able to disconnect, so that an error in one will not affect the other. Similarly, it might be desirable for a parent and child to disconnect, so that errors in the child do not affect the parent, or vice-versa.


    The following additional rules apply to MPI-1 functions:
    MPI_COMM_DISCONNECT(comm)
    INOUT comm communicator (handle)
    int MPI_Comm_disconnect(MPI_Comm *comm)
    MPI_COMM_DISCONNECT(COMM, IERROR)
    INTEGER COMM, IERROR

    void MPI::Comm::Disconnect()

    This function waits for all pending communication on comm to complete internally, deallocates the communicator object, and sets the handle to MPI_COMM_NULL. It is a collective operation.

    It may not be called with the communicator MPI_COMM_WORLD or MPI_COMM_SELF.

    MPI_COMM_DISCONNECT may be called only if all communication is complete and matched, so that buffered data can be delivered to its destination. This requirement is the same as for MPI_FINALIZE.

    MPI_COMM_DISCONNECT has the same action as MPI_COMM_FREE, except that it waits for pending communication to finish internally and enables the guarantee about the behavior of disconnected processes.


    Advice to users.

    To disconnect two processes you may need to call MPI_COMM_DISCONNECT, MPI_WIN_FREE and MPI_FILE_CLOSE to remove all communication paths between the two processes. Note that it may be necessary to disconnect several communicators (or to free several windows or files) before two processes are completely independent. ( End of advice to users.)


    Rationale.

    It would be nice to be able to use MPI_COMM_FREE instead, but that function explicitly does not wait for pending communication to complete. ( End of rationale.)



    5.5.5. Another Way to Establish MPI Communication

    MPI_COMM_JOIN(fd, intercomm)
    IN fd socket file descriptor
    OUT intercomm new intercommunicator (handle)
    int MPI_Comm_join(int fd, MPI_Comm *intercomm)
    MPI_COMM_JOIN(FD, INTERCOMM, IERROR)
    INTEGER FD, INTERCOMM, IERROR
    static MPI::Intercomm MPI::Comm::Join(const int fd)

    MPI_COMM_JOIN is intended for MPI implementations that exist in an environment supporting the Berkeley Socket interface [14,17]. Implementations that exist in an environment not supporting Berkeley Sockets should provide the entry point for MPI_COMM_JOIN and should return MPI_COMM_NULL.

    This call creates an intercommunicator from the union of two MPI processes which are connected by a socket. MPI_COMM_JOIN should normally succeed if the local and remote processes have access to the same implementation-defined MPI communication universe.


    Advice to users. An MPI implementation may require a specific communication medium for MPI communication, such as a shared memory segment or a special switch. In this case, it may not be possible for two processes to successfully join even if there is a socket connecting them and they are using the same MPI implementation. ( End of advice to users.)

    Advice to implementors.

    A high quality implementation will attempt to establish communication over a slow medium if its preferred one is not available. If implementations do not do this, they must document why they cannot do MPI communication over the medium used by the socket (especially if the socket is a TCP connection). ( End of advice to implementors.)
    fd is a file descriptor representing a socket of type SOCK_STREAM (a two-way reliable byte-stream connection). Non-blocking I/O and asynchronous notification via SIGIO must not be enabled for the socket. The socket must be in a connected state. The socket must be quiescent when MPI_COMM_JOIN is called (see below). It is the responsibility of the application to create the socket using standard socket API calls.

    MPI_COMM_JOIN must be called by the process at each end of the socket. It does not return until both processes have called MPI_COMM_JOIN. The two processes are referred to as the local and remote processes.

    MPI uses the socket to bootstrap creation of the intercommunicator, and for nothing else. Upon return from MPI_COMM_JOIN, the file descriptor will be open and quiescent (see below).

    If MPI is unable to create an intercommunicator, but is able to leave the socket in its original state, with no pending communication, it succeeds and sets intercomm to MPI_COMM_NULL.

    The socket must be quiescent before MPI_COMM_JOIN is called and after MPI_COMM_JOIN returns. More specifically, on entry to MPI_COMM_JOIN, a read on the socket will not read any data that was written to the socket before the remote process called MPI_COMM_JOIN. On exit from MPI_COMM_JOIN, a read will not read any data that was written to the socket before the remote process returned from MPI_COMM_JOIN. It is the responsibility of the application to ensure the first condition, and the responsibility of the MPI implementation to ensure the second. In a multithreaded application, the application must ensure that one thread does not access the socket while another is calling MPI_COMM_JOIN, or call MPI_COMM_JOIN concurrently.


    Advice to implementors.

    MPI is free to use any available communication path(s) for MPI messages in the new communicator; the socket is only used for the initial handshaking. ( End of advice to implementors.)
    MPI_COMM_JOIN uses non-MPI communication to do its work. The interaction of non-MPI communication with pending MPI communication is not defined. Therefore, the result of calling MPI_COMM_JOIN on two connected processes (see Section Releasing Connections for the definition of connected) is undefined.

    The returned communicator may be used to establish MPI communication with additional processes, through the usual MPI communicator creation mechanisms.





    6. One-Sided Communications




    6.1. Introduction



    Remote Memory Access (RMA) extends the communication mechanisms of MPI by allowing one process to specify all communication parameters, both for the sending side and for the receiving side. This mode of communication facilitates the coding of some applications with dynamically changing data access patterns where the data distribution is fixed or slowly changing. In such a case, each process can compute what data it needs to access or update at other processes. However, processes may not know which data in their own memory need to be accessed or updated by remote processes, and may not even know the identity of these processes. Thus, the transfer parameters are all available only on one side. Regular send/receive communication requires matching operations by sender and receiver. In order to issue the matching operations, an application needs to distribute the transfer parameters. This may require all processes to participate in a time-consuming global computation, or to periodically poll for potential communication requests to receive and act upon. The use of RMA communication mechanisms avoids the need for global computations or explicit polling. A generic example of this nature is the execution of an assignment of the form A = B(map), where map is a permutation vector, and A, B and map are distributed in the same manner.

    Message-passing communication achieves two effects: communication of data from sender to receiver; and synchronization of sender with receiver. The RMA design separates these two functions. Three communication calls are provided: MPI_PUT (remote write), MPI_GET (remote read) and MPI_ACCUMULATE (remote update). A larger number of synchronization calls are provided that support different synchronization styles. The design is similar to that of weakly coherent memory systems: correct ordering of memory accesses has to be imposed by the user, using synchronization calls; the implementation can delay communication operations until the synchronization calls occur, for efficiency.

    The design of the RMA functions allows implementors to take advantage, in many cases, of fast communication mechanisms provided by various platforms, such as coherent or noncoherent shared memory, DMA engines, hardware-supported put/get operations, communication coprocessors, etc. The most frequently used RMA communication mechanisms can be layered on top of message passing. However, support for asynchronous communication agents (handlers, threads, etc.) is needed, for certain RMA functions, in a distributed memory environment.

    We shall denote by origin the process that performs the call, and by target the process in which the memory is accessed. Thus, in a put operation, source=origin and destination=target; in a get operation, source=target and destination=origin.





    6.2. Initialization




    6.2.1. Window Creation



    The initialization operation allows each process in an intracommunicator group to specify, in a collective operation, a ``window'' in its memory that is made accessible to remote processes. The call returns an opaque object that represents the group of processes that own and access the set of windows, and the attributes of each window, as specified by the initialization call.

    MPI_WIN_CREATE(base, size, disp_unit, info, comm, win)
    IN   base        initial address of window (choice)
    IN   size        size of window in bytes (nonnegative integer)
    IN   disp_unit   local unit size for displacements, in bytes (positive integer)
    IN   info        info argument (handle)
    IN   comm        communicator (handle)
    OUT  win         window object returned by the call (handle)

    int MPI_Win_create(void *base, MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, MPI_Win *win)

    MPI_WIN_CREATE(BASE, SIZE, DISP_UNIT, INFO, COMM, WIN, IERROR)
    <type> BASE(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) SIZE
    INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR

    static MPI::Win MPI::Win::Create(const void* base, MPI::Aint size, int disp_unit, const MPI::Info& info, const MPI::Intracomm& comm)

    This is a collective call executed by all processes in the group of comm. It returns a window object that can be used by these processes to perform RMA operations. Each process specifies a window of existing memory that it exposes to RMA accesses by the processes in the group of comm. The window consists of size bytes, starting at address base. A process may elect to expose no memory by specifying size = 0.

    The displacement unit argument is provided to facilitate address arithmetic in RMA operations: the target displacement argument of an RMA operation is scaled by the factor disp_unit specified by the target process, at window creation.


    Rationale.

    The window size is specified using an address-sized integer, so as to allow windows that span more than 4 GB of address space. (Even if the physical memory size is less than 4 GB, the address range may be larger than 4 GB, if addresses are not contiguous.) ( End of rationale.)

    Advice to users.

    Common choices for disp_unit are 1 (no scaling), and (in C syntax) sizeof(type), for a window that consists of an array of elements of type type. The latter choice will allow one to use array indices in RMA calls, and have those scaled correctly to byte displacements, even in a heterogeneous environment. ( End of advice to users.)
    The info argument provides optimization hints to the runtime about the expected usage pattern of the window. The following info key is predefined:

    no_locks --- if set to true, then the implementation may assume that the local window is never locked (by a call to MPI_WIN_LOCK). This implies that this window is not used for 3-party communication, and RMA can be implemented with no (or less) asynchronous agent activity at this process.

    The various processes in the group of comm may specify completely different target windows, in location, size, displacement units and info arguments. As long as all the get, put and accumulate accesses to a particular process fit their specific target window, this should pose no problem. The same area in memory may appear in multiple windows, each associated with a different window object. However, concurrent communications to distinct, overlapping windows may lead to erroneous results.


    Advice to users.

    A window can be created in any part of the process memory. However, on some systems, the performance of windows in memory allocated by MPI_ALLOC_MEM (Section Memory Allocation ) will be better. Also, on some systems, performance is improved when window boundaries are aligned at ``natural'' boundaries (word, double-word, cache line, page frame, etc.). ( End of advice to users.)

    Advice to implementors.

    In cases where RMA operations use different mechanisms in different memory areas (e.g., load/store in a shared memory segment, and an asynchronous handler in private memory), the MPI_WIN_CREATE call needs to figure out which type of memory is used for the window. To do so, MPI maintains, internally, the list of memory segments allocated by MPI_ALLOC_MEM, or by other, implementation-specific, mechanisms, together with information on the type of memory segment allocated. When a call to MPI_WIN_CREATE occurs, MPI checks which segment contains each window, and decides, accordingly, which mechanism to use for RMA operations.

    Vendors may provide additional, implementation-specific mechanisms to allow ``good'' memory to be used for static variables.

    Implementors should document any performance impact of window alignment. ( End of advice to implementors.)
    MPI_WIN_FREE(win)
    INOUT  win   window object (handle)

    int MPI_Win_free(MPI_Win *win)

    MPI_WIN_FREE(WIN, IERROR)
    INTEGER WIN, IERROR

    void MPI::Win::Free()

    Frees the window object win and returns a null handle (equal to MPI_WIN_NULL). This is a collective call executed by all processes in the group associated with win. MPI_WIN_FREE(win) can be invoked by a process only after it has completed its involvement in RMA communications on window win: i.e., the process has called MPI_WIN_FENCE, or called MPI_WIN_WAIT to match a previous call to MPI_WIN_POST or called MPI_WIN_COMPLETE to match a previous call to MPI_WIN_START or called MPI_WIN_UNLOCK to match a previous call to MPI_WIN_LOCK. When the call returns, the window memory can be freed.


    Advice to implementors.

    MPI_WIN_FREE requires a barrier synchronization: no process can return from free until all processes in the group of win have called free. This ensures that no process will attempt to access a remote window (e.g., with lock/unlock) after it was freed. ( End of advice to implementors.)





    6.2.2. Window Attributes



    The following three attributes are cached with a window, when the window is created.
    MPI_WIN_BASE        window base address.
    MPI_WIN_SIZE        window size, in bytes.
    MPI_WIN_DISP_UNIT   displacement unit associated with the window.

    In C, calls to MPI_Win_get_attr(win, MPI_WIN_BASE, &base, &flag),
    MPI_Win_get_attr(win, MPI_WIN_SIZE, &size, &flag) and
    MPI_Win_get_attr(win, MPI_WIN_DISP_UNIT, &disp_unit, &flag) will return in base a pointer to the start of the window win, and will return in size and disp_unit pointers to the size and displacement unit of the window, respectively. The same holds in C++.

    In Fortran, calls to MPI_WIN_GET_ATTR(win, MPI_WIN_BASE, base, flag, ierror),
    MPI_WIN_GET_ATTR(win, MPI_WIN_SIZE, size, flag, ierror) and
    MPI_WIN_GET_ATTR(win, MPI_WIN_DISP_UNIT, disp_unit, flag, ierror) will return in base, size and disp_unit the (integer representation of the) base address, the size and the displacement unit of the window win, respectively. (The window attribute access functions are defined in Section New Attribute Caching Functions.)

    The other ``window attribute,'' namely the group of processes attached to the window, can be retrieved using the call below.

    MPI_WIN_GET_GROUP(win, group)
    IN   win     window object (handle)
    OUT  group   group of processes which share access to the window (handle)

    int MPI_Win_get_group(MPI_Win win, MPI_Group *group)
    MPI_WIN_GET_GROUP(WIN, GROUP, IERROR)
    INTEGER WIN, GROUP, IERROR
    MPI::Group MPI::Win::Get_group() const

    MPI_WIN_GET_GROUP returns a duplicate of the group of the communicator used to create the window associated with win. The group is returned in group.





    6.3. Communication Calls



    MPI supports three RMA communication calls: MPI_PUT transfers data from the caller memory (origin) to the target memory; MPI_GET transfers data from the target memory to the caller memory; and MPI_ACCUMULATE updates locations in the target memory, e.g. by adding to these locations values sent from the caller memory. These operations are nonblocking: the call initiates the transfer, but the transfer may continue after the call returns. The transfer is completed, both at the origin and at the target, when a subsequent synchronization call is issued by the caller on the involved window object. These synchronization calls are described in Section Synchronization Calls .

    The local communication buffer of an RMA call should not be updated, and the local communication buffer of a get call should not be accessed after the RMA call, until the subsequent synchronization call completes.


    Rationale.

    The rule above is more lenient than for message passing, where we do not allow two concurrent sends, with overlapping send buffers. Here, we allow two concurrent puts with overlapping send buffers. The reasons for this relaxation are

      1. Users do not like that restriction, which is not very natural (it prohibits concurrent reads).
      2. Weakening the rule does not prevent efficient implementation, as far as we know.
      3. Weakening the rule is important for performance of RMA: we want to associate one synchronization call with as many RMA operations as possible. If puts from overlapping buffers cannot be concurrent, then we need to needlessly add synchronization points in the code.

    ( End of rationale.)
    It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target. There is one exception to this rule; namely, the same location can be updated by several concurrent accumulate calls, the outcome being as if these updates occurred in some order. In addition, a window cannot concurrently be updated by a put or accumulate operation and by a local store operation, even if these two updates access different locations in the window. The last restriction enables more efficient implementations of RMA operations on many systems. These restrictions are described in more detail in Section Semantics and Correctness.

    The calls use general datatype arguments to specify communication buffers at the origin and at the target. Thus, a transfer operation may also gather data at the source and scatter it at the destination. However, all arguments specifying both communication buffers are provided by the caller.

    For all three calls, the target process may be identical with the origin process; i.e., a process may use an RMA operation to move data in its memory.


    Rationale.

    The choice of supporting ``self-communication'' is the same as for message passing. It simplifies some coding, and is very useful with accumulate operations, to allow atomic updates of local variables. ( End of rationale.)





    6.3.1. Put



    The execution of a put operation is similar to the execution of a send by the origin process and a matching receive by the target process. The obvious difference is that all arguments are provided by one call --- the call executed by the origin process.

    MPI_PUT(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win)
    IN   origin_addr       initial address of origin buffer (choice)
    IN   origin_count      number of entries in origin buffer (nonnegative integer)
    IN   origin_datatype   datatype of each entry in origin buffer (handle)
    IN   target_rank       rank of target (nonnegative integer)
    IN   target_disp       displacement from start of window to target buffer (nonnegative integer)
    IN   target_count      number of entries in target buffer (nonnegative integer)
    IN   target_datatype   datatype of each entry in target buffer (handle)
    IN   win               window object used for communication (handle)

    int MPI_Put(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)

    MPI_PUT(ORIGIN_ADDR, ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_DISP, TARGET_COUNT, TARGET_DATATYPE, WIN, IERROR)
    <type> ORIGIN_ADDR(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) TARGET_DISP
    INTEGER ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_COUNT, TARGET_DATATYPE, WIN, IERROR

    void MPI::Win::Put(const void* origin_addr, int origin_count, const MPI::Datatype& origin_datatype, int target_rank, MPI::Aint target_disp, int target_count, const MPI::Datatype& target_datatype) const

    Transfers origin_count successive entries of the type specified by the origin_datatype, starting at address origin_addr on the origin node to the target node specified by the win, target_rank pair. The data are written in the target buffer at address target_addr = window_base + target_disp×disp_unit, where window_base and disp_unit are the base address and window displacement unit specified at window initialization, by the target process.
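
    The address computation above can be illustrated with a minimal C sketch. It makes no MPI calls; the helper names target_byte_offset and target_address are hypothetical and only model how target_disp is scaled by the disp_unit that the target process supplied at window creation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MPI): byte offset, relative to
 * window_base, that a displacement of target_disp units refers to when
 * the target created its window with displacement unit disp_unit. */
static size_t target_byte_offset(size_t target_disp, size_t disp_unit)
{
    assert(disp_unit > 0);            /* disp_unit is a positive integer */
    return target_disp * disp_unit;   /* equals target_addr - window_base */
}

/* Hypothetical helper: the resulting target_addr itself. */
static char *target_address(void *window_base, size_t target_disp,
                            size_t disp_unit)
{
    return (char *)window_base + target_byte_offset(target_disp, disp_unit);
}
```

    With disp_unit = sizeof(double) and a window that starts at the first element of an array of double, a target_disp of 3 addresses the fourth array element; with disp_unit = 1 the displacement is a plain byte offset.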

    The target buffer is specified by the arguments target_count and target_datatype.

    The data transfer is the same as that which would occur if the origin process executed a send operation with arguments origin_addr, origin_count, origin_datatype, target_rank, tag, comm, and the target process executed a receive operation with arguments target_addr, target_count, target_datatype, source, tag, comm, where target_addr is the target buffer address computed as explained above, and comm is a communicator for the group of win.

    The communication must satisfy the same constraints as for a similar message-passing communication. The target_datatype may not specify overlapping entries in the target buffer. The message sent must fit, without truncation, in the target buffer. Furthermore, the target buffer must fit in the target window.

    The target_datatype argument is a handle to a datatype object defined at the origin process. However, this object is interpreted at the target process: the outcome is as if the target datatype object was defined at the target process, by the same sequence of calls used to define it at the origin process. The target datatype must contain only relative displacements, not absolute addresses. The same holds for get and accumulate.


    Advice to users.

    The target_datatype argument is a handle to a datatype object that is defined at the origin process, even though it defines a data layout in the target process memory. This causes no problems in a homogeneous environment, or in a heterogeneous environment, if only portable datatypes are used (portable datatypes are defined in Section Semantic Terms ).

    The performance of a put transfer can be significantly affected, on some systems, by the choice of window location and the shape and location of the origin and target buffers: transfers to a target window in memory allocated by MPI_ALLOC_MEM may be much faster on shared memory systems; transfers from contiguous buffers will be faster on most, if not all, systems; the alignment of the communication buffers may also impact performance. ( End of advice to users.)

    Advice to implementors.

    A high quality implementation will attempt to prevent remote accesses to memory outside the window that was exposed by the process. This is useful both for debugging purposes and for protection with client-server codes that use RMA. That is, a high-quality implementation will check, if possible, window bounds on each RMA call, and raise an MPI exception at the origin call if an out-of-bound situation occurred. Note that the condition can be checked at the origin. Of course, the added safety achieved by such checks has to be weighed against their added cost. ( End of advice to implementors.)
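
    For the simple case of a contiguous target buffer, a window bounds check of this kind could look like the following C sketch. The function name, the restriction to contiguous buffers, and the assumption that the origin already knows the target's window size and displacement unit are all made for illustration; derived target datatypes would need a more general test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check (not an MPI function): does a contiguous target
 * buffer of count elements of elem_size bytes each, placed target_disp
 * displacement units into the window, lie entirely inside a window of
 * win_size bytes?  Divisions are used so that no product can overflow
 * before it is compared. */
static bool target_fits_in_window(size_t win_size, size_t disp_unit,
                                  size_t target_disp, size_t count,
                                  size_t elem_size)
{
    assert(disp_unit > 0);
    if (target_disp > win_size / disp_unit)
        return false;                        /* buffer starts past the window */
    size_t start = target_disp * disp_unit;  /* byte offset of the buffer */
    size_t room = win_size - start;          /* bytes available from start */
    if (count == 0 || elem_size == 0)
        return true;                         /* an empty transfer always fits */
    return count <= room / elem_size;
}
```

    A caller modelling the advice above would report an error at the origin whenever this function returns false, instead of issuing the transfer.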





    6.3.2. Get



    MPI_GET(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win)
    OUT  origin_addr       initial address of origin buffer (choice)
    IN   origin_count      number of entries in origin buffer (nonnegative integer)
    IN   origin_datatype   datatype of each entry in origin buffer (handle)
    IN   target_rank       rank of target (nonnegative integer)
    IN   target_disp       displacement from window start to the beginning of the target buffer (nonnegative integer)
    IN   target_count      number of entries in target buffer (nonnegative integer)
    IN   target_datatype   datatype of each entry in target buffer (handle)
    IN   win               window object used for communication (handle)

    int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)

    MPI_GET(ORIGIN_ADDR, ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_DISP, TARGET_COUNT, TARGET_DATATYPE, WIN, IERROR)
    <type> ORIGIN_ADDR(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) TARGET_DISP
    INTEGER ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_COUNT, TARGET_DATATYPE, WIN, IERROR

    void MPI::Win::Get(const void *origin_addr, int origin_count, const MPI::Datatype& origin_datatype, int target_rank, MPI::Aint target_disp, int target_count, const MPI::Datatype& target_datatype) const

    Similar to MPI_PUT, except that the direction of data transfer is reversed. Data are copied from the target memory to the origin. The origin_datatype may not specify overlapping entries in the origin buffer. The target buffer must be contained within the target window, and the copied data must fit, without truncation, in the origin buffer.





    6.3.3. Examples




    Example We show how to implement the generic indirect assignment A = B(map), where A, B and map have the same distribution, and map is a permutation. To simplify, we assume a block distribution with equal size blocks.


    SUBROUTINE MAPVALS(A, B, map, m, comm, p) 
    USE MPI 
    INTEGER m, map(m), comm, p 
    REAL A(m), B(m) 
     
    INTEGER otype(p), oindex(m),   & ! used to construct origin datatypes  
         ttype(p), tindex(m),      & ! used to construct target datatypes 
         count(p), total(p),       & 
         sizeofreal, win, ierr 
     
    ! This part does the work that depends on the locations of B. 
    ! Can be reused while this does not change 
     
    CALL MPI_TYPE_EXTENT(MPI_REAL, sizeofreal, ierr) 
    CALL MPI_WIN_CREATE(B, m*sizeofreal, sizeofreal, MPI_INFO_NULL,   & 
                         comm, win, ierr) 
     
    ! This part does the work that depends on the value of map and 
    ! the locations of the arrays. 
    ! Can be reused while these do not change 
     
    ! Compute number of entries to be received from each process 
     
    DO i=1,p 
      count(i) = 0 
    END DO 
    DO i=1,m 
      j = map(i)/m+1 
      count(j) = count(j)+1 
    END DO 
     
    total(1) = 0 
    DO i=2,p 
      total(i) = total(i-1) + count(i-1) 
    END DO 
     
    DO i=1,p 
      count(i) = 0 
    END DO 
     
    ! compute origin and target indices of entries. 
    ! entry i at current process is received from location 
    ! k at process (j-1), where map(i) = (j-1)*m + (k-1), 
    ! j = 1..p and k = 1..m 
     
    DO i=1,m 
      j = map(i)/m+1 
      k = MOD(map(i),m)+1 
      count(j) = count(j)+1 
      ! displacements in an indexed-block datatype are zero-based 
      oindex(total(j) + count(j)) = i-1 
      tindex(total(j) + count(j)) = k-1 
    END DO 
     
    ! create origin and target datatypes for each get operation 
    DO i=1,p 
      CALL MPI_TYPE_CREATE_INDEXED_BLOCK(count(i), 1, oindex(total(i)+1),   & 
                                          MPI_REAL, otype(i), ierr) 
      CALL MPI_TYPE_COMMIT(otype(i), ierr) 
      CALL MPI_TYPE_CREATE_INDEXED_BLOCK(count(i), 1, tindex(total(i)+1),   & 
                                          MPI_REAL, ttype(i), ierr) 
      CALL MPI_TYPE_COMMIT(ttype(i), ierr) 
    END DO 
     
    ! this part does the assignment itself 
    CALL MPI_WIN_FENCE(0, win, ierr) 
    DO i=1,p 
      CALL MPI_GET(A, 1, otype(i), i-1, 0, 1, ttype(i), win, ierr) 
    END DO 
    CALL MPI_WIN_FENCE(0, win, ierr) 
     
    CALL MPI_WIN_FREE(win, ierr) 
    DO i=1,p 
      CALL MPI_TYPE_FREE(otype(i), ierr) 
      CALL MPI_TYPE_FREE(ttype(i), ierr) 
    END DO 
    RETURN 
    END 
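
    The index bookkeeping in this example (count the entries per source process, turn the counts into starting offsets, then fill the origin and target index lists) amounts to a counting sort by owning process. The following C sketch restates that step with zero-based indices and without any MPI calls. The function name and the zero-based conventions are assumptions made for illustration, and map is assumed to hold global indices in the range 0 to p*m - 1.

```c
#include <assert.h>

/* Hypothetical helper: group the m local entries by the process that owns
 * element map[i] of B under a block distribution with m elements per
 * process.  On return, for each process j, the entries
 *   oindex[total[j]] .. oindex[total[j] + count[j] - 1]
 * are the local (origin) indices to be filled from process j, and the
 * matching tindex entries are the offsets inside the block of process j.
 * Returns 0 on success and -1 if a map entry is out of range. */
static int group_by_owner(const int *map, int m, int p,
                          int *count, int *total, int *oindex, int *tindex)
{
    assert(p > 0 && m >= 0);

    for (int j = 0; j < p; j++)
        count[j] = 0;
    for (int i = 0; i < m; i++) {
        if (map[i] < 0 || map[i] / m >= p)
            return -1;
        count[map[i] / m]++;             /* entries owned by each process */
    }

    total[0] = 0;                        /* exclusive prefix sum of count */
    for (int j = 1; j < p; j++)
        total[j] = total[j - 1] + count[j - 1];

    for (int j = 0; j < p; j++)
        count[j] = 0;                    /* reused as a fill cursor */
    for (int i = 0; i < m; i++) {
        int j = map[i] / m;              /* owning process */
        int k = map[i] % m;              /* offset within its block */
        oindex[total[j] + count[j]] = i;
        tindex[total[j] + count[j]] = k;
        count[j]++;                      /* ends equal to the first count */
    }
    return 0;
}
```

    Each slice of oindex and tindex produced this way is what one indexed-block datatype per source process describes, so that a single get per process suffices.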
    


    Example

    A simpler version can be written that does not require that a datatype be built for the target buffer. But, one then needs a separate get call for each entry, as illustrated below. This code is much simpler, but usually much less efficient, for large arrays.


    SUBROUTINE MAPVALS(A, B, map, m, comm, p) 
    USE MPI 
    INTEGER m, map(m), comm, p 
    REAL A(m), B(m) 
    INTEGER sizeofreal, win, ierr 
     
    CALL MPI_TYPE_EXTENT(MPI_REAL, sizeofreal, ierr) 
    CALL MPI_WIN_CREATE(B, m*sizeofreal, sizeofreal, MPI_INFO_NULL,  & 
                        comm, win, ierr) 
     
    CALL MPI_WIN_FENCE(0, win, ierr) 
    DO i=1,m 
      j = map(i)/m 
      k = MOD(map(i),m) 
      CALL MPI_GET(A(i), 1, MPI_REAL, j, k, 1, MPI_REAL, win, ierr) 
    END DO 
    CALL MPI_WIN_FENCE(0, win, ierr) 
    CALL MPI_WIN_FREE(win, ierr) 
    RETURN 
    END 
    





    6.3.4. Accumulate Functions



    It is often useful in a put operation to combine the data moved to the target process with the data that resides at that process, rather than replacing the data there. This will allow, for example, the accumulation of a sum by having all involved processes add their contribution to the sum variable in the memory of one process.

    MPI_ACCUMULATE(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, op, win)
    IN   origin_addr       initial address of buffer (choice)
    IN   origin_count      number of entries in buffer (nonnegative integer)
    IN   origin_datatype   datatype of each buffer entry (handle)
    IN   target_rank       rank of target (nonnegative integer)
    IN   target_disp       displacement from start of window to beginning of target buffer (nonnegative integer)
    IN   target_count      number of entries in target buffer (nonnegative integer)
    IN   target_datatype   datatype of each entry in target buffer (handle)
    IN   op                reduce operation (handle)
    IN   win               window object (handle)

    int MPI_Accumulate(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Op op, MPI_Win win)

    MPI_ACCUMULATE(ORIGIN_ADDR, ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_DISP, TARGET_COUNT, TARGET_DATATYPE, OP, WIN, IERROR)
    <type> ORIGIN_ADDR(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) TARGET_DISP
    INTEGER ORIGIN_COUNT, ORIGIN_DATATYPE,TARGET_RANK, TARGET_COUNT, TARGET_DATATYPE, OP, WIN, IERROR

    void MPI::Win::Accumulate(const void* origin_addr, int origin_count, const MPI::Datatype& origin_datatype, int target_rank, MPI::Aint target_disp, int target_count, const MPI::Datatype& target_datatype, const MPI::Op& op) const

    Accumulate the contents of the origin buffer (as defined by origin_addr, origin_count and origin_datatype) to the buffer specified by arguments target_count and target_datatype, at offset target_disp, in the target window specified by target_rank and win, using the operation op. This is like MPI_PUT except that data is combined into the target area instead of overwriting it.

    Any of the predefined operations for MPI_REDUCE can be used. User-defined functions cannot be used. For example, if op is MPI_SUM, each element of the origin buffer is added to the corresponding element in the target, replacing the former value in the target.

    Each datatype argument must be a predefined datatype or a derived datatype, where all basic components are of the same predefined datatype. Both datatype arguments must be constructed from the same predefined datatype. The operation op applies to elements of that predefined type. target_datatype must not specify overlapping entries, and the target buffer must fit in the target window.

    A new predefined operation, MPI_REPLACE, is defined. It corresponds to the associative function f(a,b) = b; i.e., the current value in the target memory is replaced by the value supplied by the origin.
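
    As a rough model of what an accumulate does to the target buffer, the following C sketch applies an operation element by element to plain arrays of double. It performs no communication and makes no MPI calls; the enum and function names are hypothetical stand-ins for MPI_SUM, MPI_MAX and MPI_REPLACE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for three of the predefined operations. */
enum acc_op { ACC_SUM, ACC_MAX, ACC_REPLACE };

/* Model of the element-wise update target[i] = op(target[i], origin[i]).
 * For ACC_REPLACE this is f(a,b) = b: the value supplied by the origin
 * replaces the current value in the target. */
static void accumulate_model(double *target, const double *origin,
                             size_t count, enum acc_op op)
{
    assert(count == 0 || (target != NULL && origin != NULL));
    for (size_t i = 0; i < count; i++) {
        switch (op) {
        case ACC_SUM:
            target[i] += origin[i];
            break;
        case ACC_MAX:
            if (origin[i] > target[i])
                target[i] = origin[i];
            break;
        case ACC_REPLACE:
            target[i] = origin[i];
            break;
        }
    }
}
```

    The model covers only the arithmetic on a single target buffer; it says nothing about the ordering or concurrency rules that apply to real accumulate calls.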


    Advice to users.

    MPI_PUT is a special case of MPI_ACCUMULATE, with the operation MPI_REPLACE. Note, however, that MPI_PUT and MPI_ACCUMULATE have different constraints on concurrent updates. ( End of advice to users.)

    Example We want to compute B(j) = sum of A(i) over all i with map(i) = j. The arrays A, B and map are distributed in the same manner. We write the simple version.


    SUBROUTINE SUM(A, B, map, m, comm, p) 
    USE MPI 
    INTEGER m, map(m), comm, p, sizeofreal, win, ierr 
    REAL A(m), B(m) 
     
    CALL MPI_TYPE_EXTENT(MPI_REAL, sizeofreal, ierr) 
    CALL MPI_WIN_CREATE(B, m*sizeofreal, sizeofreal, MPI_INFO_NULL,  & 
                        comm, win, ierr) 
     
    CALL MPI_WIN_FENCE(0, win, ierr) 
    DO i=1,m 
      j = map(i)/m 
      k = MOD(map(i),m) 
      CALL MPI_ACCUMULATE(A(i), 1, MPI_REAL, j, k, 1, MPI_REAL,   & 
                          MPI_SUM, win, ierr) 
    END DO 
    CALL MPI_WIN_FENCE(0, win, ierr) 
     
    CALL MPI_WIN_FREE(win, ierr) 
    RETURN 
    END 
    
    This code is identical to the code in Example Examples, except that a call to get has been replaced by a call to accumulate. (Note that, if map is one-to-one, then the code computes B = A(map^(-1)), which is the reverse of the assignment computed in that previous example.) In a similar manner, we can replace the call to get in Example Examples by a call to accumulate, thus performing the computation with only one communication between any two processes.
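
    The effect of the subroutine can be checked against a serial reference. The following C sketch is an assumption-laden model, not a translation of the Fortran code: it ignores the distribution of the arrays, uses 0-based indices, and the name scatter_add is hypothetical.

```c
#include <stddef.h>

/* Serial reference: B[j] receives the sum of all A[i] with map[i] == j.
   The caller initializes B; every map[i] must be a valid index into B. */
static void scatter_add(const float *A, const int *map, size_t m, float *B)
{
    for (size_t i = 0; i < m; i++)
        B[map[i]] += A[i];   /* the role played by an accumulate with MPI_SUM */
}
```

    When map is a permutation, each B[j] receives exactly one contribution, so the result is the assignment B(map(i)) = A(i).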





    6.4. Synchronization Calls



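    RMA communications fall in two categories: active target communication, where the target process takes part in the synchronization by executing RMA synchronization calls, and passive target communication, where only the origin process executes RMA synchronization calls.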


    RMA communication calls with argument win must occur at a process only within an access epoch for win. Such an epoch starts with an RMA synchronization call on win; it proceeds with zero or more RMA communication calls (MPI_PUT, MPI_GET or MPI_ACCUMULATE) on win; it completes with another synchronization call on win. This allows users to amortize one synchronization over multiple data transfers and gives implementors more flexibility in the implementation of RMA operations.

    Distinct access epochs for win at the same process must be disjoint. On the other hand, epochs pertaining to different win arguments may overlap. Local operations or other MPI calls may also occur during an epoch.
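
    The access epoch rules can be summarized as a small piece of bookkeeping. The C sketch below is a hypothetical model (model_win, epoch_begin, rma_call and epoch_end are not MPI names), and MPI does not require an implementation to detect these errors. It only records which call sequences the text allows; a return value of -1 marks a sequence that the text rules out.

```c
/* Per-window state at one process (hypothetical model). */
typedef struct {
    int in_access_epoch;   /* nonzero between the opening and closing sync call */
    int rma_calls;         /* RMA communication calls issued in this epoch */
} model_win;

/* Opening synchronization call: epochs on the same window must be disjoint. */
static int epoch_begin(model_win *w)
{
    if (w->in_access_epoch) return -1;
    w->in_access_epoch = 1;
    w->rma_calls = 0;
    return 0;
}

/* Put, get or accumulate: allowed only inside an access epoch. */
static int rma_call(model_win *w)
{
    if (!w->in_access_epoch) return -1;
    w->rma_calls++;
    return 0;
}

/* Closing synchronization call. */
static int epoch_end(model_win *w)
{
    if (!w->in_access_epoch) return -1;
    w->in_access_epoch = 0;
    return 0;
}
```

    Two model_win objects are independent of each other, which reflects the rule that epochs pertaining to different win arguments may overlap.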

    In active target communication, a target window can be accessed by RMA operations only within an exposure epoch. Such an epoch is started and completed by RMA synchronization calls executed by the target process. Distinct exposure epochs at a process on the same window must be disjoint, but such an exposure epoch may overlap with exposure epochs on other windows or with access epochs for the same or other win arguments. There is a one-to-one matching between access epochs at origin processes and exposure epochs on target processes: RMA operations issued by an origin process for a target window will access that target window during the same exposure epoch if and only if they were issued during the same access epoch.

    In passive target communication the target process does not execute RMA synchronization calls, and there is no concept of an exposure epoch.

    MPI provides three synchronization mechanisms:

      1. The MPI_WIN_FENCE collective synchronization call supports a simple synchronization pattern that is often used in parallel computations: namely a loosely-synchronous model, where global computation phases alternate with global communication phases. This mechanism is most useful for loosely synchronous algorithms where the graph of communicating processes changes very frequently, or where each process communicates with many others.

      This call is used for active target communication. An access epoch at an origin process or an exposure epoch at a target process are started and completed by calls to MPI_WIN_FENCE. A process can access windows at all processes in the group of win during such an access epoch, and the local window can be accessed by all processes in the group of win during such an exposure epoch.
      2. The four functions MPI_WIN_START, MPI_WIN_COMPLETE, MPI_WIN_POST and MPI_WIN_WAIT can be used to restrict synchronization to the minimum: only pairs of communicating processes synchronize, and they do so only when a synchronization is needed to correctly order RMA accesses to a window with respect to local accesses to that same window. This mechanism may be more efficient when each process communicates with few (logical) neighbors, and the communication graph is fixed or changes infrequently.

      These calls are used for active target communication. An access epoch is started at the origin process by a call to MPI_WIN_START and is terminated by a call to MPI_WIN_COMPLETE. The start call has a group argument that specifies the group of target processes for that epoch. An exposure epoch is started at the target process by a call to MPI_WIN_POST and is completed by a call to MPI_WIN_WAIT. The post call has a group argument that specifies the set of origin processes for that epoch.
      3. Finally, shared and exclusive locks are provided by the two functions MPI_WIN_LOCK and MPI_WIN_UNLOCK. Lock synchronization is useful for MPI applications that emulate a shared memory model via MPI calls; e.g., in a ``billboard'' model, where processes can, at random times, access or update different parts of the billboard.

      These two calls provide passive target communication. An access epoch is started by a call to MPI_WIN_LOCK and terminated by a call to MPI_WIN_UNLOCK. Only one target window can be accessed during that epoch with win.

    Figure 1 illustrates the general synchronization pattern for active target communication.


    Figure 1: active target communication. Dashed arrows represent synchronizations (ordering of events).

    The synchronization between post and start ensures that the put call of the origin process does not start until the target process exposes the window (with the post call); the target process will expose the window only after preceding local accesses to the window have completed. The synchronization between complete and wait ensures that the put call of the origin process completes before the window is unexposed (with the wait call). The target process will execute subsequent local accesses to the target window only after the wait has returned.

    Figure 1 shows operations occurring in the natural temporal order implied by the synchronizations: the post occurs before the matching start, and complete occurs before the matching wait. However, such strong synchronization is more than needed for correct ordering of window accesses. The semantics of MPI calls allow weak synchronization, as illustrated in Figure 2 .


    Figure 2: active target communication, with weak synchronization. Dashed arrows represent synchronizations (ordering of events)

    The access to the target window is delayed until the window is exposed, after the post. However the start may complete earlier; the put and complete may also terminate earlier, if put data is buffered by the implementation. The synchronization calls order correctly window accesses, but do not necessarily synchronize other operations. This weaker synchronization semantic allows for more efficient implementations.
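
    One way to picture weak synchronization is an implementation that buffers put data until the window is exposed. The C sketch below models that single strategy; it is not prescribed by MPI, it holds at most one outstanding put, and the names model_target, weak_put and weak_post are hypothetical.

```c
#define WIN_SIZE 4   /* hypothetical window size, in elements */

typedef struct {
    int window[WIN_SIZE];  /* target window memory */
    int exposed;           /* nonzero once the target has posted */
    int buffered;          /* nonzero while a put is held back */
    int buf_disp;          /* displacement of the held put */
    int buf_value;         /* data of the held put */
} model_target;

/* Put returns at once. The window is written only if it is exposed;
   otherwise the data is buffered, so the origin buffer can be reused. */
static void weak_put(model_target *t, int disp, int value)
{
    if (t->exposed) {
        t->window[disp] = value;
    } else {
        t->buffered = 1;
        t->buf_disp = disp;
        t->buf_value = value;
    }
}

/* Post exposes the window; a held put is applied only now. */
static void weak_post(model_target *t)
{
    t->exposed = 1;
    if (t->buffered) {
        t->window[t->buf_disp] = t->buf_value;
        t->buffered = 0;
    }
}
```

    In this model the put issued before the post completes at the origin immediately, while the access to the target window is still delayed until the post.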

    Figure 3 illustrates the general synchronization pattern for passive target communication. The first origin process communicates data to the second origin process, through the memory of the target process; the target process is not explicitly involved in the communication.


    Figure 3: passive target communication. Dashed arrows represent synchronizations (ordering of events).

    The lock and unlock calls ensure that the two RMA accesses do not occur concurrently. However, they do not ensure that the put by origin 1 will precede the get by origin 2.





    6.4.1. Fence



    MPI_WIN_FENCE(assert, win)
    IN  assert  program assertion (integer)
    IN  win     window object (handle)

    int MPI_Win_fence(int assert, MPI_Win win)

    MPI_WIN_FENCE(ASSERT, WIN, IERROR)
    INTEGER ASSERT, WIN, IERROR

    void MPI::Win::Fence(int assert) const

    The MPI call MPI_WIN_FENCE(assert, win) synchronizes RMA calls on win. The call is collective on the group of win. All RMA operations on win originating at a given process and started before the fence call will complete at that process before the fence call returns. They will be completed at their target before the fence call returns at the target. RMA operations on win started by a process after the fence call returns will access their target window only after MPI_WIN_FENCE has been called by the target process.

    The call completes an RMA access epoch if it was preceded by another fence call and the local process issued RMA communication calls on win between these two calls. The call completes an RMA exposure epoch if it was preceded by another fence call and the local window was the target of RMA accesses between these two calls. The call starts an RMA access epoch if it is followed by another fence call and by RMA communication calls issued between these two fence calls. The call starts an exposure epoch if it is followed by another fence call and the local window is the target of RMA accesses between these two fence calls. Thus, the fence call is equivalent to calls to a subset of post, start, complete, wait.
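
    The four rules above can be read as a decision table. The C sketch below encodes them; the ACT_* labels and the function fence_equivalent are hypothetical and only name the post, start, complete and wait roles mentioned in the text. Each argument is assumed to be nonzero only when the whole condition in the text holds, including the existence of the neighboring fence call.

```c
/* Hypothetical labels for the roles a fence call can play. */
enum { ACT_COMPLETE = 1, ACT_WAIT = 2, ACT_START = 4, ACT_POST = 8 };

/* issued_before: the process issued RMA calls on win since the previous fence.
   target_before: the local window was an RMA target since the previous fence.
   issues_after:  the process issues RMA calls on win before the next fence.
   target_after:  the local window is an RMA target before the next fence. */
static int fence_equivalent(int issued_before, int target_before,
                            int issues_after, int target_after)
{
    int acts = 0;
    if (issued_before) acts |= ACT_COMPLETE;  /* completes an access epoch */
    if (target_before) acts |= ACT_WAIT;      /* completes an exposure epoch */
    if (issues_after)  acts |= ACT_START;     /* starts an access epoch */
    if (target_after)  acts |= ACT_POST;      /* starts an exposure epoch */
    return acts;
}
```

    A result of 0 describes a fence call that neither starts nor completes any epoch at the calling process.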

    A fence call usually entails a barrier synchronization: a process completes a call to MPI_WIN_FENCE only after all other processes in the group entered their matching call. However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a call with assert = MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.

    The assert argument is used to provide assertions on the context of the call that may be used for various optimizations. This is described in Section Assertions. A value of assert = 0 is always valid.


    Advice to users.

    Calls to MPI_WIN_FENCE should both precede and follow calls to put, get or accumulate that are synchronized with fence calls. ( End of advice to users.)





    6.4.2. General Active Target Synchronization



    MPI_WIN_START(group, assert, win)
    IN  group   group of target processes (handle)
    IN  assert  program assertion (integer)
    IN  win     window object (handle)

    int MPI_Win_start(MPI_Group group, int assert, MPI_Win win)

    MPI_WIN_START(GROUP, ASSERT, WIN, IERROR)
    INTEGER GROUP, ASSERT, WIN, IERROR

    void MPI::Win::Start(const MPI::Group& group, int assert) const

    Starts an RMA access epoch for win. RMA calls issued on win during this epoch must access only windows at processes in group. Each process in group must issue a matching call to MPI_WIN_POST. RMA accesses to each target window will be delayed, if necessary, until the target process has executed the matching call to MPI_WIN_POST. MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed, but is not required to.

    The assert argument is used to provide assertions on the context of the call that may be used for various optimizations. This is described in Section Assertions. A value of assert = 0 is always valid.

    MPI_WIN_COMPLETE(win)
    IN  win  window object (handle)

    int MPI_Win_complete(MPI_Win win)

    MPI_WIN_COMPLETE(WIN, IERROR)
    INTEGER WIN, IERROR

    void MPI::Win::Complete() const

    Completes an RMA access epoch on win started by a call to MPI_WIN_START. All RMA communication calls issued on win during this epoch will have completed at the origin when the call returns.

    MPI_WIN_COMPLETE enforces completion of preceding RMA calls at the origin, but not at the target. A put or accumulate call may not have completed at the target when it has completed at the origin.

    Consider the sequence of calls in the example below.
    Example


    MPI_Win_start(group, assert, win);
    MPI_Put(...,win); 
    MPI_Win_complete(win); 
    

    The call to MPI_WIN_COMPLETE does not return until the put call has completed at the origin; and the target window will be accessed by the put operation only after the call to MPI_WIN_START has matched a call to MPI_WIN_POST by the target process. This still leaves much choice to implementors. The call to MPI_WIN_START can block until the matching call to MPI_WIN_POST occurs at all target processes. One can also have implementations where the call to MPI_WIN_START is nonblocking, but the call to MPI_PUT blocks until the matching call to MPI_WIN_POST has occurred; or implementations where the first two calls are nonblocking, but the call to MPI_WIN_COMPLETE blocks until the call to MPI_WIN_POST has occurred; or even implementations where all three calls can complete before any target process has called MPI_WIN_POST --- the data put must be buffered, in this last case, so as to allow the put to complete at the origin ahead of its completion at the target. However, once the call to MPI_WIN_POST is issued, the sequence above must complete, without further dependencies.

    MPI_WIN_POST(group, assert, win)
    IN  group   group of origin processes (handle)
    IN  assert  program assertion (integer)
    IN  win     window object (handle)

    int MPI_Win_post(MPI_Group group, int assert, MPI_Win win)

    MPI_WIN_POST(GROUP, ASSERT, WIN, IERROR)
    INTEGER GROUP, ASSERT, WIN, IERROR

    void MPI::Win::Post(const MPI::Group& group, int assert) const

    Starts an RMA exposure epoch for the local window associated with win. Only processes in group should access the window with RMA calls on win during this epoch. Each process in group must issue a matching call to MPI_WIN_START. MPI_WIN_POST does not block.

    MPI_WIN_WAIT(win)
    IN  win  window object (handle)

    int MPI_Win_wait(MPI_Win win)

    MPI_WIN_WAIT(WIN, IERROR)
    INTEGER WIN, IERROR

    void MPI::Win::Wait() const

    Completes an RMA exposure epoch started by a call to MPI_WIN_POST on win. This call matches calls to MPI_WIN_COMPLETE(win) issued by each of the origin processes that were granted access to the window during this epoch. The call to MPI_WIN_WAIT will block until all matching calls to MPI_WIN_COMPLETE have occurred. This guarantees that all these origin processes have completed their RMA accesses to the local window. When the call returns, all these RMA accesses will have completed at the target window.

    Figure 4 illustrates the use of these four functions.


    Figure 4: active target communication. Dashed arrows represent synchronizations and solid arrows represent data transfer.

    Process 0 puts data in the windows of processes 1 and 2 and process 3 puts data in the window of process 2. Each start call lists the ranks of the processes whose windows will be accessed; each post call lists the ranks of the processes that access the local window. The figure illustrates a possible timing for the events, assuming strong synchronization; in a weak synchronization, the start, put or complete calls may occur ahead of the matching post calls.

    MPI_WIN_TEST(win, flag)
    IN   win   window object (handle)
    OUT  flag  success flag (logical)

    int MPI_Win_test(MPI_Win win, int *flag)

    MPI_WIN_TEST(WIN, FLAG, IERROR)
    INTEGER WIN, IERROR
    LOGICAL FLAG

    bool MPI::Win::Test() const

    This is the nonblocking version of MPI_WIN_WAIT. It returns flag = true if MPI_WIN_WAIT would return, and flag = false otherwise. The effect of a return of MPI_WIN_TEST with flag = true is the same as the effect of a return of MPI_WIN_WAIT. If flag = false is returned, then the call has no visible effect.

    MPI_WIN_TEST should be invoked only where MPI_WIN_WAIT can be invoked. Once the call has returned flag = true, it must not be invoked anew, until the window is posted anew.

    Assume that window win is associated with a ``hidden'' communicator wincomm, used for communication by the processes of win. The rules for matching post and start calls and for matching complete and wait calls can be derived from the rules for matching sends and receives, by considering the following (partial) model implementation.

    MPI_WIN_POST(group,0,win)
      initiate a nonblocking send with tag tag0 to each process in group, using wincomm. No need to wait for the completion of these sends.
    MPI_WIN_START(group,0,win)
      initiate a nonblocking receive with tag tag0 from each process in group, using wincomm. An RMA access to a window in target process i is delayed until the receive from i is completed.
    MPI_WIN_COMPLETE(win)
      initiate a nonblocking send with tag tag1 to each process in the group of the preceding start call. No need to wait for the completion of these sends.
    MPI_WIN_WAIT(win)
      initiate a nonblocking receive with tag tag1 from each process in the group of the preceding post call. Wait for the completion of all receives.

    No races can occur in a correct program: each of the sends matches a unique receive, and vice-versa.
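
    The model implementation can be exercised with a small serial simulation that only counts sends and receives. The C sketch below is such a simulation under stated assumptions: four processes, no real communication, and hypothetical names throughout (balance, model_post, model_start, model_complete, model_wait, model_access_delayed, model_all_matched). The group of the preceding start or post call is passed explicitly to the complete and wait models.

```c
#include <string.h>

#define NPROC 4                 /* hypothetical process count for the model */
enum { TAG0 = 0, TAG1 = 1 };    /* tag0: post/start, tag1: complete/wait */

/* balance[src][dst][tag]: sends from src to dst minus receives posted by dst.
   Zero means every send has met exactly one receive. */
static int balance[NPROC][NPROC][2];

static void model_reset(void) { memset(balance, 0, sizeof balance); }

static void model_post(int me, const int *origins, int n)      /* sends tag0 */
{ for (int i = 0; i < n; i++) balance[me][origins[i]][TAG0]++; }

static void model_start(int me, const int *targets, int n)     /* receives tag0 */
{ for (int i = 0; i < n; i++) balance[targets[i]][me][TAG0]--; }

static void model_complete(int me, const int *targets, int n)  /* sends tag1 */
{ for (int i = 0; i < n; i++) balance[me][targets[i]][TAG1]++; }

static void model_wait(int me, const int *origins, int n)      /* receives tag1 */
{ for (int i = 0; i < n; i++) balance[origins[i]][me][TAG1]--; }

/* Meaningful after the origin's start call: the access is delayed while the
   tag0 receive has not yet met the send issued by the target's post call. */
static int model_access_delayed(int origin, int target)
{ return balance[target][origin][TAG0] < 0; }

/* Returns 1 when every send has been matched by exactly one receive. */
static int model_all_matched(void)
{
    for (int s = 0; s < NPROC; s++)
        for (int d = 0; d < NPROC; d++)
            for (int t = 0; t < 2; t++)
                if (balance[s][d][t] != 0) return 0;
    return 1;
}
```

    For a pattern in which process 0 accesses the windows of processes 1 and 2 and process 3 accesses the window of process 2, the model reports all sends and receives matched once every post, start, complete and wait has been issued, regardless of whether a start is issued before or after its matching post.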


    Rationale.

    The design for general active target synchronization requires the user to provide complete information on the communication pattern, at each end of a communication link: each origin specifies a list of targets, and each target specifies a list of origins. This provides maximum flexibility (hence, efficiency) for the implementor: each synchronization can be initiated by either side, since each ``knows'' the identity of the other. This also provides maximum protection from possible races. On the other hand, the design requires more information than RMA needs: in general, it is sufficient for the origin to know the rank of the target, but not vice versa. Users who want more ``anonymous'' communication will be required to use the fence or lock mechanisms. ( End of rationale.)

    Advice to users.

    Assume a communication pattern that is represented by a directed graph G = <V, E>, where V = {0, ..., n-1} and (i, j) is in E if origin process i accesses the window at target process j. Then each process i issues a call to MPI_WIN_POST(ingroup_i, ...), followed by a call to MPI_WIN_START(outgroup_i, ...), where outgroup_i = {j : (i, j) in E} and ingroup_i = {j : (j, i) in E}. A call is a no-op, and can be skipped, if the group argument is empty. After the communication calls, each process that issued a start will issue a complete. Finally, each process that issued a post will issue a wait.

    Note that each process may call MPI_WIN_POST or MPI_WIN_START with a group argument that has different members. ( End of advice to users.)
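
    The two rank sets of a process can be computed directly from an edge list. The C sketch below is illustrative: the edge type and the functions outgroup and ingroup are hypothetical helpers, the edge list is assumed to contain no duplicates, and the output array must have room for nedges entries.

```c
#include <stddef.h>

/* Edge (origin, target): origin accesses the window at target. */
typedef struct { int origin; int target; } edge;

/* Ranks whose windows process p accesses (group for the start call).
   Returns the number of ranks written to out. */
static size_t outgroup(const edge *E, size_t nedges, int p, int *out)
{
    size_t n = 0;
    for (size_t e = 0; e < nedges; e++)
        if (E[e].origin == p) out[n++] = E[e].target;
    return n;
}

/* Ranks that access the window of process p (group for the post call). */
static size_t ingroup(const edge *E, size_t nedges, int p, int *out)
{
    size_t n = 0;
    for (size_t e = 0; e < nedges; e++)
        if (E[e].target == p) out[n++] = E[e].origin;
    return n;
}
```

    A process for which a function returns 0 can skip the corresponding call. The rank lists would typically be turned into group handles, for instance with MPI_GROUP_INCL, before being passed to the post and start calls.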





    6.4.3. Lock



    MPI_WIN_LOCK(lock_type, rank, assert, win)
    IN  lock_type  either MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED (state)
    IN  rank       rank of locked window (nonnegative integer)
    IN  assert     program assertion (integer)
    IN  win        window object (handle)

    int MPI_Win_lock(int lock_type, int rank, int assert, MPI_Win win)

    MPI_WIN_LOCK(LOCK_TYPE, RANK, ASSERT, WIN, IERROR)
    INTEGER LOCK_TYPE, RANK, ASSERT, WIN, IERROR

    void MPI::Win::Lock(int lock_type, int rank, int assert) const

    Starts an RMA access epoch. Only the window at the process with rank rank can be accessed by RMA operations on win during that epoch.

    MPI_WIN_UNLOCK(rank, win)
    IN  rank  rank of window (nonnegative integer)
    IN  win   window object (handle)

    int MPI_Win_unlock(int rank, MPI_Win win)

    MPI_WIN_UNLOCK(RANK, WIN, IERROR)
    INTEGER RANK, WIN, IERROR

    void MPI::Win::Unlock(int rank) const

    Completes an RMA access epoch started by a call to MPI_WIN_LOCK(...,win). RMA operations issued during this period will have completed both at the origin and at the target when the call returns.

    Locks are used to protect accesses to the locked target window effected by RMA calls issued between the lock and unlock call, and to protect local load/store accesses to a locked local window executed between the lock and unlock call. Accesses that are protected by an exclusive lock will not be concurrent at the window site with other accesses to the same window that are lock protected. Accesses that are protected by a shared lock will not be concurrent at the window site with accesses protected by an exclusive lock to the same window.
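
    The two rules on concurrency amount to a compatibility test between lock types. The C sketch below states it as a predicate; lock_kind, LOCK_SHARED, LOCK_EXCLUSIVE and may_be_concurrent are hypothetical names that stand in for MPI_LOCK_SHARED and MPI_LOCK_EXCLUSIVE, whose actual values are defined by the implementation.

```c
/* Hypothetical stand-ins for MPI_LOCK_SHARED and MPI_LOCK_EXCLUSIVE. */
typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_kind;

/* Two lock-protected accesses to the same window may be concurrent at the
   window site only if both are protected by shared locks. */
static int may_be_concurrent(lock_kind a, lock_kind b)
{
    return a == LOCK_SHARED && b == LOCK_SHARED;
}
```

    The predicate says nothing about accesses to different windows, which the locking rules do not order.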

    It is erroneous to have a window locked and exposed (in an exposure epoch) concurrently. I.e., a process may not call MPI_WIN_LOCK to lock a target window if the target process has called MPI_WIN_POST and has not yet called MPI_WIN_WAIT; it is erroneous to call MPI_WIN_POST while the local window is locked.


    Rationale.

    An alternative is to require MPI to enforce mutual exclusion between exposure epochs and locking periods. But this would add overhead to the common case, where locks and active target synchronization do not interact, only to support the rare interactions between the two mechanisms. The programming style that we encourage here is that a set of windows is used with only one synchronization mechanism at a time, with shifts from one mechanism to another being rare and involving global synchronization. ( End of rationale.)

    Advice to users.

    Users need to use explicit synchronization code in order to enforce mutual exclusion between locking periods and exposure epochs on a window. ( End of advice to users.)

    Implementors may restrict the use of RMA communication that is synchronized by lock calls to windows in memory allocated by MPI_ALLOC_MEM (Section Memory Allocation). Locks can be used portably only in such memory.


    Rationale.

    The implementation of passive target communication when memory is not shared requires an asynchronous agent. Such an agent can be implemented more easily, and can achieve better performance, if restricted to specially allocated memory. It can be avoided altogether if shared memory is used. It seems natural to impose restrictions that allow one to use shared memory for third-party communication in shared memory machines.

    The downside of this decision is that passive target communication cannot be used without taking advantage of nonstandard Fortran features: namely, the availability of C-like pointers; these are not supported by some Fortran compilers (g77 and Windows/NT compilers, at the time of writing). Also, passive target communication cannot be portably targeted to COMMON blocks, or other statically declared Fortran arrays. ( End of rationale.)
    Consider the sequence of calls in the example below.
    Example


    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, assert, win);
    MPI_Put(..., rank, ..., win);
    MPI_Win_unlock(rank, win);
    

    The call to MPI_WIN_UNLOCK will not return until the put transfer has completed at the origin and at the target. This still leaves much freedom to implementors. The call to MPI_WIN_LOCK may block until an exclusive lock on the window is acquired; or, the call MPI_WIN_LOCK may not block, while the call to MPI_PUT blocks until a lock is acquired; or, the first two calls may not block, while MPI_WIN_UNLOCK blocks until a lock is acquired --- the update of the target window is then postponed until the call to MPI_WIN_UNLOCK occurs. However, if the call to MPI_WIN_LOCK is used to lock a local window, then the call must block until the lock is acquired, since the lock may protect local load/store accesses to the window issued after the lock call returns.





    6.4.4. Assertions



    The assert argument in the calls MPI_WIN_POST, MPI_WIN_START, MPI_WIN_FENCE and MPI_WIN_LOCK is used to provide assertions on the context of the call that may be used to optimize performance. The assert argument does not change program semantics if it provides correct information on the program --- it is erroneous to provide incorrect information. Users may always provide assert = 0 to indicate a general case, where no guarantees are made.


    Advice to users.

    Many implementations may not take advantage of the information in assert; some of the information is relevant only for noncoherent, shared memory machines. Users should consult their implementation manual to find which information is useful on each system. On the other hand, applications that provide correct assertions whenever applicable are portable and will take advantage of assertion specific optimizations, whenever available. ( End of advice to users.)

    Advice to implementors.

    Implementations can always ignore the assert argument. Implementors should document which assert values are significant on their implementation. ( End of advice to implementors.)
    assert is the bit-vector OR of zero or more of the following integer constants: MPI_MODE_NOCHECK, MPI_MODE_NOSTORE, MPI_MODE_NOPUT, MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED. The significant options are listed below, for each call.


    Advice to users.

    C/C++ users can use bit vector or (|) to combine these constants; Fortran 90 users can use the bit-vector IOR intrinsic. Fortran 77 users can use (nonportably) bit vector IOR on systems that support it. Alternatively, Fortran users can portably use integer addition to OR the constants (each constant should appear at most once in the addition!). ( End of advice to users.)
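
    The caveat on integer addition can be seen with a few lines of C. The constants below are hypothetical: the real MPI_MODE_* values are defined by the implementation's header, and the sketch assumes that each occupies a distinct single bit. The helper names combine_or and combine_add are likewise illustrative.

```c
/* Hypothetical values; real programs use the MPI_MODE_* constants. */
enum {
    MODE_NOCHECK   = 1 << 0,
    MODE_NOSTORE   = 1 << 1,
    MODE_NOPUT     = 1 << 2,
    MODE_NOPRECEDE = 1 << 3,
    MODE_NOSUCCEED = 1 << 4
};

/* Bit-vector OR: safe even if a constant is repeated. */
static int combine_or(int a, int b)  { return a | b; }

/* Integer addition: equals the OR only if no constant is repeated. */
static int combine_add(int a, int b) { return a + b; }
```

    With these values, adding MODE_NOPUT to itself yields the bit pattern of a different constant, whereas the OR leaves it unchanged. This is why each constant should appear at most once when addition is used.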

    MPI_WIN_START:
    MPI_MODE_NOCHECK --- the matching calls to MPI_WIN_POST have already completed on all target processes when the call to MPI_WIN_START is made. The nocheck option can be specified in a start call if and only if it is specified in each matching post call. This is similar to the optimization of ``ready-send'' that may save a handshake when the handshake is implicit in the code. (However, ready-send is matched by a regular receive, whereas both start and post must specify the nocheck option.)

    MPI_WIN_POST:
    MPI_MODE_NOCHECK --- the matching calls to MPI_WIN_START have not yet occurred on any origin processes when the call to MPI_WIN_POST is made. The nocheck option can be specified by a post call if and only if it is specified by each matching start call.
    MPI_MODE_NOSTORE --- the local window was not updated by local stores (or local get or receive calls) since last synchronization. This may avoid the need for cache synchronization at the post call.
    MPI_MODE_NOPUT --- the local window will not be updated by put or accumulate calls after the post call, until the ensuing (wait) synchronization. This may avoid the need for cache synchronization at the wait call.

    MPI_WIN_FENCE:
    MPI_MODE_NOSTORE --- the local window was not updated by local stores (or local get or receive calls) since last synchronization.
    MPI_MODE_NOPUT --- the local window will not be updated by put or accumulate calls after the fence call, until the ensuing (fence) synchronization.
    MPI_MODE_NOPRECEDE --- the fence does not complete any sequence of locally issued RMA calls. If this assertion is given by any process in the window group, then it must be given by all processes in the group.
    MPI_MODE_NOSUCCEED --- the fence does not start any sequence of locally issued RMA calls. If the assertion is given by any process in the window group, then it must be given by all processes in the group.

    MPI_WIN_LOCK:
    MPI_MODE_NOCHECK --- no other process holds, or will attempt to acquire a conflicting lock, while the caller holds the window lock. This is useful when mutual exclusion is achieved by other means, but the coherence operations that may be attached to the lock and unlock calls are still required.



    Advice to users.

    Note that the nostore and noprecede flags provide information on what happened before the call; the noput and nosucceed flags provide information on what will happen after the call. ( End of advice to users.)





    6.4.5. Miscellaneous Clarifications



    Once an RMA routine completes, it is safe to free any opaque objects passed as arguments to that routine. For example, the datatype argument of an MPI_PUT call can be freed as soon as the call returns, even though the communication may not be complete.

    As in message passing, datatypes must be committed before they can be used in RMA communication.





    6.5. Examples




    Example The following example shows a generic loosely synchronous, iterative code, using fence synchronization. The window at each process consists of array A, which contains the origin and target buffers of the put calls.


    ... 
    while(!converged(A)){ 
      update(A); 
      MPI_Win_fence(MPI_MODE_NOPRECEDE, win); 
      for(i=0; i < toneighbors; i++) 
        MPI_Put(&frombuf[i], 1, fromtype[i], toneighbor[i], 
                             todisp[i], 1, totype[i], win); 
      MPI_Win_fence((MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED), win); 
      } 
    
    The same code could be written with get, rather than put. Note that, during the communication phase, each window is concurrently read (as origin buffer of puts) and written (as target buffer of puts). This is OK, provided that there is no overlap between the target buffer of a put and another communication buffer.


    Example Same generic example, with more computation/communication overlap. We assume that the update phase is broken into two subphases: the first, where the ``boundary,'' which is involved in communication, is updated, and the second, where the ``core,'' which neither uses nor provides communicated data, is updated.

    ... 
    while(!converged(A)){ 
      update_boundary(A); 
      MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win); 
      for(i=0; i < fromneighbors; i++) 
        MPI_Get(&tobuf[i], 1, totype[i], fromneighbor[i], 
                        fromdisp[i], 1, fromtype[i], win); 
      update_core(A); 
      MPI_Win_fence(MPI_MODE_NOSUCCEED, win); 
      } 
    
    The get communication can be concurrent with the core update, since they do not access the same locations, and the local update of the origin buffer by the get call can be concurrent with the local update of the core by the update_core call. In order to get similar overlap with put communication we would need to use separate windows for the core and for the boundary. This is required because we do not allow local stores to be concurrent with puts on the same, or on overlapping, windows.


    Example Same code as in the first example of this section, rewritten using post-start-complete-wait.

    ... 
    while(!converged(A)){ 
      update(A); 
      MPI_Win_post(fromgroup, 0, win); 
      MPI_Win_start(togroup, 0, win); 
      for(i=0; i < toneighbors; i++) 
        MPI_Put(&frombuf[i], 1, fromtype[i], toneighbor[i], 
                             todisp[i], 1, totype[i], win); 
      MPI_Win_complete(win); 
      MPI_Win_wait(win); 
      } 
    


    Example Same example, with split phases, as in the second example of this section.

    ... 
    while(!converged(A)){ 
      update_boundary(A); 
      MPI_Win_post(togroup, MPI_MODE_NOPUT, win); 
      MPI_Win_start(fromgroup, 0, win); 
      for(i=0; i < fromneighbors; i++) 
        MPI_Get(&tobuf[i], 1, totype[i], fromneighbor[i], 
                       fromdisp[i], 1, fromtype[i], win); 
      update_core(A); 
      MPI_Win_complete(win); 
      MPI_Win_wait(win); 
      } 
    


    Example A checkerboard, or double buffer communication pattern, that allows more computation/communication overlap. Array A0 is updated using values of array A1, and vice versa. We assume that communication is symmetric: if process A gets data from process B, then process B gets data from process A. Window wini consists of array Ai.

    ... 
    if (!converged(A0,A1)) 
      MPI_Win_post(neighbors, (MPI_MODE_NOCHECK | MPI_MODE_NOPUT), win0); 
    MPI_Barrier(comm0); 
    /* the barrier is needed because the start call inside the 
    loop uses the nocheck option */ 
    while(!converged(A0, A1)){ 
      /* communication on A0 and computation on A1 */ 
      update2(A1, A0); /* local update of A1 that depends on A0 (and A1) */ 
      MPI_Win_start(neighbors, MPI_MODE_NOCHECK, win0); 
      for(i=0; i < neighbors; i++) 
        MPI_Get(&tobuf0[i], 1, totype0[i], neighbor[i], 
                   fromdisp0[i], 1, fromtype0[i], win0); 
      update1(A1); /* local update of A1 that is 
                      concurrent with communication that updates A0 */  
      MPI_Win_post(neighbors, (MPI_MODE_NOCHECK | MPI_MODE_NOPUT), win1); 
      MPI_Win_complete(win0); 
      MPI_Win_wait(win0); 
     
      /* communication on A1 and computation on A0 */ 
      update2(A0, A1); /* local update of A0 that depends on A1 (and A0)*/ 
      MPI_Win_start(neighbors, MPI_MODE_NOCHECK, win1); 
      for(i=0; i < neighbors; i++) 
        MPI_Get(&tobuf1[i], 1, totype1[i], neighbor[i], 
                    fromdisp1[i], 1, fromtype1[i], win1); 
      update1(A0); /* local update of A0 that depends on A0 only, 
                     concurrent with communication that updates A1 */ 
      if (!converged(A0,A1)) 
        MPI_Win_post(neighbors, (MPI_MODE_NOCHECK | MPI_MODE_NOPUT), win0); 
      MPI_Win_complete(win1); 
      MPI_Win_wait(win1); 
      } 
    
    A process posts the local window associated with win0 before it completes RMA accesses to the remote windows associated with win1. When the wait