Using MPI and Using Advanced MPI

Examples


Using MPI, 3rd Edition at The MIT Press
Using Advanced MPI at the MIT Press
Picture of Book Covers
These two books, published in 2014, show how to use MPI, the Message Passing Interface, to write parallel programs. Using MPI, now in its 3rd edition, provides an introduction to using MPI, including examples of the parallel computing code needed for simulations of partial differential equations and n-body problems. Using Advanced MPI covers additional features of MPI, including parallel I/O, one-sided or remote memory access communcication, and using threads and shared memory from MPI.

What is MPI?

MPI, the Message-Passing Interface, is an application programmer interface (API) for programming parallel computers. It was first released in 1992 and transformed scientific parallel computing. Today, MPI is widely using on everything from laptops (where it makes it easy to develop and debug) to the world's largest and fastest computers. Among the reasons for the the success of MPI is its focus on performance, scalability, and support for building tools and libraries that extend the power of MPI.

Examples

Errata

News and Reviews

  • BLOG entry by Torsten Hoefler, one of the authors of Using Advanced MPI.

Tables of Contents

Using MPI 3rd EditionUsing Advanced MPI
   Series Forewordxiii
   Preface to the Third Editionxv
   Preface to the Second Editionxix
   Preface to the First Editionxxi
1Background1
1.1   Why Parallel Computing?1
1.2   Obstacles to Progress2
1.3   Why Message Passing?3
1.3.1   Parallel Computational Models3
1.3.2   Advantages of the Message-Passing Model9
1.4   Evolution of Message-Passing Systems10
1.5   The MPI Forum11
2Introduction to MPI13
2.1   Goal13
2.2   What Is MPI?13
2.3   Basic MPI Concepts14
2.4   Other Interesting Features of MPI18
2.5   Is MPI Large or Small?20
2.6   Decisions Left to the Implementor21
3Using MPI in Simple Programs23
3.1   A First MPI Program23
3.2   Running Your First MPI Program28
3.3   A First MPI Program in C29
3.4   Using MPI from Other Languages29
3.5   Timing MPI Programs31
3.6   A Self-Scheduling Example: Matrix-Vector Multiplication32
3.7   Studying Parallel Performance38
3.7.1   Elementary Scalability Calculations39
3.7.2   Gathering Data on Program Execution41
3.7.3   Instrumenting a Parallel Program with MPE Logging42
3.7.4   Events and States43
3.7.5   Instrumenting the Matrix-Matrix Multiply Program43
3.7.6   Notes on Implementation of Logging47
3.7.7   Graphical Display of Logfiles48
3.8   Using Communicators49
3.9   Another Way of Forming New Communicators55
3.10   A Handy Graphics Library for Parallel Programs57
3.11   Common Errors and Misunderstandings60
3.12   Summary of a Simple Subset of MPI62
3.13   Application: Computational Fluid Dynamics62
3.13.1   Parallel Formulation63
3.13.2   Parallel Implementation65
4Intermediate MPI69
4.1   The Poisson Problem70
4.2   Topologies73
4.3   A Code for the Poisson Problem81
4.4   Using Nonblocking Communications91
4.5   Synchronous Sends and “Safe” Programs94
4.6   More on Scalability95
4.7   Jacobi with a 2-D Decomposition98
4.8   An MPI Derived Datatype100
4.9   Overlapping Communication and Computation101
4.10   More on Timing Programs105
4.11   Three Dimensions106
4.12   Common Errors and Misunderstandings107
4.13   Application: Nek5000/NekCEM108
5Fun with Datatypes113
5.1   MPI Datatypes113
5.1.1   Basic Datatypes and Concepts113
5.1.2   Derived Datatypes116
5.1.3   Understanding Extents118
5.2   The N-Body Problem119
5.2.1   Gather120
5.2.2   Nonblocking Pipeline124
5.2.3   Moving Particles between Processes127
5.2.4   Sending Dynamically Allocated Data132
5.2.5   User-Controlled Data Packing134
5.3   Visualizing the Mandelbrot Set136
5.4   Gaps in Datatypes146
5.5   More on Datatypes for Structures148
5.6   Deprecated and Removed Functions149
5.7   Common Errors and Misunderstandings150
5.8   Application: Cosmological Large-Scale Structure Formation152
5.3.1   Sending Arrays of Structures144
6Parallel Libraries155
6.1   Motivation155
6.1.1   The Need for Parallel Libraries155
6.1.2   Common Deficiencies of Early Message-Passing Systems156
6.1.3   Review of MPI Features That Support Libraries158
6.2   A First MPI Library161
6.3   Linear Algebra on Grids170
6.3.1   Mappings and Logical Grids170
6.3.2   Vectors and Matrices175
6.3.3   Components of a Parallel Library177
6.4   The LINPACK Benchmark in MPI179
6.5   Strategies for Library Building183
6.6   Examples of Libraries184
6.7   Application: Nuclear Green’s Function Monte Carlo185
7Other Features of MPI189
7.1   Working with Global Data189
7.1.1   Shared Memory, Global Data, and Distributed Memory189
7.1.2   A Counter Example190
7.1.3   The Shared Counter Using Polling Instead of an Extra Process193
7.1.4   Fairness in Message Passing196
7.1.5   Exploiting Request-Response Message Patterns198
7.2   Advanced Collective Operations201
7.2.1   Data Movement201
7.2.2   Collective Computation201
7.2.3   Common Errors and Misunderstandings206
7.3   Intercommunicators208
7.4   Heterogeneous Computing216
7.5   Hybrid Programming with MPI and OpenMP217
7.6   The MPI Profiling Interface218
7.6.1   Finding Buffering Problems221
7.6.2   Finding Load Imbalances223
7.6.3   Mechanics of Using the Profiling Interface223
7.7   Error Handling226
7.7.1   Error Handlers226
7.7.2   Example of Error Handling229
7.7.3   User-Defined Error Handlers229
7.7.4   Terminating MPI Programs232
7.7.5   Common Errors and Misunderstandings232
7.8   The MPI Environment234
7.8.1   Processor Name236
7.8.2   Is MPI Initialized?236
7.9   Determining the Version of MPI237
7.10   Other Functions in MPI239
7.11   Application: No-Core Configuration Interaction Calculations in Nuclear Physics240
8Understanding How MPI Implementations Work245
8.1   Introduction245
8.1.1   Sending Data245
8.1.2   Receiving Data246
8.1.3   Rendezvous Protocol246
8.1.4   Matching Protocols to MPI’s Send Modes247
8.1.5   Performance Implications248
8.1.6   Alternative MPI Implementation Strategies249
8.1.7   Tuning MPI Implementations249
8.2   How Difficult Is MPI to Implement?249
8.3   Device Capabilities and the MPI Library Definition250
8.4   Reliability of Data Transfer251
9Comparing MPI with Sockets253
9.1   Process Startup and Shutdown255
9.2   Handling Faults257
10Wait! There’s More!259
10.1   Beyond MPI-1259
10.2   Using Advanced MPI260
10.3   Will There Be an MPI-4?261
10.4   Beyond Message Passing Altogether261
10.5   Final Words262
   Glossary of Selected Terms263
A   The MPE Multiprocessing Environment273
   A.1 MPE Logging273
   A.2 MPE Graphics275
   A.3 MPE Helpers276
B   MPI Resources Online279
C   Language Details281
   C.1 Arrays in C and Fortran281
   C.1.1 Column and Row Major Ordering281
   C.1.2 Meshes vs. Matrices281
   C.1.3 Higher Dimensional Arrays282
   C.2 Aliasing285
   References287
   Subject Index301
   Function and Term Index305
 
   Series Forewordxv
   Forewordxvii
   Prefacexix
1Introduction1
1.1   MPI-1 and MPI-21
1.2   MPI-32
1.3   Parallelism and MPI3
1.3.1   Conway’s Game of Life4
1.3.2   Poisson Solver5
1.4   Passing Hints to the MPI Implementation with MPI_Info11
1.4.1   Motivation, Description, and Rationale12
1.4.2   An Example from Parallel I/O12
1.5   Organization of This Book13
2Working with Large-Scale Systems15
2.1   Nonblocking Collectives16
2.1.1   Example: 2-D FFT16
2.1.2   Example: Five-Point Stencil19
2.1.3   Matching, Completion, and Progression20
2.1.4   Restrictions22
2.1.5   Collective Software Pipelining23
2.1.6   A Nonblocking Barrier?27
2.1.7   Nonblocking Allreduce and Krylov Methods30
2.2   Distributed Graph Topologies31
2.2.1   Example: The Peterson Graph37
2.2.2   Edge Weights37
2.2.3   Graph Topology Info Argument39
2.2.4   Process Reordering39
2.3   Collective Operations on Process Topologies40
2.3.1   Neighborhood Collectives41
2.3.2   Vector Neighborhood Collectives44
2.3.3   Nonblocking Neighborhood Collectives45
2.4   Advanced Communicator Creation48
2.4.1   Nonblocking Communicator Duplication48
2.4.2   Noncollective Communicator Creation50
3Introduction to Remote Memory Operations55
3.1   Introduction57
3.2   Contrast with Message Passing59
3.3   Memory Windows62
3.3.1   Hints on Choosing Window Parameters64
3.3.2   Relationship to Other Approaches65
3.4   Moving Data65
3.4.1   Reasons for Using Displacement Units69
3.4.2   Cautions in Using Displacement Units70
3.4.3   Displacement Sizes in Fortran71
3.5   Completing RMA Data Transfers71
3.6   Examples of RMA Operations73
3.6.1   Mesh Ghost Cell Communication74
3.6.2   Combining Communication and Computation84
3.7   Pitfalls in Accessing Memory88
3.7.1   Atomicity of Memory Operations89
3.7.2   Memory Coherency90
3.7.3   Some Simple Rules for RMA91
3.7.4   Overlapping Windows93
3.7.5   Compiler Optimizations93
3.8   Performance Tuning for RMA Operations95
3.8.1   Options for MPI_Win_create95
3.8.2   Options for MPI_Win_fence97
4Advanced Remote Memory Access101
4.1   Passive Target Synchronization101
4.2   Implementing Blocking, Independent RMA Operations102
4.3   Allocating Memory for MPI Windows104
4.3.1   Using MPI_Alloc_mem and MPI_Win_allocate from C104
4.3.2   Using MPI_Alloc_mem and MPI_Win_allocate from Fortran 2008105
4.3.3   Using MPI_ALLOC_MEM and MPI_WIN_ALLOCATE from Older Fortran107
4.4   Another Version of NXTVAL108
4.4.1   The Nonblocking Lock110
4.4.2   NXTVAL with MPI_Fetch_and_op110
4.4.3   Window Attributes112
4.5   An RMA Mutex115
4.6   Global Arrays120
4.6.1   Create and Free122
4.6.2   Put and Get124
4.6.3   Accumulate127
4.6.4   The Rest of Global Arrays128
4.7   A Better Mutex130
4.8   Managing a Distributed Data Structure131
4.8.1   A Shared-Memory Distributed List Implementation132
4.8.2   An MPI Implementation of a Distributed List135
4.8.3   Inserting into a Distributed List140
4.8.4   An MPI Implementation of a Dynamic Distributed List143
4.8.5   Comments on More Concurrent List Implementations145
4.9   Compiler Optimization and Passive Targets148
4.10   MPI RMA Memory Models149
4.11   Scalable Synchronization152
4.11.1   Exposure and Access Epochs152
4.11.2   The Ghost-Point Exchange Revisited153
4.11.3   Performance Optimizations for Scalable Synchronization155
4.12   Summary156
5Using Shared Memory with MPI157
5.1   Using MPI Shared Memory159
5.1.1   Shared On-Node Data Structures159
5.1.2   Communication through Shared Memory160
5.1.3   Reducing the Number of Subdomains163
5.2   Allocating Shared Memory163
5.3   Address Calculation165
6Hybrid Programming169
6.1   Background169
6.2   Thread Basics and Issues170
6.2.1   Thread Safety171
6.2.2   Performance Issues with Threads172
6.2.3   Threads and Processes173
6.3   MPI and Threads173
6.4   Yet Another Version of NXTVAL176
6.5   Nonblocking Version of MPI_Comm_accept178
6.6   Hybrid Programming with MPI179
6.7   MPI Message and Thread-Safe Probe182
7Parallel I/O187
7.1   Introduction187
7.2   Using MPI for Simple I/O187
7.2.1   Using Individual File Pointers187
7.2.2   Using Explicit Offsets191
7.2.3   Writing to a File194
7.3   Noncontiguous Accesses and Collective I/O195
7.3.1   Noncontiguous Accesses195
7.3.2   Collective I/O199
7.4   Accessing Arrays Stored in Files203
7.4.1   Distributed Arrays204
7.4.2   A Word of Warning about Darray206
7.4.3   Subarray Datatype Constructor207
7.4.4   Local Array with Ghost Area210
7.4.5   Irregularly Distributed Arrays211
7.5   Nonblocking I/O and Split Collective I/O215
7.6   Shared File Pointers216
7.7   Passing Hints to the Implementation219
7.8   Consistency Semantics221
7.8.1   Simple Cases224
7.8.2   Accessing a Common File Opened with MPI_COMM_WORLD224
7.8.3   Accessing a Common File Opened with MPI_COMM_SELF227
7.8.4   General Recommendation228
7.9   File Interoperability229
7.9.1   File Structure229
7.9.2   File Data Representation230
7.9.3   Use of Datatypes for Portability231
7.9.4   User-Defined Data Representations233
7.10   Achieving High I/O Performance with MPI234
7.10.1   The Four “Levels” of Access234
7.10.2   Performance Results237
7.11   An Example Application238
7.12   Summary242
8Coping with Large Data243
8.1   MPI Support for Large Data243
8.2   Using Derived Datatypes243
8.3   Example244
8.4   Limitations of This Approach245
8.4.1   Collective Reduction Functions245
8.4.2   Irregular Collectives246
9Support for Performance and Correctness Debugging249
9.1   The Tools Interface250
9.1.1   Control Variables251
9.1.2   Performance Variables257
9.2   Info, Assertions, and MPI Objects263
9.3   Debugging and the MPIR Debugger Interface267
9.4   Summary269
10Dynamic Process Management271
10.1   Intercommunicators271
10.2   Creating New MPI Processes271
10.2.1   Parallel cp: A Simple System Utility272
10.2.2   Matrix-Vector Multiplication Example279
10.2.3   Intercommunicator Collective Operations284
10.2.4   Intercommunicator Point-to-Point Communication285
10.2.5   Finding the Number of Available Processes285
10.2.6   Passing Command-Line Arguments to Spawned Programs290
10.3   Connecting MPI Processes291
10.3.1   Visualizing the Computation in an MPI Program292
10.3.2   Accepting Connections from Other Programs294
10.3.3   Comparison with Sockets296
10.3.4   Moving Data between Groups of Processes298
10.3.5   Name Publishing299
10.4   Design of the MPI Dynamic Process Routines302
10.4.1   Goals for MPI Dynamic Process Management302
10.4.2   What MPI Did Not Standardize303
11Working with Modern Fortran305
11.1   The mpi_f08 Module305
11.2   Problems with the Fortran Interface306
11.2.1   Choice Parameters in Fortran307
11.2.2   Nonblocking Routines in Fortran308
11.2.3   Array Sections310
11.2.4   Trouble with LOGICAL311
12Features for Libraries313
12.1   External Interface Functions313
12.1.1   Decoding Datatypes313
12.1.2   Generalized Requests315
12.1.3   Adding New Error Codes and Classes322
12.2   Mixed-Language Programming324
12.3   Attribute Caching327
12.4   Using Reduction Operations Locally331
12.5   Error Handling333
12.5.1   Error Handlers333
12.5.2   Error Codes and Classes335
12.6   Topics Not Covered in This Book335
13Conclusions341
13.1   MPI Implementation Status341
13.2   Future Versions of the MPI Standard341
13.3   MPI at Exascale342
   MPI Resources on the World Wide Web343
   References345
   Subject Index353
   Function and Term Index359