
4.4 Performance Analysis of Threaded MPI Application

The goal of this section is to provide an example of how Jumpshot's ViewMap can be used to study the performance of threaded MPI applications. Suppose we want to compare the performance of different MPI implementations in a threaded environment. We will use a simple multi-threaded MPI program to see if there is any performance difference. The test program used here, pthread_sendrecv.c, first creates multiple threads in each process; for each thread, MPI_COMM_WORLD is duplicated, and the thread then forms a ring within its own duplicated communicator, i.e. it sends a message to the next rank and receives a message from the previous rank. The program is shown at the end of the document. MPE is built with -enable-threadlogging (footnote 4.4) and -disable-safePMPI (footnote 4.5). The most accessible MPI implementations with MPI_THREAD_MULTIPLE support are MPICH2 and OpenMPI; we will use the latest stable releases, MPICH2 1.0.5p4 and OpenMPI 1.2.3, for this demonstration. Since OpenMPI has the option to enable a progress thread in addition to the standard thread support, we will build two different versions of OpenMPI for this little experiment. The experiment is performed on 4 AMD64 nodes running Ubuntu Linux 7.04, each with 4 cores, and the test program is run with 1 to 6 child threads to see whether oversubscription has any effect on the send and receive performance.
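As a rough illustration, the following is a minimal sketch of such a test, not the actual pthread_sendrecv.c listing, which appears in Section 4.4.1. Here the master thread performs all the MPI_Comm_dup() calls before spawning (MPI_Comm_dup() is collective, and the figure captions below show these calls on the master thread 0); the thread-count handling and message payload are illustrative assumptions.

    /* Minimal sketch in the spirit of pthread_sendrecv.c: the master
     * thread duplicates MPI_COMM_WORLD once per child thread, and each
     * child thread does one ring exchange over its own communicator. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_THREADS 16

    static void *ring_sendrecv(void *arg)
    {
        MPI_Comm comm = *(MPI_Comm *) arg;
        int rank, size, sendbuf, recvbuf;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        sendbuf = rank;

        /* Ring: send to the next rank, receive from the previous rank,
         * entirely within this thread's duplicated MPI_COMM_WORLD. */
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, (rank + 1) % size, 0,
                     &recvbuf, 1, MPI_INT, (rank + size - 1) % size, 0,
                     comm, MPI_STATUS_IGNORE);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        pthread_t threads[MAX_THREADS];
        MPI_Comm  comms[MAX_THREADS];
        int       provided, nthreads, i;

        /* The test requires full multithreaded MPI support. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE is not supported\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        nthreads = (argc > 1) ? atoi(argv[1]) : 1;  /* child-thread count */
        if (nthreads > MAX_THREADS)
            nthreads = MAX_THREADS;

        for (i = 0; i < nthreads; i++) {
            /* All MPI_Comm_dup() calls happen in the master thread. */
            MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);
            pthread_create(&threads[i], NULL, ring_sendrecv, &comms[i]);
        }
        for (i = 0; i < nthreads; i++) {
            pthread_join(threads[i], NULL);
            MPI_Comm_free(&comms[i]);
        }

        MPI_Finalize();
        return 0;
    }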

Table 4.2 shows the total duration of the 4-process run with various numbers of child threads. The data show that as the number of child threads increases, so does the total runtime. For MPICH2, the runtime increase is modest for each additional thread. For OpenMPI+progress_thread, the performance is not as good as MPICH2's, but it remains reasonable as the number of threads increases. However, for OpenMPI without progress-thread support, the runtime increases drastically once there are 3 or more child threads, and the situation becomes very bad when the node is oversubscribed, i.e. when there are 5 or more child threads. Now we will use MPE logging and Jumpshot to find out what happens.


Table 4.2: The total runtime (in seconds) of the 4-process run of pthread_sendrecv with various numbers of child threads in different MPI implementations. The 2nd column header, MPICH2, refers to MPICH2-1.0.5p4 built with the default sock channel, which has MPI_THREAD_MULTIPLE support. The 3rd column header, OpenMPI+progress_thread, refers to OpenMPI-1.2.3 configured with -enable-mpi-threads and -enable-progress-threads. The 4th column header, OpenMPI, refers to OpenMPI-1.2.3 built with -enable-mpi-threads only, which enables MPI_THREAD_MULTIPLE support. The last column lists the figure that analyzes the corresponding OpenMPI run.

child thread count   MPICH2     OpenMPI+progress_thread   OpenMPI    Figure
        1            0.025299          0.029545           0.029230     -
        2            0.026213          0.030872           0.032966    4.7
        3            0.028916          0.038964           0.050484    4.8
        4            0.030145          0.045354           0.054791    4.9
        5            0.031977          0.058039           0.149200    4.10
        6            0.034462          0.058505           0.193399    4.11


The problematic runs in the last column of Table 4.2 are analyzed with two Jumpshot ViewMaps each. They are shown in Figures 4.7, 4.8, 4.9, 4.10 and 4.11. The legend for these pictures is shown in Figure 4.6.

Figure 4.6: The legend table of all the pthread_sendrecv runs.

Image pthread_sendrecv_legend

The extra viewmaps provided in MPE logging are:

1) Process-Thread view: each thread timeline is shown nested under the process timeline it belongs to. Since we are running only 4 processes, there are only 4 process timelines here.

2) Communicator-Thread view: each thread timeline is shown nested within the timeline of the communicator it uses. Since we are running with 2 to 6 child threads and a duplicated MPI_COMM_WORLD is created for each thread, we expect to see 3 to 7 major communicator timelines. MPI_COMM_WORLD is always labeled 0 in the CLOG2-converted SLOG-2 file; each duplicated MPI_Comm is labeled with another integer that depends on the order in which it was created.

When the timeline window of the Process-Thread view first shows up, only process timelines are visible, i.e. all thread timelines are nested within their process timelines. The user needs to use the Y-axis LabelExpand button (Image TreeExpand24) or Alt-E to expand each process timeline and reveal the thread timelines. Similarly, the Y-axis LabelCollapse button (Image TreeCollapse24) or Alt-C collapses the thread timelines back into their corresponding process timeline. The same LabelExpand and LabelCollapse buttons expand and collapse the communicator timelines in the Communicator-Thread view.

Figures 4.8, 4.9, 4.10 and 4.11 clearly demonstrate that there is some kind of communication progress problem in OpenMPI when it is used without a progress thread. Without alternating between the Communicator-Thread and Process-Thread views, it would be difficult to identify the existence of a progress-engine problem.

Figure 4.7: OpenMPI without progress thread: 2 child threads per process. As shown in both the Process-Thread and Communicator-Thread views here, everything performs as expected.

[process-thread view]Image openmpi_4_2_procthd

[communicator-thread view]Image openmpi_4_2_commthd

Figure 4.8: OpenMPI without progress thread: 3 child threads per process, i.e. 3 MPI_Comm_dup() calls in the master thread 0. As shown in the expanded Process-Thread view, the 3rd MPI_Comm_dup() call takes significantly longer than the first two. The expanded Communicator-Thread view also suggests that the delayed 3rd MPI_Comm_dup() is blocking MPI point-to-point communication in the first two duplicated MPI_COMM_WORLDs. As soon as the delayed MPI_Comm_dup() exits, the MPI point-to-point communication resumes.

[process-thread view]Image openmpi_4_3_procthd

[communicator-thread view]Image openmpi_4_3_commthd

Figure 4.9: OpenMPI without progress thread: 4 child threads per process. Similar to Fig. 4.8, the 3rd MPI_Comm_dup() is delayed, but not the 4th. The interference between the delayed 3rd MPI_Comm_dup() and the other duplicated MPI_COMM_WORLDs seen in Fig. 4.8 is not observed here, so the communication in the first two duplicated MPI_COMM_WORLDs finishes much earlier than the communication in the last two communicators.

[process-thread view]Image openmpi_4_4_procthd

[communicator-thread view]Image openmpi_4_4_commthd

Figure 4.10: OpenMPI without progress thread: 5 child threads per process. Again, the last MPI_Comm_dup() takes longer to finish than the previous ones. The behavior observed in Fig. 4.8, where the delayed MPI_Comm_dup() blocks communication in the other communicators, occurs here as well. However, even long after all MPI_Comm_dup() calls are done, there are many regions in the Communicator-Thread view where MPI communication is not progressing, i.e. some kind of temporary deadlock in the MPI progress engine may be happening here.

[process-thread view]Image openmpi_4_5_procthd

[communicator-thread view]Image openmpi_4_5_commthd

Figure 4.11: OpenMPI without progress thread: 6 child threads per process. This is very similar to Fig. 4.10.

[process-thread view]Image openmpi_4_6_procthd

[communicator-thread view]Image openmpi_4_6_commthd



Footnotes

Footnote 4.4: -enable-threadlogging
-enable-threadlogging enables MPE to build a thread-safe MPI logging library; this is implemented by placing a global mutex around the MPE logging library, which is not yet thread-safe itself.
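Conceptually, the wrapper can be pictured as follows. This is only a sketch of the global-mutex idea; the function names are hypothetical and do not reflect MPE's actual internals.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for a non-thread-safe logging routine. */
    static void log_event_unsafe(int event_id)
    {
        printf("event %d\n", event_id);
    }

    /* Thread-safe entry point: one global mutex serializes all
     * logging calls, at the cost of contention between threads. */
    void log_event(int event_id)
    {
        pthread_mutex_lock(&log_mutex);
        log_event_unsafe(event_id);
        pthread_mutex_unlock(&log_mutex);
    }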
Footnote 4.5: -disable-safePMPI
MPE by default does -enable-safePMPI to protect the logging code from circular logging in unknown MPI implementations where MPI calls are implemented with other MPI calls. Essentially, -enable-safePMPI disables logging before making each PMPI call and re-enables it when the PMPI call returns. Using -disable-safePMPI in MPE eliminates this layer of protection but allows the lowest possible logging overhead.
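The safePMPI idea can be sketched for a single wrapped call as follows; the flag name and the printf logging calls are hypothetical stand-ins for MPE's real machinery.

    #include <mpi.h>
    #include <stdio.h>

    static int logging_active = 1;   /* hypothetical per-process flag */

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int err;
        if (logging_active) {
            logging_active = 0;           /* suspend logging around PMPI */
            printf("begin MPI_Send\n");   /* stand-in for a logging call */
            err = PMPI_Send(buf, count, datatype, dest, tag, comm);
            printf("end MPI_Send\n");
            logging_active = 1;           /* resume logging on return */
        } else {
            /* A nested MPI call made from inside the implementation
             * takes this branch, so no circular logging occurs. */
            err = PMPI_Send(buf, count, datatype, dest, tag, comm);
        }
        return err;
    }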


Subsections

4.4.1 Test program