MPI Message Queue Dumping ------------------------- Author : James Cownie Version: 1.06 Date : 9 October 2000 Status : Change log : October 9 2000:JHC Added mqs_get_comm_group, MPI 2 support functions and discussion. March 21 2000: JHC Added the new function mqs_dll_taddr_width June 1 1999: JHC Added eurompi-paper.ps.gz to files May 23 1998: JHC Added extra information results on operations May 21 1998: JHC Added name for library TV uses on AIX. This document describes the interface between TotalView (or any other debugger) and a dynamic library which is used for extracting message queue information from an MPI program. Further details are contained in comments in the "mpi_interface.h" include file, which should be taken as definitive. If you want to use this interface with debuggers other than TotalView, please talk to us. It is extremely unlikely that we will object, however we would like to know who is using it, so that we can consult them before making changes to the interface (and it's bound to change !). The fundamental principle here is that the debugger dynamically loads a library *into* the debugger itself. Functions in the library are called by the debugger whenever it needs information about message queues in MPI processes. Since the library is running inside the debugger, it makes use of callbacks to the debugger to read target process store, and to convert the target byte ordered data into the format of the host on which the debugger is running (remember, we could be cross debugging...). The debugger also provides callbacks which allow the dll to extract information about the sizes of fundamental target types, and to look up structure types defined in the debug information in the executable image being debugged and find out field offsets within them. Interface for MPI 1 support --------------------------- This interface provides the basic support required to handle MPI message queue dumping in a static process, MPI-1 world. The functions and types here are all declared in the mpi_interface.h header no matter what #define symbols are defined. The extensions for MPI 2 are discussed below, they are upwards compatible with the MPI 2 interface, so a DLL built to this interface can continue to be used with a debugger which supports the full MPI 2 interface (though, of course, the MPI 2 features would not be available). Startup ------- Before any functions in the dll can be called, the debugger has to load the dynamic library. To do this it must find the name of the library to be loaded. The way that this works is outside the scope of the specification for the message queue dumping interface, however since this information is likely to be useful to people working with TotalView, here is a description of the way that we have chosen to implement it. After it has started up (or attached to) a parallel program, TotalView looks for a global symbol char MPIR_dll_name[]; in the target image (*NOT* the dll of course, since that hasn't been loaded yet). If this symbol exists, then its value is expected to be the name of a shared library to be loaded. If this symbol does not exist (or its value is an empty string), then TotalView will use a default library name which depends on the process attachment mechanism which was used, as follows AIX poe attachment libtvibmmpi.so then libtvmpich.so All platforms MPICH attachment libtvmpich.so SGI SGI/MPI attachment libxmpi.so If no library can be found, then MPI message queue dumping is disabled. On some platforms it may be necessary to take additional steps at compile or link time to prevent the apparently unreferenced global variable from being removed. This is compiler/linker specific. Startup calls in the DLL ------------------------ Once the debugger has loaded the library it will first check that all the functions it expects are present in the library, and that the library compatibility version number is that which it requires. If any of these tests fail, then it will issue a complaint and disable message queue dumping. These tests occur through the functions char *mqs_version_string (); int mqs_version_compatibility(); int mqs_dll_taddr_width(); in the DLL. mqs_dll_taddr_width() should _always_ be implemented as /* So the debugger can tell what interface width the library was compiled with */ int mqs_dll_taddr_width () { return sizeof (mqs_taddr_t); } /* mqs_dll_taddr_width */ Once these tests have been successfully passed, the debugger will call the initialisation function void mqs_setup_basic_callbacks (const mqs_basic_callbacks *); passing it an initialised structure containing the addresses of the call back functions in the debugger. This function is called *only* when the library is loaded, the DLL should save the pointer to the callback structure so that it can call the functions therein as required. For each executable image which is used by the processes that go to make up the parallel program TotalView will call the function int mqs_setup_image (mqs_image *, const mqs_image_callbacks *); This call should save image specific information by using the callback the debugger callback mqs_put_image_info (mqs_image *, mqs_image_info *); if it requires any, and return mqs_ok, or an error value. Provided that mqs_setup_image completes successfully, the debugger will then call int mqs_image_has_queues (mqs_image *, char **); mqs_image_has_queues returns mqs_ok or an error indicator to show whether the image has queues or not, and an error string to be used in a pop-up complaint to the user, as if in printf (error_string,name_of_image); The pop-up display is independent of the result. (So you can silently disable things, or loudly enable them). For each image which passes the mqs_image_has_queues test, the debugger will call int mqs_setup_process (mqs_process *, const mqs_process_callbacks *); on each process which is an instance of the image. This should be used to setup process specific information using void mqs_put_process_info (mqs_process *, mqs_process_info *); If this succeeds, then, to allow a final verification that the process has all of the necessary data in it to allow message queue display, int mqs_process_has_queues (mqs_process *, char **); will be called in an exactly analogous way to mqs_image_has_queues. Whenever an error code is returned from the DLL, the debugger should use the function char * mqs_dll_error_string (int errcode) to obtain a printable representation of the error, and provide it to the user in some appropriate way. That completes the initialisation. Queue display ------------- Before displaying the message queues the debugger will call the function void mqs_update_communicator_list (mqs_process *); The DLL can use this to check that it has an up to date model of the communicators in the process and (if necessary) update its list of communicators. The debugger will then iterate over each communicator, and within each communicator over each of the pending operations in each of the message queues using the functions int mqs_setup_communicator_iterator (mqs_process *); int mqs_get_communicator (mqs_process *, mqs_communicator *); int mqs_next_communicator (mqs_process *); to iterate over the communicators and int mqs_setup_operation_iterator (mqs_process *, int); int mqs_next_operation (mqs_process *, mqs_pending_operation *); to iterate over each of the operations on a specific queue for the current communicator. mqs_setup_*_iterator should return either mqs_ok if it has information mqs_no_information if no information is available about the requested queue or, if it can tell mqs_end_of_list if it has information but there are no operations It is permissible to return mqs_ok, but then have mqs_next_operation immediately return mqs_end_of_list, and similarly the debugger is at liberty to call mqs_next_* even if the initialisation function returned mqs_end_of_list. mqs_next_* should return mqs_ok if there is another element to look at (in the case of the operation iterator it has returned the operation), mqs_end_of_list if there are no more elements (in the case of the operation iterator it has *not* returned anything in the pending_operation). some other error code if it detects an error The useful information is returned in the structures /* A structure to represent a communicator */ typedef struct { taddr_t unique_id; /* A unique tag for the communicator */ tword_t local_rank; /* The rank of this process (Comm_rank) */ tword_t size; /* Comm_size */ char name[64]; /* the name if it has one */ } mqs_communicator; and typedef struct { /* Fields for all messages */ int status; /* Status of the message (really enum mqs_status) */ mqs_tword_t desired_local_rank; /* Rank of target/source -1 for ANY */ mqs_tword_t desired_global_rank; /* As above but in COMM_WORLD */ int tag_wild; /* Flag for wildcard receive */ mqs_tword_t desired_tag; /* Only if !tag_wild */ mqs_tword_t desired_length; /* Length of the message buffer */ int system_buffer; /* Is it a system or user buffer ? */ mqs_taddr_t buffer; /* Where data is */ /* Fields valid if status >= matched or it's a send */ mqs_tword_t actual_local_rank; /* Actual local rank */ mqs_tword_t actual_global_rank; /* As above but in COMM_WORLD */ mqs_tword_t actual_tag; mqs_tword_t actual_length; /* Additional strings which can be filled in if the DLL has more * info. (Uninterpreted by the debugger, simply displayed to the * user). * * Can be used to give the name of the function causing this request, * for instance. * * Up to five lines each of 64 characters. */ char extra_text[5][64]; } mqs_pending_operation; The debugger will display communicators in the order in which they are returned by the communicator iterator, and pending operations in the order returned by the operation iterator. It is up to the DLL to ensure that communicators are sorted into a sensible order, and that operations are shown in the order in which MPI would match them. Note that the name string can be used by the library to display any useful information about the communicator (the debugger does not interpret it at all, so you can put whatever string you like into the "name"); Similarly any additional information about a pending operation can be returned in the extra_text strings. For instance, if it is available, the name of the function which created the request could be returned. The extra_text strings are displayed until a zero length one is encountered, or all 5 have been displayed, so if any extra information has been returned the library must ensure that it nulls out the next string e.g. strcpy (res->extra_text[0],"Funny message"); res->extra_text[1][0] = 0; Otherwise uninitialized store will likely be displayed as characters :-( The function mqs_get_comm_group may be called by the debugger to extract from the DLL the set of processes which are used in a communicator. This call is not used while displaying the message queues (it is not required, since the message and communicator descriptions include global indices), however it will be used to support partial job acquisition (in which a user acquires only the processes in a given communicator, rather than all of the processes in the whole job). If mqs_get_comm_group is not present, the debugger should continue to function but will be unable to implement partial job acquisition. Shutdown -------- When the debugger terminates a process, or unloads information about an executable image, it will call back into the DLL to allow it to clean up any information it had associated with the process or image. Note that once loaded, the DLL itself is never unloaded, so you cannot rely on static variables in the DLL being re-initialized. MPI 2 extensions to the interface --------------------------------- The objective is to provide the current functionality in an MPI-2 environment where new processes may be being created by Comm_spawn (or Comm_spawn_multiple). Support for displaying the state of one-sided communications is not considered, though it may be added later. There are two distinct problems here :- 1) Determining that MPI_Spawn has occurred and locating the new processes. 2) Providing a unique name for each of the new processes. All of the functions and datatypes for MPI 2 support are only defined by mpi_interface.h if the preprocessor symbol FOR_MPI2 is defined to a non-zero value, otherwise the header only defines the MPI 1 interface. (This is to allow DLL authors who do not need the MPI-2 interface to continue to compile with the correct version number for the MPI-1 only interface). Collecting new processes ------------------------ The debugger must be informed by the MPI run time when new processes are created, and the DLL must also have a call to enable the debugger to enquire about processes which exist so that it can correctly locate all processes when attaching to an existing job which has already performed spawning operations. To handle the reporting of new processes, we add two new calls to the DLL from the debugger. extern int mqs_spawn_breakpoint_name (mqs_process *, char [], int); This function returns the name of a function on which the debugger should set a breakpoint. The breakpoint will be hit (i.e. that function will be called in the target process) when a spawn operation has completed and the DLL can extract from the target the information about the newly created processes. The result is returned in the second argument as a string, the amount of space allocated to hold the result is passed in the third argument. If the space is too small, an appropriate error should be returned. Once a process has hit the breakpoint above, then the debugger will call into the DLL to discover the new processes. As elsewhere in the debug interface, we use an iterator and iterate over the new processes. typedef struct mqs_process_location { long pid; char image_name [FILENAME_MAX]; char host_name [64]; } mqs_process_location; extern int mqs_setup_new_process_iterator (mqs_process *); extern int next_new_process (mqs_process *, mqs_process_location *); When attaching to a parallel job other than at job startup, the debugger will call mqs_setup_new_process_iterator () on each process which it acquires to collect any new processes which were spawned by that process. If the process has not spawned (or connected) to any new processes, then mqs_end_of_list should be returned. Once the debugger has acquired the new (spawned) processes, it will call mqs_set_process_identity (described below) to name the new process. The Process naming problem -------------------------- The fundamental problem we have is that in the interface between the debug DLL and the debugger the index of a process in COMM_WORLD is used as the "real" name of a process. For instance when dealing with a pending operation, the DLL fills in an mqs_pending_operation struct which contains the fields /* Fields for all messages */ mqs_tword_t desired_local_rank; /* Rank of target/source -1 for ANY */ mqs_tword_t desired_global_rank; /* As above but in COMM_WORLD */ /* Fields valid if status >= matched or it's a send */ mqs_tword_t actual_local_rank; /* Actual local rank */ mqs_tword_t actual_global_rank; /* As above but in COMM_WORLD */ to provide information about the source or target of a message. However, in MPI-2 as soon as the program has used Comm_spawn, there will be processes in the MPI job which do not have an index in COMM_WORLD, since they are not members of COMM_WORLD. It then becomes impossible to use the current interface without extension. The core issue here is that in a dynamic MPI-2 environment MPI does not force a process to have a unique name. Rather processes are addressed within communicators which are formed hierarchically, and, indeed, simultabeous process creation may be going on in two different communicators simultaneously without synchronisation between them. The Solution ------------ The proposed solution is to have the debugger generate a unique identifier (integer) for each process, and pass that down into the DLL at the time that the newly created process is registered with the debugger as a result of a Comm_spawn. This integer is then used in the results of the enquiry functions as though it were the index of the new process in COMM_WORLD. If you like, you can think of this as changing the global_rank fields to be indices into the debugger's concept of COMM_UNIVERSE, within which the debugger is allocating the indices. All of the current enquiry functions to the DLL can then continue to return exactly the same data structures, all that changes is the interpretation of the global rank fields (and that remains the same for MPI-1 programs). However, to make this workable from the debug library writer's perspective, we do need to add an additional object which is missing in the current library interface, i.e. a "job" object. This is necessary because the debug library writer needs some object on which to hang the translation table which will now be required to handle the translation form whatever internal name is used inside the MPI system for the spawned processes back to the index which the debugger wants to see. This cannot be held on the individual process object, since the name will turn up when looking at message queues in other processes, therefore it must be held on the job object which is accessible from all of the process objects. Therefore we have to extend the interface to add these new types, functions and callbacks :- /* * New types */ /* Information hung from a job object */ typedef struct _mqs_job_info mqs_job_info; /* A job object */ typedef struct mqs_job_ mqs_job; /* * New callbacks */ /* Given a job and an index return the process object */ typedef mqs_process * (*mqs_get_process_ft) (mqs_job *, int); typedef struct mqs_job_callbacks { mqs_get_process_ft mqs_get_process_fp; } mqs_job_callbacks; /* Hang information on the job */ typedef void (*mqs_put_job_info_ft) (mqs_job *, mqs_job_info *); /* Get it back */ typedef mqs_job_info * (*mqs_get_job_info_ft) (mqs_job *); /* Given a process, return the job it belongs to */ typedef mqs_job * (*mqs_get_process_job_ft) (mqs_job *); typedef int (*mqs_get_process_identity_ft) (mqs_job *); /* The callback tables are extended so that the basic callbacks * includes */ mqs_put_job_info_ft mqs_put_job_info_fp; mqs_get_job_info_ft mqs_get_job_info_fp; /* * and the process_callbacks include */ mqs_get_process_job_ft mqs_get_process_job_fp; mqs_get_process_identity_ft mqs_get_process_identity_fp; /* * New calls from the debugger into the DLL are defined * for managing the job object. */ extern int mqs_setup_job (mqs_job *, const mqs_job_callbacks *); extern int mqs_destroy_job_info (mqs_job_info *); /* * Finally we have the additional function we really need ! */ extern int mqs_set_process_identity (mqs_process *, int); Compatibility ------------- Although there may seem to be many changes here, in fact they remain upwards compatible with the existing interface, and MPI-1 libraries do not need to implement any of these new MPI-2 relevant calls. This is possible because in an MPI-1 environment the process identities remain the same as before these changes, and still represent the index of the process in COMM_WORLD. The new model that the process identity is actually allocated by the debugger and represents an index into a debugger maintained concept of COMM_UNIVERSE does not affect the processes in COMM_WORLD, since the debugger can arrange that its model of COMM_UNIVERSE includes COMM_WORLD as a prefix and, therefore, that the identities of processes in COMM_WORLD remains the same. When the debugger knows that the DLL supports the new MPI-2 interface, then it will call mqs_setup_job and mqs_set_process_identity when it attaches to an MPI job. When detaching, or when the job exits mqs_destroy_job_info will be called (after all of the processes have been destroyed). An MPI-1 only DLL need not provide any of these functions (and should use the mpi_interface.h file without the FOR_MPI2 symbol #defined (or defined as 0). Store allocation ---------------- Callbacks are provided to allow the DLL to use the debugger's heap package. These should be used in preference to calling malloc/free directly, since better debug information from store corruption may be possible. Type safety across the interface -------------------------------- All of the functions in the interface have complete prototypes, and there are opaque structures defined for all of the different types. This means that inside the functions in the library you will need explicit casts from these opaque types to the corresponding concrete types that you require. The use of explicit casts is unpleasant, however it seems hard to avoid this without mixing implementation and interface issues in the mpi_interface.h file. Miscellaneous issues -------------------- The debugger may be written in C++, and may not have been compiled with the "default" C++ compiler for the host on which it is running. You are extremely likely to run into library compatibility and version problems if you try to write the DLL in C++ as well. The dll should *not* perform operations which could cause it to block (such as reading from sockets, calling sleep or wait()), since this could cause the whole debugger to be blocked and unresponsive to user input. The dll should avoid declarations of external functions, ideally by using static functions. If this is not possible (because the DLL becomes larger than one file), then it should use external names with a suitable prefix, such as the vendor of the MPI library, e.g. "Mqs_sgi_", or "Mqs_ibm_". Sample code ----------- With this README you should also have eurompi-paper.ps.gz A paper by James Cownie and Bill Gropp which describes the interface, presented at the "EuroPVM and MPI 99" conference. mpi_interface.h The header which defines the interface Makefile.simple A simple makefile to build a dynamic library on Solaris dll_mpich.c Sample example code which uses this interface to provide message queue dumping for MPICH 1.x mpich_dll_defs.h Header for the sample code. The last three files are not a part of the interface, or definition, but rather a sample user of the interface, provided on an as is basis so that you can see how the interface can actually be used. mpich-attach.txt Documentation on the process acquisition technique used with MPICH (and recommended for other third party MPI implementations).