MPICH Frequently Asked Questions


  • Introduction
  • Installing MPICH
  • Using MPICH
  • Permission Denied
  • Notes on getting MPICH running Under Linux
  • poll: protocol failure during circuit creation
  • Using SSH
  • Mac OS X and hostname
  • SIGSEGV
  • semop lock failed
  • Compiler Switches
  • C++ Builds Fail
  • Fortran programs give errors about mismatched types
  • Missing Symbols When Linking
  • Warning messages while building MPICH
  • MPMD (Multiple Program Multiple Data) Programs
  • Reporting problems and support
  • Algorithms used in MPICH
  • Jumpshot and X11

    Introduction

    MPICH is a freely available, portable implementation of MPI, the Standard for message-passing libraries.

    Installing MPICH

    Building and installing MPICH often requires only
     ./configure --prefix=/home/me/mpich
     make
     make install
    
    where the value of the --prefix argument to configure is the directory in which MPICH should be installed. See the Installation Guide for more detailed instructions.

    Using MPICH

    Information on using MPICH can be found in the Installation and Users' Guide. There are versions of this guide for each MPICH device; make sure that you read the one that applies to your device. You can find out which device is installed by running a simple MPI program with the option -mpiversion:
        mpirun -np 1 a.out -mpiversion
    
    All of the guides are available at http://www.mcs.anl.gov/mpi/mpich1/docs.html .

    Permission Denied

    Question:
    When I use mpirun, I get the message Permission denied, connection reset by peer, or poll: protocol failure in circuit setup when trying to run MPICH.

    Answer:
    If you see something like this

        % mpirun -np 2 cpi 
        Permission denied.
    
    (or connection reset by peer or poll: protocol failure in circuit setup) when using the ch_p4 device, it probably means that you do not have permission to use rsh to start processes. The script tstmachines can be used to test this. For example, if the architecture type (the -arch argument to configure) is sun4, then try
        tstmachines sun4
    
    If this fails, then you may need a .rhosts or /etc/hosts.equiv file (you may need to see your system administrator) or you may need to use the p4 server. Another possible problem is the choice of the remote shell program; some systems have several. Check with your system administrator about which version of rsh or remsh you should be using.

    If your system allows a .rhosts file, do the following:

  • Create a file .rhosts in your home directory
  • Change the protection on it to user read/write only: chmod og-rwx .rhosts.
  • Add one line to the .rhosts file for each processor that you want to use. The format is
        host username
    
    For example, if your username is doe and you want to use machines a.our.org and b.our.org, your .rhosts file should contain
    a.our.org doe
    b.our.org doe
    
    Note the use of fully qualified host names (some systems require this).
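
The steps above can be sketched as a short shell script. This is only a minimal sketch, not something shipped with MPICH: it writes into a scratch directory (the variable target_dir is an assumption made here) so that it does not touch a real home directory, and it reuses the example user doe and the hosts a.our.org and b.our.org.

```shell
#!/bin/sh
# Sketch: build a .rhosts file with owner-only permissions.
# target_dir is a scratch directory here; point it at "$HOME" for real use.
target_dir=$(mktemp -d)
rhosts="$target_dir/.rhosts"
user=doe

# Create the file empty, then restrict it before adding any entries.
: > "$rhosts"
chmod og-rwx "$rhosts"

# One "host username" line per machine, using fully qualified host names.
for host in a.our.org b.our.org; do
    printf '%s %s\n' "$host" "$user" >> "$rhosts"
done

ls -l "$rhosts"
cat "$rhosts"
```

Restricting the permissions before writing any entries means the file is never readable by others while it holds host names.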

    On networks where the use of .rhosts files is not allowed (such as the one in MCS at Argonne), you should use the p4 server to run on machines that are not trusted by the machine that you are initiating the job from.

    Finally, you may need to use a non-standard rsh command within MPICH. MPICH must be reconfigured with -rsh=command_name, and perhaps also with -rshnol if the remote shell command does not support the -l argument. Systems using Kerberos and/or AFS may need this.

    Notes on getting MPICH running Under Linux

    Introduction

    The purpose of this document is to describe the steps necessary to allow MPICH processes to be started and to communicate with one another. The installation that we are focusing on is a RedHat 7.2 installation with medium security (the default). While other distributions will certainly vary, this is a good example of the sorts of problems that one might run across. There are three methods for starting MPICH processes that are typically used on clusters today. These are rsh, ssh, and mpd. We will first describe getting the rsh service working. We will include rlogin in this process because it is helpful for testing. Next we will describe getting ssh working and enabling the ssh-agent to allow for logins without password typing. Finally we will discuss issues related to process communication in MPICH and firewalls.

    Enabling rsh

    By default the rsh server is not installed, and it is necessary for use of the rsh service in starting MPICH processes. The rsh server, in.rshd, is part of the rsh-server RPM. This RPM is located on the first disc of the RedHat 7.2 distribution. The rlogin server, in.rlogind, is also included in this package. The xinetd server controls the availability of the rsh and rlogin services. This server is installed by default, but by default rsh and rlogin services are disabled. To enable these services, you must edit the files /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin. Here is the rsh file as it looks by default:
    # default: on
    # description: The rshd server is the server for the rcmd(3) routine and, \
    #       consequently, for the rsh(1) program.  The server provides \
    #       remote execution facilities with authentication based on \
    #       privileged port numbers from trusted hosts.
    service shell
    {
            socket_type             = stream
            wait                    = no
            user                    = root
            log_on_success          += USERID
            log_on_failure          += USERID
            server                  = /usr/sbin/in.rshd
            disable                 = yes
    }
    
    You must enable the service by changing "disable = yes" to "disable = no". The same must be done to the rlogin config file to enable that service. At this point the xinetd daemon must be restarted to register these changes:
    /etc/rc.d/init.d/xinetd restart
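
The edit described above (changing "disable = yes" to "disable = no" before restarting xinetd) could be scripted roughly as follows. This is a sketch under assumptions: it works on a scratch copy of an abbreviated service file (the variable conf is introduced here), whereas the real files are /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin and editing them requires root.

```shell
#!/bin/sh
# Sketch: flip "disable = yes" to "disable = no" in an xinetd service file.
# conf is a scratch copy; the real files live under /etc/xinetd.d/.
conf=$(mktemp)
cat > "$conf" <<'EOF'
service shell
{
        socket_type             = stream
        wait                    = no
        user                    = root
        server                  = /usr/sbin/in.rshd
        disable                 = yes
}
EOF

# Rewrite only the "disable" line, keeping its indentation and alignment.
# Write to a new file first, then move it into place.
sed 's/^\([[:space:]]*disable[[:space:]]*=[[:space:]]*\)yes[[:space:]]*$/\1no/' \
    "$conf" > "$conf.new"
mv "$conf.new" "$conf"

grep disable "$conf"
```

Anchoring the pattern on the word "disable" leaves other settings, such as "wait = no", untouched.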
    
    At this point you should receive a "Permission denied." if you attempt a command such as "rsh localhost hostname" as a non-root user (or as root for that matter). To allow users to rsh without passwords you need to edit /etc/hosts.equiv, the system-wide host file for rsh and rlogin. This file should hold hostnames of machines that you would like users to be able to start MPICH processes from. For example, simply adding:
    localhost.localdomain
    
    should allow users to perform the command "rsh localhost hostname" successfully. Likewise, adding other hostnames will allow users on those hosts to rsh to this host. However, there is another catch! By default (with medium security) packet filtering is enabled as well, and this will prevent users on remote hosts from connecting to this machine using the rsh or rlogin services. This packet filter, or firewall, is administered using the ipchains package (which is installed by default). The firewall configuration appears to be written out by a program called lokkit at installation time. The configuration is stored in /etc/sysconfig/ipchains and by default looks like this:
    # Firewall configuration written by lokkit
    # Manual customization of this file is not recommended.
    # Note: ifup-post will punch the current nameservers through the
    #       firewall; such entries will *not* be listed here.
    :input ACCEPT
    :forward ACCEPT
    :output ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
    -A input -s 0/0 -d 0/0 -i lo -j ACCEPT
    -A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
    -A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
    -A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
    
    While an in-depth discussion of ipchains rules is outside the context of this document, it's worth talking about how this works a bit. First, the rules are applied in order from top of the list to the bottom of the list. The argument to -j says what to do if a packet matches; it's usually either ACCEPT (let the packet in), or REJECT (toss it out). If a packet makes it through the entire list then the default policy is applied. In this case the default policy is ACCEPT. The following line tells the packet filter to allow all localhost (-i lo) traffic to pass unmolested:
    -A input -s 0/0 -d 0/0 -i lo -j ACCEPT
    
    This line blocks all new TCP connections going to ports 0-1023, which is the range of most services, including rsh/rlogin:
    -A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
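
To make the "first matching rule wins, otherwise the default policy applies" behaviour concrete, here is a toy model in shell. It is not ipchains itself: the function decide and the simplified "low high target" rule table are assumptions made for illustration, and the model only looks at the destination port of a new TCP connection arriving on a non-loopback interface.

```shell
#!/bin/sh
# Toy model of ordered rule evaluation (hypothetical helper, not ipchains).
# Rules are checked top to bottom; the first match decides the outcome.
decide() {
    port=$1
    while read -r low high target; do
        if [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]; then
            echo "$target"
            return
        fi
    done <<'EOF'
0 1023 REJECT
2049 2049 REJECT
6000 6009 REJECT
7100 7100 REJECT
EOF
    # No rule matched: fall through to the default policy.
    echo ACCEPT
}

decide 514     # rsh port, inside 0-1023
decide 5000    # matches no rule
```

With the default rules modelled this way, port 514 is rejected and port 5000 falls through to ACCEPT, which is why an ACCEPT rule for rsh has to appear before the 0:1023 REJECT rule.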
    
    We're going to modify this file to allow rsh and rlogin traffic.
    # Firewall configuration written by lokkit
    # Manual customization of this file is not recommended.
    # Note: ifup-post will punch the current nameservers through the
    #       firewall; such entries will *not* be listed here.
    :input ACCEPT
    :forward ACCEPT
    :output ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
    -A input -s 0/0 -d 0/0 -i lo -j ACCEPT
    #
    # New rules for rlogin/rsh traffic, incoming or outgoing
    #
    -A input -p tcp -s 0/0 -d 0/0 513 -b -j ACCEPT
    -A input -p tcp -s 0/0 -d 0/0 514 -b -j ACCEPT
    #
    # End of new rules
    #
    -A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
    -A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
    -A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
    
    At this point users on remote systems with accounts on this system should be able to rsh/rlogin to this machine without using a password.
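
Because rule order matters, the new ACCEPT rules must land ahead of the first REJECT rule. The following sketch shows one way that insertion might be automated; it operates on an abbreviated scratch copy (the variable rules is introduced here), not on /etc/sysconfig/ipchains, and it is not a tool provided by MPICH or RedHat.

```shell
#!/bin/sh
# Sketch: insert ACCEPT rules for rlogin (513) and rsh (514) before the
# first REJECT rule in a scratch copy of the ipchains configuration.
rules=$(mktemp)
cat > "$rules" <<'EOF'
:input ACCEPT
:forward ACCEPT
:output ACCEPT
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
EOF

awk '
    # On the first REJECT rule only, print the new rules ahead of it.
    /-j REJECT$/ && !inserted {
        print "-A input -p tcp -s 0/0 -d 0/0 513 -b -j ACCEPT"
        print "-A input -p tcp -s 0/0 -d 0/0 514 -b -j ACCEPT"
        inserted = 1
    }
    { print }
' "$rules" > "$rules.new"
mv "$rules.new" "$rules"

cat "$rules"
```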

    Enabling ssh

    Enabling ssh is somewhat easier. First the ssh server, sshd, must be installed. This is part of the openssh-server RPM. This RPM is located on the first disc of the RedHat 7.2 distribution. Once the server is installed, it must be started:
    /etc/rc.d/init.d/sshd start
    
    The service will be automatically started on reboot. At this point ssh on the localhost should work, although a password will still be required. However, our firewall rules will be preventing connections from other machines. We again modify /etc/sysconfig/ipchains, this time to allow ssh traffic in and out. See the above section for a discussion of what we are doing here.
    # Firewall configuration written by lokkit
    # Manual customization of this file is not recommended.
    # Note: ifup-post will punch the current nameservers through the
    #       firewall; such entries will *not* be listed here.
    :input ACCEPT
    :forward ACCEPT
    :output ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
    -A input -s 0/0 -d 0/0 -i lo -j ACCEPT
    #
    # New rules for ssh traffic, incoming or outgoing
    # 
    -A input -p tcp -s 0/0 -d 0/0 22 -b -j ACCEPT
    #
    # End of new rules
    #
    -A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
    -A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
    -A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
    
    At this point users on remote systems should be able to ssh into the machine, but they will still need a password. Users should set up a private/public authentication key pair in order for ssh to operate without passwords. This process is documented in the installation guide, but a summary of the steps for RH7.2 is included here. First run "ssh-keygen -t rsa" to create the private/public key pair. By default this will create the files ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. Use a passphrase. Next place the public key (~/.ssh/id_rsa.pub) in the file ~/.ssh/authorized_keys. If more than one machine is going to be used, then this key must be put in the ~/.ssh/authorized_keys file on each machine. The permissions on the .ssh directory should be set to 700; otherwise sshd may choose not to accept the keys. This will allow you to connect using RSA keys rather than simple UNIX passwords.

    The next step is to enable an SSH agent so that you do not need to repeatedly type your passphrase. The agent is started with "ssh-agent <cmd>". Typically <cmd> is $SHELL, so that your default shell is started. The agent will then handle authentication on your behalf any time you attempt to use ssh from this shell. To give the ssh-agent your passphrase, type "ssh-add". This will query you for the passphrase that accompanies your RSA key. Once you have completed this, you will be able to ssh to other systems on which your key is authorized without typing a password.
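
The key-installation step can be sketched as a small shell helper. The function authorize_key is hypothetical (it is not an OpenSSH or MPICH command), the key used in the demonstration is a placeholder rather than a real key, and the 600 mode on authorized_keys is a common convention assumed here rather than something stated above.

```shell
#!/bin/sh
# Sketch: install a public key into authorized_keys with strict permissions.
# authorize_key is a hypothetical helper, not an OpenSSH or MPICH command.
authorize_key() {
    pubkey_file=$1   # e.g. ~/.ssh/id_rsa.pub, created by: ssh-keygen -t rsa
    ssh_dir=$2       # e.g. ~/.ssh on the machine you want to log in to

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"                   # sshd may ignore keys otherwise
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"   # assumption: common convention

    # Append the key only if that exact line is not already present.
    if ! grep -qxF -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
}

# Demonstration in a scratch directory with a placeholder (not a real) key.
demo=$(mktemp -d)
printf 'ssh-rsa PLACEHOLDERKEY doe@a.our.org\n' > "$demo/id_rsa.pub"
authorize_key "$demo/id_rsa.pub" "$demo/.ssh"
authorize_key "$demo/id_rsa.pub" "$demo/.ssh"   # second call adds nothing
wc -l < "$demo/.ssh/authorized_keys"
```

Checking for an existing entry before appending keeps the file from accumulating duplicates when the helper is run once per machine from a script.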

    Interprocess communication

    MPICH processes use the standard UNIX mechanisms for allocating ports for intercommunication. Using this mechanism processes are given ports in the range of 1024-65535. Unfortunately for us, the default firewall configuration blocks some port ranges that our MPICH processes might be given to use for communication. This leads to a situation where MPICH applications will occasionally fail to communicate (when they happen to get the wrong port value). We're going to modify the ipchains configuration file to remove lines disabling ranges of ports that our processes might use for intercommunication. The two default rules of interest are the following:
    -A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
    
    The first blocks incoming TCP connections to ports 6000-6009 (often used by X), while the second blocks incoming TCP connections to port 7100 (often used by the X font server). We simply remove these rules:
    # Firewall configuration written by lokkit
    # Manual customization of this file is not recommended.
    # Note: ifup-post will punch the current nameservers through the
    #       firewall; such entries will *not* be listed here.
    :input ACCEPT
    :forward ACCEPT
    :output ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth0 -j ACCEPT
    -A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
    -A input -s 0/0 -d 0/0 -i lo -j ACCEPT
    -A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
    -A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
    -A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
    -A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
    #
    # Removed these rules to eliminate chance of MPICH comm. failure
    #
    # -A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
    # -A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
    #
    # End of removed rules
    #
    
    This modification, in conjunction with one to allow process startup, should prepare your system for MPICH jobs.
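
Commenting out the two rules, rather than deleting them, keeps a record of what was changed. A minimal sketch of that edit with sed follows; it acts on an abbreviated scratch copy (the variable rules is introduced here) instead of /etc/sysconfig/ipchains.

```shell
#!/bin/sh
# Sketch: comment out the REJECT rules for ports 6000:6009 and 7100
# in a scratch copy of the ipchains configuration.
rules=$(mktemp)
cat > "$rules" <<'EOF'
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
EOF

# Only lines that still start with "-A input" are touched, so running the
# command a second time does not add a second "# " prefix.
sed -e '/^-A input .* 6000:6009 .*-j REJECT$/ s/^/# /' \
    -e '/^-A input .* 7100 .*-j REJECT$/ s/^/# /' \
    "$rules" > "$rules.new"
mv "$rules.new" "$rules"

cat "$rules"
```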

    poll: protocol failure during circuit creation

    You may see this message if you attempt to run too many MPI programs in a short period of time. For example, in Linux and when using the ch_p4 device (without the secure server or ssh), MPICH uses rsh to start the MPI processes. Depending on the particular Linux distribution and version, there may be a limit of as few as 40 processes per minute. When running the MPICH test suite or starting short parallel jobs from a script, it is possible to exceed this limit.

    To fix this, you can do one of the following:

    1. Wait a few seconds between running parallel jobs. You may need to wait up to a minute.
    2. Modify /etc/inetd.conf to allow more processes per minute for rsh. For example, change
      shell	stream	tcp	nowait	root	/etc/tcpd2	in.rshd 
      
      to
      shell	stream	tcp	nowait.200	root	/etc/tcpd2	in.rshd 
      

    3. Use the ch_p4mpd device or the secure server option of the ch_p4 device instead. Neither of these relies on inetd.
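
The inetd.conf change in option 2 can be sketched with sed. This example edits a scratch file (the variable conf is introduced here), and the second, "login" entry is made up purely to show that only the shell line is altered. On a real system the file is /etc/inetd.conf, editing it requires root, and inetd typically has to be told to reread its configuration afterwards.

```shell
#!/bin/sh
# Sketch: raise the per-minute limit for the rsh ("shell") service by
# changing "nowait" to "nowait.200" on that line only.
conf=$(mktemp)
printf 'shell\tstream\ttcp\tnowait\troot\t/etc/tcpd2\tin.rshd\n'     > "$conf"
printf 'login\tstream\ttcp\tnowait\troot\t/etc/tcpd2\tin.rlogind\n' >> "$conf"

# Match "nowait" only when it is a whole field, so a line that already says
# "nowait.200" is left alone if the command is run again.
sed '/^shell[[:space:]]/ s/\([[:space:]]\)nowait\([[:space:]]\)/\1nowait.200\2/' \
    "$conf" > "$conf.new"
mv "$conf.new" "$conf"

cat "$conf"
```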

    Using SSH

    The secure shell (ssh) may be used with the ch_p4 device, but requires careful setup. See configuring with ssh in the Installation and User's manual.

    Make sure that ssh is set up to not require a password. The command

     ssh -n `hostname` date
    
    should return the date without any prompts for passwords. See the installation manual if you have problems.

    Mac OS X and hostname

    Under Mac OS X, the hostname handling is unusual. The hostname that the hostname command or the gethostname function returns is determined by the name set in the "Sharing" preference pane. If this name contains a space, you may get only the leading part of the name, up to but not including the first space. This name is stored in the file /etc/hostconfig. MPICH, for the ch_p4 and ch_p4mpd devices, needs a sensible name returned by both gethostname and hostname. If you are using a single machine in a standalone configuration, the easiest option is to change the name of the machine in /etc/hostconfig to localhost (use sudo). If you wish to use another name, you need to make the name in /etc/hostconfig match the name in the NetInfo database.

    Thanks to Victor Eijkhout for this information.

    SIGSEGV

    If your program fails with
    p4_error: interrupt SIGSEGV
    
    the problem is probably not with MPI. Instead, check for program bugs including
    1. Array overwrites or accesses beyond array bounds. Be particularly careful of a[size] in C, where a is declared as int a[size].
    2. Invalid pointers, including null pointers.
    3. Missing or mismatched parameters to subroutines or functions. Fortran users should check that all MPI calls include the integer error return parameter and that any status variable is dimensioned as an array of size MPI_STATUS_SIZE.

    semop lock failed

    When running the ch_p4 device with SMP support (-comm=shared), you may occasionally see the message
    "p1_13043:  p4_error: OOPS: semop lock failed"
    
    To fix this, try running the script cleanipcs that is included with MPICH. You can also use the command ipcs to list the shared memory and semaphore resources that are in use on a node. This can help you track down resources that are held by a different user that are preventing your MPI program from running.
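
As a rough aid for reading ipcs output, the following sketch filters a semaphore listing by owner. The function owned_semaphores is hypothetical, and it assumes the column layout printed by ipcs -s on Linux (key, semid, owner, perms, nsems); other systems format the listing differently. The sample listing is made up for the demonstration.

```shell
#!/bin/sh
# Sketch: print the ids of semaphore sets owned by a given user.
# Reads "ipcs -s" style output on stdin; assumes the Linux column order
# key, semid, owner, perms, nsems (an assumption, check your system).
owned_semaphores() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Demonstration on a made-up listing instead of the live system.
sample_ids=$(owned_semaphores doe <<'EOF'
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 32768      doe        600        1
0x00000000 65537      root       600        1
0x00000000 98306      doe        600        1
EOF
)
echo "$sample_ids"
```

On a live node the same filter would be used as: ipcs -s | owned_semaphores "$(id -un)". Entries owned by someone else are the ones to raise with that user or the system administrator.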

    Compiler Switches

    Normally, you should let configure determine compiler switches. However, you can use the configure options -cflags=... and -fflags=... to specify special flags. See also compiler switches.

    C++ Builds Fail

    If the C++ build fails with messages about ambiguities in the definitions, try reconfiguring and rebuilding with the --without-profiling option to the C++ configure. In later versions of MPICH, a different C++ interface will be provided that should fix this problem.

    Fortran programs give errors about mismatched types

    Some Fortran compilers will complain if a subroutine is passed arguments of different types in the same parameter position. For example, if MPI_Bcast is used with an integer in one place and a double precision value in another, you may see
     Argument #1 of `mpi_bcast' is one type at (2) but is some other type at (1) 
    
    This is a strict interpretation of the Fortran standard. To fix this, you will need to tell the compiler to allow this usage. For the GNU g77 compiler, add the command-line option -Wno-globals, as in
     mpif77 -Wno-globals mycode.f
    
    An alternative is to use a Fortran 90 or Fortran 95 compiler with the MPI module instead of the mpif.h header file.

    Missing Symbols When Linking

    If you get errors about missing symbols, such as
    mpicc  -o overtake overtake.o test.o 
    ld: 0711-317 ERROR: Undefined symbol: MPIR_F_TRUE
    ld: 0711-317 ERROR: Undefined symbol: .MPIR_InitFortranDatatypes
    ld: 0711-317 ERROR: Undefined symbol: MPIR_F_FALSE
    ld: 0711-317 ERROR: Undefined symbol: .MPIR_InitFortran
    ld: 0711-317 ERROR: Undefined symbol: MPIR_I_DCOMPLEX
    ld: 0711-317 ERROR: Undefined symbol: .MPIR_Free_Fortran_dtes
    ld: 0711-317 ERROR: Undefined symbol: .MPIR_Free_Fortran_keyvals
    
    this usually indicates an error in the make process. For some reason, the Fortran part (which is where these symbols come from) is particularly fragile. To fix this, try the following steps:
        cd src/fortran/src
        make clean
        make
        ar r ../../../lib/libmpich.a *.o
        ranlib ../../../lib/libmpich.a
    
    If weak symbols are not supported, then in addition, do these additional steps:
        make clean
        make profile
        ar r ../../../lib/libpmpich.a *.o
        ranlib ../../../lib/libpmpich.a
    
    The problem is that some versions of make have logic errors that cause them to create files but not to act on them; this causes make to build the object file but then fail to include it in the archive. The above steps should work around this problem.

    Warning messages while building MPICH

    Some compilers may generate a large number of warnings of the form
         "commreq_free.c", line 70: warning #187: use of "=" where "==" may have been intended
    
    There is nothing wrong with these statements. The compiler is warning about a legal, but often misused, feature of the C language. The statements have been crafted so that most compilers recognize that the "=" was used intentionally; unfortunately, some compilers insist on warning about this valid use of C and provide no way to indicate to the compiler that the warning is unnecessary.

    MPMD (Multiple Program Multiple Data) Programs

    MPICH, depending on the device, supports MPMD programs. However, the mpirun script currently does not support MPMD programs. For the ch_p4 device, the user must create a procgroup file and invoke the program that will have rank zero in MPI_COMM_WORLD with the command-line option -p4pg filename. See the Installation and User's Guide for more information.
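
As an illustration only, the sketch below writes a procgroup file in the layout commonly shown for the p4 device: a first line "local 0", followed by one "host count /full/path/to/program" line per remote process. Treat that layout as an assumption and confirm it against the Installation and User's Guide. The helper write_procgroup and the program paths /home/me/master and /home/me/worker are hypothetical.

```shell
#!/bin/sh
# Sketch: generate a procgroup file for an MPMD run with the ch_p4 device.
# Usage: write_procgroup FILE PROGRAM HOST...   (hypothetical helper)
write_procgroup() {
    pg_out=$1
    pg_prog=$2
    shift 2
    {
        # Assumed format: the hand-started process is the local one, and
        # "local 0" requests no additional local processes.
        echo "local 0"
        for pg_host in "$@"; do
            echo "$pg_host 1 $pg_prog"
        done
    } > "$pg_out"
}

pgfile=$(mktemp)
write_procgroup "$pgfile" /home/me/worker a.our.org b.our.org
cat "$pgfile"
```

The rank-zero program would then be started by hand with something like /home/me/master -p4pg followed by the file name, as described above.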

    Reporting problems and support

    1. First, check the list of known bugs and patches for the problem you are seeing.
    2. Also check the troubleshooting guides in the Installation and Users guides.
    3. If that doesn't help, send mail to mpi-bugs@mcs.anl.gov.

    Algorithms used in MPICH

    1. Does MPICH use IP Multicast for MPI_Bcast?

      No. In principle, MPICH could use multicast, but in practice this would be very difficult. To start with, IP multicast is unreliable; additional code to make it reliable needs to be added. In fact, there is an effort to provide a reliable multicast, built on top of the unreliable multicast. The second problem is that not all systems allow user programs (or any program) to perform an IP multicast. In fact, that is the case for the systems that we have been developing on. Thus, we will always need the point-to-point version. There is a fairly easy way to replace any collective routine in MPI, but no one has offered us a multicast-based MPI_Bcast yet...

    Jumpshot and X11

    Jumpshot relies on the AWT/Swing toolkit to perform its graphics operations. We have noticed some problems with AWT/Swing and X11 window system support on some PCs running Windows. In particular, the combination of SecureCRT and Exceed, or of SecureCRT and XFree86, often has problems (sometimes serious, including hangs and crashes). For now, the only workaround is to use OpenSSH and XFree86 (with Cygwin) if possible when using a Windows PC.