Use ./configure --download-mumps --download-scalapack --download-parmetis --download-metis --download-ptscotch to have PETSc installed with MUMPS
Use ./configure --with-openmp --download-hwloc (or --with-hwloc) to enable running MUMPS in MPI+OpenMP hybrid mode and non-MUMPS in flat-MPI mode. See details below.
Use -pc_type cholesky or lu -pc_factor_mat_solver_type mumps to use this direct solver
|-mat_mumps_icntl_1||- ICNTL(1): output stream for error messages|
|-mat_mumps_icntl_2||- ICNTL(2): output stream for diagnostic printing, statistics, and warning|
|-mat_mumps_icntl_3||- ICNTL(3): output stream for global information, collected on the host|
|-mat_mumps_icntl_4||- ICNTL(4): level of printing (0 to 4)|
|-mat_mumps_icntl_6||- ICNTL(6): permutes to a zero-free diagonal and/or scale the matrix (0 to 7)|
|-mat_mumps_icntl_7||- ICNTL(7): computes a symmetric permutation in sequential analysis (0 to 7). 1=PETSc (sequential only) 3=Scotch, 4=PORD, 5=Metis -pc_factor_mat_ordering_type <type> will automatically provide the ordering so setting this flag to 1 is not needed|
|-mat_mumps_icntl_8||- ICNTL(8): scaling strategy (-2 to 8 or 77)|
|-mat_mumps_icntl_10||- ICNTL(10): max num of refinements|
|-mat_mumps_icntl_11||- ICNTL(11): statistics related to an error analysis (via -ksp_view)|
|-mat_mumps_icntl_12||- ICNTL(12): an ordering strategy for symmetric matrices (0 to 3)|
|-mat_mumps_icntl_13||- ICNTL(13): parallelism of the root node (enable ScaLAPACK) and its splitting|
|-mat_mumps_icntl_14||- ICNTL(14): percentage increase in the estimated working space|
|-mat_mumps_icntl_19||- ICNTL(19): computes the Schur complement|
|-mat_mumps_icntl_20||- ICNTL(20): give MUMPS centralized (0) or distributed (10) dense RHS|
|-mat_mumps_icntl_22||- ICNTL(22): in-core/out-of-core factorization and solve (0 or 1)|
|-mat_mumps_icntl_23||- ICNTL(23): max size of the working memory (MB) that can allocate per processor|
|-mat_mumps_icntl_24||- ICNTL(24): detection of null pivot rows (0 or 1)|
|-mat_mumps_icntl_25||- ICNTL(25): compute a solution of a deficient matrix and a null space basis|
|-mat_mumps_icntl_26||- ICNTL(26): drives the solution phase if a Schur complement matrix|
|-mat_mumps_icntl_28||- ICNTL(28): use 1 for sequential analysis and ictnl(7) ordering, or 2 for parallel analysis and ictnl(29) ordering|
|-mat_mumps_icntl_29||- ICNTL(29): parallel ordering 1 = ptscotch, 2 = parmetis|
|-mat_mumps_icntl_30||- ICNTL(30): compute user-specified set of entries in inv(A)|
|-mat_mumps_icntl_31||- ICNTL(31): indicates which factors may be discarded during factorization|
|-mat_mumps_icntl_33||- ICNTL(33): compute determinant|
|-mat_mumps_icntl_35||- ICNTL(35): level of activation of BLR (Block Low-Rank) feature|
|-mat_mumps_icntl_36||- ICNTL(36): controls the choice of BLR factorization variant|
|-mat_mumps_icntl_38||- ICNTL(38): sets the estimated compression rate of LU factors with BLR|
|-mat_mumps_cntl_1||- CNTL(1): relative pivoting threshold|
|-mat_mumps_cntl_2||- CNTL(2): stopping criterion of refinement|
|-mat_mumps_cntl_3||- CNTL(3): absolute pivoting threshold|
|-mat_mumps_cntl_4||- CNTL(4): value for static pivoting|
|-mat_mumps_cntl_5||- CNTL(5): fixation for null pivots|
|-mat_mumps_cntl_7||- CNTL(7): precision of the dropping parameter used during BLR factorization|
|-mat_mumps_use_omp_threads [m]||- run MUMPS in MPI+OpenMP hybrid mode as if omp_set_num_threads(m) is called before calling MUMPS. Default might be the number of cores per CPU package (socket) as reported by hwloc and suggested by the MUMPS manual.|
If run sequentially can use the PETSc provided ordering with the option -mat_mumps_icntl_7 1
When a MUMPS factorization fails inside a KSP solve, for example with a KSP_DIVERGED_PC_FAILED, one can find the MUMPS information about the failure by calling
MatMumpsGetInfog(mat,....); etc.Or you can run with -ksp_error_if_not_converged and the program will be stopped and the information printed in the error message.
Two modes to run MUMPS/PETSc with OpenMP
Set OMP_NUM_THREADS and run with fewer MPI ranks than cores. For example, if you want to have 16 OpenMP
threads per rank, then you may use "export OMP_NUM_THREADS=16 && mpirun -n 4 ./test".
-mat_mumps_use_omp_threads [m] and run your code with as many MPI ranks as the number of cores. For example,
if a compute node has 32 cores and you run on two nodes, you may use "mpirun -n 64 ./test -mat_mumps_use_omp_threads 16"
To run MUMPS in MPI+OpenMP hybrid mode (i.e., enable multithreading in MUMPS), but still run the non-MUMPS part (i.e., PETSc part) of your code in the so-called flat-MPI (aka pure-MPI) mode, you need to configure PETSc with --with-openmp --download-hwloc (or --with-hwloc), and have an MPI that supports MPI-3.0's process shared memory (which is usually available). Since MUMPS calls BLAS libraries, to really get performance, you should have multithreaded BLAS libraries such as Intel MKL, AMD ACML, Cray libSci or OpenBLAS (PETSc will automatically try to utilized a threaded BLAS if --with-openmp is provided).
If you run your code through a job submission system, there are caveats in MPI rank mapping. We use MPI_Comm_split_type() to obtain MPI processes on each compute node. Listing the processes in rank ascending order, we split processes on a node into consecutive groups of size m and create a communicator called omp_comm for each group. Rank 0 in an omp_comm is called the master rank, and others in the omp_comm are called slave ranks (or slaves). Only master ranks are seen to MUMPS and slaves are not. We will free CPUs assigned to slaves (might be set by CPU binding policies in job scripts) and make the CPUs available to the master so that OMP threads spawned by MUMPS can run on the CPUs. In a multi-socket compute node, MPI rank mapping is an issue. Still use the above example and suppose your compute node has two sockets, if you interleave MPI ranks on the two sockets, in other words, even ranks are placed on socket 0, and odd ranks are on socket 1, and bind MPI ranks to cores, then with -mat_mumps_use_omp_threads 16, a master rank (and threads it spawns) will use half cores in socket 0, and half cores in socket 1, that definitely hurts locality. On the other hand, if you map MPI ranks consecutively on the two sockets, then the problem will not happen. Therefore, when you use -mat_mumps_use_omp_threads, you need to keep an eye on your MPI rank mapping and CPU binding. For example, with the Slurm job scheduler, one can use srun --cpu-bind=verbose -m block:block to map consecutive MPI ranks to sockets and examine the mapping result.
PETSc does not control thread binding in MUMPS. So to get best performance, one still has to set OMP_PROC_BIND and OMP_PLACES in job scripts, for example, export OMP_PLACES=threads and export OMP_PROC_BIND=spread. One does not need to export OMP_NUM_THREADS=m in job scripts as PETSc calls omp_set_num_threads(m) internally before calling MUMPS.
|1.||- Heroux, Michael A., R. Brightwell, and Michael M. Wolf. "Bi-modal MPI and MPI+ threads computing on scalable multicore systems." IJHPCA (Submitted) (2011).|
|2.||- Gutierrez, Samuel K., et al. "Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications." Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017.|