
Job fails while using srun or mpirun in Slurm

I am trying to submit a job with Slurm, but the job fails if I use srun or mpirun. It runs fine with mpiexec, albeit with only a single process despite multiple nodes and multiple cores being allocated.

The actual command used is:

srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default

The following is the error I get with srun/mpirun:

[mpiexec@n1581] match_arg (utils/args/args.c:163): unrecognized argument pmi_args
[mpiexec@n1581] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@n1581] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@n1581] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments

The code compiles fine, but I am facing issues when running it through Slurm. Any help on this is appreciated.

Edit: Here is the output of which mpirun, which mpiexec, and ldd on the executable:

/nfs/apps/MPI/openmpi/3.1.3/gnu/6.5.0/cuda/9.0/bin/mpirun
/nfs/apps/ParaView/5.8/binary/bin/mpiexec
        linux-vdso.so.1 =>  (0x00007fff78255000)
        libmpi.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/release_mt/libmpi.so.12 (0x00002ae6cb57d000)
        libz.so.1 => /nfs/apps/Libraries/zlib/1.2.11/system/lib/libz.so.1 (0x00002ae6cbd4c000)
        libmpifort.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/libmpifort.so.12 (0x00002ae6cbf67000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ae6cc315000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ae6cc519000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae6cc721000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ae6cc93e000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ae6ccc40000)
        libgcc_s.so.1 => /nfs/apps/Compilers/GNU/6.5.0/lib64/libgcc_s.so.1 (0x00002ae6cd003000)
        /lib64/ld-linux-x86-64.so.2 (0x0000558ea723a000)

Here is my job script.

The root cause is a mix of several MPI implementations that do not interoperate (see the version check after this list):

  • mpirun is from Open MPI
  • mpiexec is most likely the MPICH launcher bundled with ParaView
  • your app is built with Intel MPI.
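
A quick way to confirm the mix is to ask each launcher for its version string and check which MPI library the binary is linked against. This is a diagnostic sketch; the paths are taken from the output above:

# Each launcher reports the MPI implementation it ships with
/nfs/apps/MPI/openmpi/3.1.3/gnu/6.5.0/cuda/9.0/bin/mpirun --version    # expect Open MPI
/nfs/apps/ParaView/5.8/binary/bin/mpiexec --version                    # expect MPICH/Hydra

# The linked MPI library shows which implementation the app was built with
ldd /nfs/home/6/sanjeevis/dns/lb3d/src/lbe | grep libmpi               # expect Intel MPI paths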

Try using /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin/mpirun (or /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64/mpirun) instead, so that the launcher matches your MPI library.
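
Put together, the job script would look something like the sketch below. It assumes the standard Intel MPI layout under the installation path above; the node and task counts are placeholders, and whether bin or bin64 is correct depends on which directory exists on your system:

#!/bin/bash
#SBATCH --nodes=2              # placeholder values; adjust to your allocation
#SBATCH --ntasks-per-node=16

# Point PATH and LD_LIBRARY_PATH at the Intel MPI installation the app was linked against
IMPI=/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210
export PATH=$IMPI/bin64:$PATH
export LD_LIBRARY_PATH=$IMPI/intel64/lib/release_mt:$IMPI/intel64/lib:$LD_LIBRARY_PATH

mpirun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default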

If you want to use srun with Intel MPI, an extra step is required. You first need to

export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
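
The exact location of libpmi.so varies by site, so ask your administrators or search for it yourself. The paths below are examples only, not the actual location on your cluster:

# Locate Slurm's PMI library; /usr/lib64/libpmi.so is a common location, but not guaranteed
find /usr/lib64 /opt/slurm -name 'libpmi.so*' 2>/dev/null

# Example path only; substitute whatever the search above finds
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default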

The most likely problem is that the program is compiled with one MPI implementation and launched with another. Make sure that all MPI-related environment variables are set consistently: OPAL_PREFIX, MPI_ROOT, PATH, and LD_LIBRARY_PATH.
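
One way to keep these consistent with Intel MPI is to source its environment script before launching; this is a sketch assuming the standard Intel MPI directory layout under the installation shown in the ldd output:

# Sets PATH, LD_LIBRARY_PATH, etc. for this Intel MPI installation
source /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/bin/mpivars.sh

# Sanity check: the launcher and the linked library should now come from the same tree
which mpirun
ldd /nfs/home/6/sanjeevis/dns/lb3d/src/lbe | grep libmpi.so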
