Job fails while using srun or mpirun in slurm

I am trying to submit a job with Slurm. However, the job fails if I use srun or mpirun. It runs fine with mpiexec, but only as a single process, even though multiple nodes and multiple cores are allocated.

The actual command used is:

srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default

This is the error I get with srun/mpirun:

[mpiexec@n1581] match_arg (utils/args/args.c:163): unrecognized argument pmi_args
[mpiexec@n1581] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@n1581] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@n1581] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments

The code compiles fine, but I run into problems when launching it through Slurm. Any help is appreciated.

Edit: Here is the output of which mpirun, which mpiexec, and ldd on the executable:

/nfs/apps/MPI/openmpi/3.1.3/gnu/6.5.0/cuda/9.0/bin/mpirun
/nfs/apps/ParaView/5.8/binary/bin/mpiexec
        linux-vdso.so.1 =>  (0x00007fff78255000)
        libmpi.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/release_mt/libmpi.so.12 (0x00002ae6cb57d000)
        libz.so.1 => /nfs/apps/Libraries/zlib/1.2.11/system/lib/libz.so.1 (0x00002ae6cbd4c000)
        libmpifort.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/libmpifort.so.12 (0x00002ae6cbf67000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ae6cc315000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ae6cc519000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae6cc721000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ae6cc93e000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ae6ccc40000)
        libgcc_s.so.1 => /nfs/apps/Compilers/GNU/6.5.0/lib64/libgcc_s.so.1 (0x00002ae6cd003000)
        /lib64/ld-linux-x86-64.so.2 (0x0000558ea723a000)

Here is my job script.

The root cause is a mix of several MPI implementations that do not interoperate:

  • mpirun is from Open MPI
  • mpiexec is likely the built-in MPICH from ParaView
  • your app is built with Intel MPI.
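
The mismatch listed above can be checked with a small script. This is only a rough sketch: the function name guess_mpi_flavor is made up for illustration, and it guesses the implementation purely from directory names in a path, which is an assumption that happens to hold for the paths in this question.

```shell
#!/usr/bin/env bash
# Heuristic: guess the MPI implementation from directory names in a path.
# The patterns are assumptions based on the paths shown in this question.
guess_mpi_flavor() {
  case "$1" in
    *openmpi*) echo "Open MPI" ;;
    */impi/*)  echo "Intel MPI" ;;
    *mpich*)   echo "MPICH" ;;
    *)         echo "unknown" ;;
  esac
}

# Paths copied from the question (output of `which mpirun` and `ldd`).
launcher="/nfs/apps/MPI/openmpi/3.1.3/gnu/6.5.0/cuda/9.0/bin/mpirun"
library="/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/release_mt/libmpi.so.12"

launcher_flavor=$(guess_mpi_flavor "$launcher")
library_flavor=$(guess_mpi_flavor "$library")

echo "launcher: $launcher_flavor"
echo "library:  $library_flavor"
if [ "$launcher_flavor" != "$library_flavor" ]; then
  echo "mismatch: the launcher and the linked MPI library differ"
fi
```

On a live system you could fill the two variables from the environment instead, for example launcher=$(command -v mpirun) and library=$(ldd /path/to/lbe | awk '/libmpi\.so/ {print $3; exit}').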

Try using /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin/mpirun (or /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64/mpirun) instead, so that the launcher matches your MPI library.

If you want to use srun with Intel MPI, an extra step is required. You first need to

export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
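
The path in that export is a placeholder, so you still have to find where Slurm's PMI library lives on your cluster. A minimal sketch of one way to do that follows. The helper name find_pmi_library and the candidate directories are assumptions, not something known about this system, so adjust them or ask your site admins.

```shell
#!/usr/bin/env bash
# Print the first libpmi.so found in the given directories; return 1 if none has it.
find_pmi_library() {
  local dir
  for dir in "$@"; do
    if [ -e "$dir/libpmi.so" ]; then
      echo "$dir/libpmi.so"
      return 0
    fi
  done
  return 1
}

# The candidate directories below are guesses, not known locations on this cluster.
if pmi_lib=$(find_pmi_library /usr/lib64 /usr/lib /usr/local/lib); then
  export I_MPI_PMI_LIBRARY="$pmi_lib"
  echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
else
  echo "libpmi.so not found in the candidate directories" >&2
fi
```

With the variable exported in the job script, the original srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default line can then be tried again.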

The most likely problem is that the program is compiled with one MPI implementation and launched with another. Make sure that all MPI environment variables are set correctly: OPAL_PREFIX, MPI_ROOT, PATH, and LD_LIBRARY_PATH.
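
To see what those variables currently contain inside the job, something like the following sketch can be put at the top of the job script. The helper mpi_entries is a made-up name, and filtering on the substring "mpi" is only a heuristic for spotting MPI-related directories.

```shell
#!/usr/bin/env bash
# List the entries of a colon-separated path list that mention "mpi" (case-insensitive).
mpi_entries() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -i 'mpi' || true
}

# Print the two single-valued variables, or "<unset>" if they are not defined.
for var in OPAL_PREFIX MPI_ROOT; do
  echo "$var=$(printenv "$var" || echo '<unset>')"
done

echo "PATH entries mentioning mpi:"
mpi_entries "$PATH"
echo "LD_LIBRARY_PATH entries mentioning mpi:"
mpi_entries "${LD_LIBRARY_PATH:-}"
```

If the entries printed for PATH and LD_LIBRARY_PATH point at different MPI installations, that is a sign the launcher and the library do not match.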
