Job fails while using srun or mpirun in Slurm
I am trying to submit a job with Slurm. However, the job fails if I use srun or mpirun, whereas it runs fine with mpiexec, albeit with only a single process despite multiple nodes and multiple cores being allocated.

The actual command used is:
srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default
Following is the error I get with srun/mpirun:
[mpiexec@n1581] match_arg (utils/args/args.c:163): unrecognized argument pmi_args
[mpiexec@n1581] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@n1581] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@n1581] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
The code compiles fine, but I am facing issues when running through Slurm. Any help on this is appreciated.
Edit: Here are the outputs of which mpirun, which mpiexec, and ldd of the executable:
/nfs/apps/MPI/openmpi/3.1.3/gnu/6.5.0/cuda/9.0/bin/mpirun
/nfs/apps/ParaView/5.8/binary/bin/mpiexec
linux-vdso.so.1 => (0x00007fff78255000)
libmpi.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/release_mt/libmpi.so.12 (0x00002ae6cb57d000)
libz.so.1 => /nfs/apps/Libraries/zlib/1.2.11/system/lib/libz.so.1 (0x00002ae6cbd4c000)
libmpifort.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/libmpifort.so.12 (0x00002ae6cbf67000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ae6cc315000)
librt.so.1 => /lib64/librt.so.1 (0x00002ae6cc519000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae6cc721000)
libm.so.6 => /lib64/libm.so.6 (0x00002ae6cc93e000)
libc.so.6 => /lib64/libc.so.6 (0x00002ae6ccc40000)
libgcc_s.so.1 => /nfs/apps/Compilers/GNU/6.5.0/lib64/libgcc_s.so.1 (0x00002ae6cd003000)
/lib64/ld-linux-x86-64.so.2 (0x0000558ea723a000)
Here is my job script.
The root cause is the mix of several MPI implementations that do not interoperate:

- mpirun is from Open MPI
- mpiexec is likely the builtin MPICH from ParaView
- the executable itself is linked against Intel MPI (per the ldd output above)

Try using /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin/mpirun (or /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64/mpirun) instead, so the launcher will match your MPI library.
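To confirm a launcher/library mismatch like this one, it can help to classify each component by MPI family. The helper below is not from the original post; `mpi_family` is a hypothetical name, and the path in the demo call is taken from the ldd output in the question:

```shell
#!/bin/sh
# Sketch: classify an MPI version string (or an ldd line) by implementation
# family, so you can check that the launcher matches the library the
# executable is linked against.
mpi_family() {
  case "$1" in
    *"Open MPI"*|*openmpi*)   echo openmpi ;;
    *"Intel(R) MPI"*|*impi*)  echo intelmpi ;;
    *MPICH*|*mpich*)          echo mpich ;;
    *)                        echo unknown ;;
  esac
}

# Typical usage on the cluster would be:
#   mpi_family "$(mpirun --version | head -n1)"
#   mpi_family "$(ldd /nfs/home/6/sanjeevis/dns/lb3d/src/lbe | grep libmpi)"
mpi_family "libmpi.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/release_mt/libmpi.so.12"
# → intelmpi
```

If the launcher and the library report different families, the job will fail or fall back to singleton (one-process) runs, which matches the mpiexec behavior described in the question.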
If you want to use srun with Intel MPI, an extra step is required. You first need to:
export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
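Putting it together, a minimal batch script might look like the sketch below. The node/task counts and the libpmi.so path are assumptions for illustration; the actual PMI library location is site-specific:

```shell
#!/bin/bash
#SBATCH --job-name=lb3d
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# Point Intel MPI at Slurm's PMI library so srun can bootstrap the ranks.
# The path below is an assumption -- check where libpmi.so lives on your site.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default
```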
The most likely problem is that the program is compiled with one MPI implementation and launched with another. Make sure that all MPI environment variables are set consistently: OPAL_PREFIX, MPI_ROOT, PATH, and LD_LIBRARY_PATH.
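For example, to point everything at the Intel MPI installation shown in the ldd output, the environment could be set along these lines (a sketch; the install prefix is taken from the question, the subdirectory layout is an assumption):

```shell
# Make compiler, launcher, and runtime all come from the same Intel MPI install.
export MPI_ROOT=/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210
export PATH="$MPI_ROOT/intel64/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_ROOT/intel64/lib:$LD_LIBRARY_PATH"

# OPAL_PREFIX is only meaningful for Open MPI; leave it unset when using Intel MPI.
unset OPAL_PREFIX
```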