Error when trying to use MPI with emcee on a Slurm cluster
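Hi, and thanks for your help!
I'm trying to get emcee to run with MPI on a Slurm cluster, but when I launch my code it fails after a few minutes with the large error shown below, which seems to be tied to an "Invalid communicator" error.
Do you know what I might be doing wrong?
I'm using Anaconda, so I tried reinstalling the environment, changing the packages used, and removing every package that might not be needed, but the error is always the same.
Here is the script I submit via sbatch: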
#!/bin/bash
#SBATCH --partition=largemem
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=40
#SBATCH --mem-per-cpu=4000
#SBATCH --mail-user=(my email)
#SBATCH --mail-type=ALL
#SBATCH --output=results/LastOpti.out
#SBATCH --error=results/LastOpti.err
#SBATCH --job-name=gal
source ~/anaconda3/etc/profile.d/conda.sh
conda activate EmceeMPI
cd ~/GalarioFitting
srun -n $SLURM_NTASKS python3 OptimizationGalarioMPI.py --nwalkers 560 --iterations 3000 --suffix _lasttest
conda deactivate
In my Python code, I use schwimmbad's MPIPool.
The error is this big block:
Traceback (most recent call last):
File "OptimizationGalarioMPI.py", line 303, in <module>
pos, prob, state = sampler.run_mcmc(pos, iterations, progress=True)
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/ensemble.py", line 346, in run_mcmc
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/ensemble.py", line 305, in sample
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/moves/red_blue.py", line 92, in propose
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/ensemble.py", line 389, in compute_log_prob
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/schwimmbad/mpi.py", line 168, in map
status=status)
File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
File "mpi4py/MPI/msgpickle.pxi", line 302, in mpi4py.MPI.PyMPI_recv
File "mpi4py/MPI/msgpickle.pxi", line 261, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: Invalid communicator, error stack:
PMPI_Mprobe(120): MPI_Mprobe(source=-2, tag=-1, comm=MPI_COMM_WORLD, message=0x7ffed877b790, status=0x7ffed877b7a0)
PMPI_Mprobe(85).: Invalid communicator
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "OptimizationGalarioMPI.py", line 303, in <module>
pos, prob, state = sampler.run_mcmc(pos, iterations, progress=True)
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/schwimmbad/pool.py", line 46, in __exit__
self.close()
File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/schwimmbad/mpi.py", line 188, in close
self.comm.send(None, worker, 0)
File "mpi4py/MPI/Comm.pyx", line 1156, in mpi4py.MPI.Comm.send
File "mpi4py/MPI/msgpickle.pxi", line 174, in mpi4py.MPI.PyMPI_send
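It may be that the MPI implementation shipped with Conda does not include Slurm support. If so, you should try launching your program with mpirun instead of srun. However, this error usually indicates that more than one MPI implementation is active at the same time. Make sure that no environment modules are loaded when you submit the job, and that no MPI-related OS packages are installed.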
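One way to check for the "several MPI implementations active at once" situation is to look at what the job actually sees on its PATH and which environment modules are loaded. The following is only a minimal sketch under stated assumptions: the function names (`find_launchers`, `loaded_modules`, `report`) are hypothetical helpers, the launcher list is illustrative, and it assumes the cluster's module system records loaded modules in the `LOADEDMODULES` environment variable, as Environment Modules and Lmod do. It only lists candidates; it does not prove which MPI library mpi4py is linked against.

```python
import os
from pathlib import Path

# Launcher names to look for; an illustrative list, not an exhaustive one.
LAUNCHERS = ("mpirun", "mpiexec", "srun")


def find_launchers(path_env, names=LAUNCHERS):
    """Return {name: [every matching executable on PATH, in search order]}."""
    found = {name: [] for name in names}
    seen = set()
    for entry in path_env.split(os.pathsep):
        # Skip empty entries and directories that appear on PATH twice.
        if not entry or entry in seen:
            continue
        seen.add(entry)
        for name in names:
            candidate = Path(entry) / name
            if candidate.is_file() and os.access(candidate, os.X_OK):
                found[name].append(str(candidate))
    return found


def loaded_modules(environ):
    """List modules from LOADEDMODULES (assumed colon-separated)."""
    raw = environ.get("LOADEDMODULES", "")
    return [m for m in raw.split(":") if m]


def report(environ):
    """Build a short text report of launchers and loaded modules."""
    lines = []
    for name, paths in find_launchers(environ.get("PATH", "")).items():
        flag = "  <-- more than one on PATH" if len(paths) > 1 else ""
        lines.append(f"{name}: {paths}{flag}")
    lines.append(f"loaded modules: {loaded_modules(environ)}")
    return lines


if __name__ == "__main__":
    for line in report(os.environ):
        print(line)
```

Running this inside the batch job, after `conda activate EmceeMPI`, shows the environment the job really gets rather than that of the login shell. If mpi4py imports cleanly, `MPI.Get_library_version()` (from `mpi4py.MPI`) and `mpi4py.get_config()` can additionally show which MPI library it is using and how it was built, which can then be compared with the launcher paths listed above.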