Error when trying to use MPI with emcee on a slurm cluster

Hi, and thank you for your help!

I'm trying to get emcee to run with MPI on a Slurm cluster, but when I launch my code it returns an error after a few minutes: a large traceback, reproduced below, that seems to be related to an "Invalid communicator" error.

Do you have any idea what I might be doing wrong?

I'm using Anaconda, so I've tried reinstalling the environment, changing the packages used, and removing every package that might not be needed, but the error is always the same.

Here is the script I submit via sbatch:

#!/bin/bash
#SBATCH --partition=largemem
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=40
#SBATCH --mem-per-cpu=4000
#SBATCH --mail-user=(my email)
#SBATCH --mail-type=ALL
#SBATCH --output=results/LastOpti.out
#SBATCH --error=results/LastOpti.err
#SBATCH --job-name=gal

source ~/anaconda3/etc/profile.d/conda.sh
conda activate EmceeMPI

cd ~/GalarioFitting

srun -n $SLURM_NTASKS python3 OptimizationGalarioMPI.py --nwalkers 560 --iterations 3000 --suffix _lasttest

conda deactivate

In my Python code I use MPIPool from schwimmbad.
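
For reference, the pool is wired into emcee in the usual schwimmbad pattern, roughly like the minimal sketch below (the log-probability, ndim and walker initialisation here are hypothetical placeholders, not my actual model):

import sys

import numpy as np
import emcee
from schwimmbad import MPIPool

def log_prob(theta):
    # placeholder log-probability (simple Gaussian), not the real model
    return -0.5 * np.sum(theta ** 2)

with MPIPool() as pool:
    if not pool.is_master():
        # worker ranks wait for tasks from the master process, then exit
        pool.wait()
        sys.exit(0)

    ndim, nwalkers = 5, 560
    pos = np.random.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
    sampler.run_mcmc(pos, 3000, progress=True)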

The error is this big block:

Traceback (most recent call last):
  File "OptimizationGalarioMPI.py", line 303, in <module>
    pos, prob, state = sampler.run_mcmc(pos, iterations, progress=True)
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/ensemble.py", line 346, in run_mcmc
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/ensemble.py", line 305, in sample
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/moves/red_blue.py", line 92, in propose
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/emcee-3.0rc2-py3.7.egg/emcee/ensemble.py", line 389, in compute_log_prob
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/schwimmbad/mpi.py", line 168, in map
    status=status)
  File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 302, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 261, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: Invalid communicator, error stack:
PMPI_Mprobe(120):  MPI_Mprobe(source=-2, tag=-1, comm=MPI_COMM_WORLD, message=0x7ffed877b790, status=0x7ffed877b7a0)
PMPI_Mprobe(85).: Invalid communicator

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "OptimizationGalarioMPI.py", line 303, in <module>
    pos, prob, state = sampler.run_mcmc(pos, iterations, progress=True)
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/schwimmbad/pool.py", line 46, in __exit__
    self.close()
  File "/home/mbenisty/anaconda3/envs/EmceeMPI/lib/python3.7/site-packages/schwimmbad/mpi.py", line 188, in close
    self.comm.send(None, worker, 0)
  File "mpi4py/MPI/Comm.pyx", line 1156, in mpi4py.MPI.Comm.send
  File "mpi4py/MPI/msgpickle.pxi", line 174, in mpi4py.MPI.PyMPI_send

It may be that the MPI implementation shipped with Conda does not include Slurm support. If so, you should try launching your program with mpirun instead of srun. However, this error usually indicates that more than one MPI implementation is active at the same time. Make sure that no environment modules are loaded when you submit the job, and that no MPI-related OS packages are installed.
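
For example, the launch part of the submission script could be adjusted roughly as follows (a sketch only, assuming mpirun resolves to the MPI installed in the Conda environment; whether it picks up the Slurm allocation correctly depends on that MPI build):

module purge    # drop any cluster-provided MPI modules that could conflict

source ~/anaconda3/etc/profile.d/conda.sh
conda activate EmceeMPI

cd ~/GalarioFitting

# use the MPI implementation's own launcher instead of srun
mpirun -n $SLURM_NTASKS python3 OptimizationGalarioMPI.py --nwalkers 560 --iterations 3000 --suffix _lasttest

conda deactivate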
