Why can't I run multiple instances of the same Python script simultaneously in SLURM?
I have been struggling to get multiple instances of a Python script to run on SLURM. On my login node I have installed python3.6, and I have a script "my_script.py" which takes a text file as input to read in run parameters. I can run this script on the login node using
python3.6 my_script.py input1.txt
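For context, the actual contents of my_script.py don't matter for this question; a minimal, hypothetical sketch of a script that reads key=value run parameters from such an input file might look like this (read_params and the key=value format are assumptions for illustration, not my real script):

```python
# Hypothetical sketch of a parameter-reading script like my_script.py
# (the real script is not shown in this question); it parses key=value lines.
import sys


def read_params(path):
    """Read key=value run parameters from a text file, skipping comments."""
    params = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    print(read_params(sys.argv[1]))
```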
Furthermore, I can submit a script submit.sh to run the job:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output1.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
python3.6 my_script.py input1.txt
This runs fine and executes as expected. However, if I submit the following script:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output2.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
python3.6 my_script.py input2.txt
while the first is running, I get the following error message in output2.txt:
/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not
found
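As a quick sanity check (a hedged sketch on my part, not something from the tutorial), the batch script could report what the compute node's shell actually resolves python3.6 to, since "command not found" means the node's PATH lookup failed:

```shell
# Sketch: report whether the node running this job can resolve the
# interpreter. "python3.6" is the interpreter name used in the scripts above.
PY=$(command -v python3.6 || true)
if [ -n "$PY" ]; then
    echo "python3.6 resolves to $PY on $(hostname)"
else
    echo "python3.6 not found on $(hostname)"
fi
```

Adding this above the `python3.6` line in submit.sh would make the per-node failure obvious in the output file.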
I found that I have this same issue when I try to submit a job as an array. For example, when I submit the following with sbatch:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=out_%j.txt
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
echo PWD $PWD
cd $SLURM_SUBMIT_DIR
python3.6 my_script.py input_$SLURM_ARRAY_TASK_ID.txt
I find that only out_1.txt shows that the job ran. All of the output files for tasks 2-10 show the same error message:
/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not
found
I am running all of these scripts on an HPC cluster that I created using the Compute Engine API on the Google Cloud Platform. I used the following tutorial to set up the SLURM cluster:
https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0
Why is SLURM unable to run multiple python3.6 jobs at the same time, and how can I get my array submission to work? I have spent days going through the SLURM FAQs and other Stack questions, but I have not found a way to resolve this issue, or a suitable explanation of what is causing it in the first place.
Thank you
I found out what I was doing wrong. I had created a cluster with two compute nodes, compute1 and compute2. At some point, while trying to get things to work, I had run the following commands on compute1:
# Install Python 3.6
sudo yum -y install python36
# Install python-setuptools which will bring in easy_install
sudo yum -y install python36-setuptools
# Install pip using easy_install
sudo easy_install-3.6 pip
from the following post:
How do I install python 3 on google cloud console?
This had installed python3.6 on compute1, which is why my jobs would run there. However, because I didn't think that script had run successfully, I never ran it on compute2, and therefore jobs sent to compute2 could not call python3.6. For some reason I had thought Slurm was using python3.6 from the login node, since I had sourced a path to it in my sbatch submission.
After installing python3.6 on compute2, I was then able to import all of my locally installed Python libraries, based on the following link, by including
import os
import sys

# make modules in the job's working directory importable
sys.path.append(os.getcwd())
at the beginning of my Python script.
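To illustrate why this works (a self-contained sketch, not part of my actual job — the temp directory and the throwaway module name localmod are made up for the demo), appending a directory to sys.path makes modules in that directory importable, just as appending os.getcwd() does for the job's working directory:

```python
# Sketch: write a throwaway module to a temp directory, append that
# directory to sys.path, and import it -- the same mechanism as
# sys.path.append(os.getcwd()) in the batch job.
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "localmod.py"), "w") as fh:
    fh.write("VALUE = 42\n")

sys.path.append(tmpdir)
import localmod

print(localmod.VALUE)  # → 42
```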
How to import a local python module when using the sbatch command in SLURM