Why can't I run multiple instances of the same Python script simultaneously in SLURM?
I have been struggling to get multiple instances of a Python script to run on SLURM. On my login node I have installed python3.6, and I have a script "my_script.py" which takes a text file as input to read in run parameters. I can run this script on the login node using
python3.6 my_script.py input1.txt
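For context, the actual contents of my_script.py don't matter for this question; a minimal, hypothetical sketch of a script that reads key=value run parameters from such an input file might look like this (read_params and the key=value format are assumptions for illustration, not my real script):

```python
# Hypothetical sketch of a parameter-reading script like my_script.py
# (the real script is not shown in this question); it parses key=value lines.
import sys


def read_params(path):
    """Read key=value run parameters from a text file, skipping comments."""
    params = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    print(read_params(sys.argv[1]))
```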
Furthermore, I can submit a script submit.sh to run the job:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output1.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
python3.6 my_script.py input1.txt
This runs fine and executes as expected. However, if I submit the following script:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output2.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
python3.6 my_script.py input2.txt
while the first is running, I get the following error message in output2.txt:
/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not
found
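As a quick sanity check (a hedged sketch on my part, not something from the tutorial), the batch script could report what the compute node's shell actually resolves python3.6 to, since "command not found" means the node's PATH lookup failed:

```shell
# Sketch: report whether the node running this job can resolve the
# interpreter. "python3.6" is the interpreter name used in the scripts above.
PY=$(command -v python3.6 || true)
if [ -n "$PY" ]; then
    echo "python3.6 resolves to $PY on $(hostname)"
else
    echo "python3.6 not found on $(hostname)"
fi
```

Adding this above the `python3.6` line in submit.sh would make the per-node failure obvious in the output file.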
I found that I have this same issue when I try to submit a job as an array. For example, when I submit the following with sbatch:
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=out_%j.txt
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
echo PWD $PWD
cd $SLURM_SUBMIT_DIR
python3.6 my_script.py input_$SLURM_ARRAY_TASK_ID.txt
I find that only out_1.txt shows that the job ran. All of the output files for tasks 2-10 show the same error message:
/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not
found
I am running all of these scripts on an HPC cluster that I created using the Compute Engine API on the Google Cloud Platform. I used the following tutorial to set up the SLURM cluster:
https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0
Why is SLURM unable to run multiple python3.6 jobs at the same time, and how can I get my array submission to work? I have spent days going through the SLURM FAQs and other Stack questions, but I have not found a way to resolve this issue, or a suitable explanation of what is causing it in the first place.
Thank you
I found out what I was doing wrong. I had created a cluster with two compute nodes, compute1 and compute2. At some point, while trying to get things to work, I had run the following commands on compute1:
# Install Python 3.6
sudo yum -y install python36
# Install python-setuptools which will bring in easy_install
sudo yum -y install python36-setuptools
# Install pip using easy_install
sudo easy_install-3.6 pip
from the following post:
How do I install python 3 on google cloud console?
This had installed python3.6 on compute1, which is why my jobs would run there. However, because I didn't think that script had run successfully, I never ran it on compute2, and therefore jobs sent to compute2 could not call python3.6. For some reason I had thought Slurm was using python3.6 from the login node, since I had sourced a path to it in my sbatch submission.
After installing python3.6 on compute2, I was then able to import all of my locally installed Python libraries, based on the following link, by including
import os
import sys

# make modules in the job's working directory importable
sys.path.append(os.getcwd())
at the beginning of my Python script.
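To illustrate why this works (a self-contained sketch, not part of my actual job — the temp directory and the throwaway module name localmod are made up for the demo), appending a directory to sys.path makes modules in that directory importable, just as appending os.getcwd() does for the job's working directory:

```python
# Sketch: write a throwaway module to a temp directory, append that
# directory to sys.path, and import it -- the same mechanism as
# sys.path.append(os.getcwd()) in the batch job.
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "localmod.py"), "w") as fh:
    fh.write("VALUE = 42\n")

sys.path.append(tmpdir)
import localmod

print(localmod.VALUE)  # → 42
```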
How to import a local python module when using the sbatch command in SLURM