
Why can't I run multiple instances of the same python script simultaneously in SLURM

I have been struggling to get multiple instances of a python script to run on SLURM. On my login node I have installed python3.6, and I have a python script "my_script.py" which takes a text file as input to read in run parameters. I can run this script on the login node using

python3.6 my_script.py input1.txt

Furthermore, I can submit a script submit.sh to run the job:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output1.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G

python3.6 my_script.py input1.txt

This runs fine and executes as expected. However, if I submit the following script:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=output2.txt
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G

python3.6 my_script.py input2.txt

while the first is running, I get the following error message in output2.txt:

/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found
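A quick way to narrow down errors like this (a diagnostic sketch, not part of the original submission scripts) is a tiny batch job that reports which node it landed on and whether python3.6 is on that node's PATH:

```shell
#!/bin/bash
#SBATCH --job-name=python_check
#SBATCH --output=python_check_%j.txt

# Report which compute node Slurm scheduled this job on
echo "Running on node: $(hostname)"

# command -v exits non-zero if python3.6 is not on this node's PATH
if command -v python3.6 >/dev/null 2>&1; then
    echo "python3.6 found at: $(command -v python3.6)"
else
    echo "python3.6 NOT found on this node"
fi
```

Submitting this a few times (or as an array) shows whether the failure correlates with particular nodes rather than with running jobs concurrently.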

I found that I have this same issue when I try to submit a job as an array. For example, when I submit the following with sbatch:

#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample 
#SBATCH --output=out_%j.txt
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#
#SBATCH --mem=2G
echo PWD $PWD
cd $SLURM_SUBMIT_DIR
python3.6 my_script.py input_$SLURM_ARRAY_TASK_ID.txt

I find that only out_1.txt shows that the job ran. All of the output files for tasks 2-10 show the same error message:

/var/spool/slurmd/job00130/slurm_script: line 9: python3.6: command not found

I am running all of these scripts on an HPC cluster that I created using the Compute Engine API in the Google Cloud Platform. I used the following tutorial to set up the SLURM cluster:

https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0

Why is SLURM unable to run multiple python3.6 jobs at the same time, and how can I get my array submission to work? I have spent days going through SLURM FAQs and other Stack Overflow questions, but I have not found a way to resolve this issue, or a suitable explanation of what's causing it in the first place.

Thank you

I found out what I was doing wrong. I had created a cluster with two compute nodes, compute1 and compute2. At some point, while I was trying to get things to work, I had submitted a job to compute1 with the following commands:

# Install Python 3.6
sudo yum -y install python36

# Install python-setuptools which will bring in easy_install
sudo yum -y install python36-setuptools

# Install pip using easy_install
sudo easy_install-3.6 pip

from the following post:

How do I install python 3 on google cloud console?

This had installed python3.6 on compute1, which is why my jobs would run on compute1. However, I didn't think this script had run successfully, so I never submitted it to compute2, and therefore the jobs sent to compute2 could not call python3.6. For some reason I thought Slurm was using python3.6 from the login node, since I had sourced a path to it in my sbatch submission.
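In hindsight, this kind of per-node mismatch can be checked directly by asking each node where, if anywhere, python3.6 lives. A sketch, assuming the node names compute1 and compute2 from above and that srun is available (guarded so it is a harmless no-op on a machine without Slurm):

```shell
# For each compute node, print the node's hostname and the path to
# python3.6, or MISSING if it is not installed there.
if command -v srun >/dev/null 2>&1; then
    for node in compute1 compute2; do
        srun --nodelist="$node" --ntasks=1 \
            bash -c 'echo "$(hostname): $(command -v python3.6 || echo MISSING)"'
    done
fi
```

Any node that prints MISSING needs the same install commands run on it before it can execute the python jobs.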

After installing python3.6 on compute2, I was then able to import all of my locally installed python libraries, based on the following link, by including

import sys
import os

sys.path.append(os.getcwd()) 

at the beginning of my python script.
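As a minimal, runnable illustration of that pattern (the module name in the comment is hypothetical):

```python
import os
import sys

# Append the job's working directory to the import search path, so local
# .py files are importable no matter which compute node the job lands on.
sys.path.append(os.getcwd())

# Now any module sitting next to the submitted script can be imported,
# e.g. `import my_helpers` (hypothetical local module name).
```

This only helps once the interpreter itself exists on the node; it does not substitute for installing python3.6 there.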

How to import a local python module when using the sbatch command in SLURM


 