
How to send a slurm job using many workers and not just running in local mode?

I want to run a python script with spark-submit on a Slurm cluster, using the commands srun and sbatch. When I run my current script it runs to the end and the final status is COMPLETED. However, looking at Spark's history server, I can see that all job IDs are named "local...". When I check the environment variables, "spark.master" is always set to local[*]. I have tried a lot of things and read a lot of documentation, but I could not find out how to use multiple workers.

Here is my config:

#SBATCH --time=00:05:00
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --mem=4G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1

module load spark/2.3.0
module load python/3.7

source ~/acc_env/bin/activate

export MKL_NUM_THREADS=1
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *95/100)))

#start master
start-master.sh
sleep 20


MASTER_URL_STRING=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)

IFS=' '
read -ra MASTER_URL <<< "$MASTER_URL_STRING"

echo "master url :" ${MASTER_URL}

NWORKERS=$((SLURM_NTASKS - 1))

And here are the commands I use to launch the workers and the script:

SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m 4g -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
slaves_pid=$!
srun -n 1 -N 1 spark-submit main.py --master ${MASTER_URL} --executor-memory 4g

I found the answer. I am posting it here in case someone has the same problem in the future. The problem was the order in which I put the arguments in the srun spark-submit command. You must put the entry point program (main.py here) after the options: spark-submit treats everything that comes after the application script as arguments for the application itself, so the --master and --executor-memory options were never seen by spark-submit and the driver fell back to local mode.
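
For reference, the corrected submission line would look roughly like this, with all spark-submit options placed before the application script (same variables as in the script above):

srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory 4g main.py

With the options in front of main.py, the driver should connect to the standalone master started by the script instead of running in local[*] mode.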
