SLURM sbatch script not running all srun commands in while loop

I'm trying to submit multiple jobs in parallel as a preprocessing step in sbatch using srun. The loop reads a file containing 40 file names and runs an srun command on each one. However, not all of the files are being sent off with srun, and the rest of the sbatch script continues after the ones that did get submitted finish. The real sbatch script is more complicated, and I can't use job arrays with it, so that won't work. This part should be pretty straightforward, though.

I made this simple test case as a sanity check, and it does the same thing. For every file name in the file list (40 of them) it creates a new file containing 'foo'. Every time I submit the script with sbatch, a different number of files gets sent off with srun.

#!/bin/sh
#SBATCH --job-name=loop
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
#SBATCH -A zheng_lab
#SBATCH -p exacloud
#SBATCH --error=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.err
#SBATCH --output=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.out

DIR=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel
SAMPLES=$DIR/samples.txt
OUT_DIR=$DIR/test_out
FOO_FILE=$DIR/foo.txt

# Create output directory
srun -N 1 -n 1 -c 1 mkdir $OUT_DIR

# How many files to run
num_files=$(srun -N 1 -n 1 -c 1 wc -l $SAMPLES)
echo "Number of input files: " $num_files

# Create a new file for every file in listing (run 5 at a time, 1 for each node)
while read F  ;
do
    fn="$(rev <<< "$F" | cut -d'/' -f 1 | rev)" # Remove path for writing output to new directory
    echo $fn
    srun -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
done <$SAMPLES
wait

# How many files actually got created
finished=$(srun -N 1 -n 1 -c 1 ls -lh $OUT_DIR/*out | wc -l)
echo "Number of files submitted: " $finished

Here is my output log file from the last time I tried to run it:

Number of input files:  40 /home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/samples.txt
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
Number of files submitted:  8

The issue is that srun redirects its stdin to the tasks it starts, and therefore the contents of $SAMPLES are consumed, in an unpredictable way, by all the cat commands that are started. Each background srun drains some portion of the file list before the loop's next read F can see it, which is why a different (and seemingly random) number of files gets processed on every submission.

Try with

srun --input none -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &

The --input none parameter will tell srun not to mess with stdin.
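
For completeness, here is the loop from the question with that fix applied (a minimal sketch; only the srun line changes):

while read F
do
    fn="$(rev <<< "$F" | cut -d'/' -f 1 | rev)" # Remove path for writing output to new directory
    # --input none keeps this srun from draining the loop's stdin ($SAMPLES)
    srun --input none -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
done <$SAMPLES
wait

An alternative that avoids the problem without any srun option is to keep the file list off stdin altogether, e.g. by reading it on a separate file descriptor in bash: while read -u 3 F; do ... ; done 3<$SAMPLES. Redirecting each srun's stdin from /dev/null (srun ... < /dev/null &) has the same effect.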
