

How to immediately submit all Snakemake jobs to slurm cluster

I'm using snakemake to build a variant calling pipeline that can be run on a SLURM cluster. The cluster has login nodes and compute nodes. Any real computing should be done on the compute nodes in the form of an srun or sbatch job. Jobs are limited to 48 hours of runtime. My problem is that processing many samples, especially when the queue is busy, will take more than 48 hours to run all the rules for every sample. The traditional cluster execution for snakemake leaves a master thread running that only submits a rule to the queue after all of that rule's dependencies have finished running. I'm supposed to run this master program on a compute node, so this limits the runtime of my entire pipeline to 48 hours.

I know SLURM jobs have dependency directives that tell a job to wait to run until other jobs have finished. Because the snakemake workflow is a DAG, is it possible to submit all the jobs at once, with each job's dependencies defined by the rule dependencies from the DAG? After all the jobs are submitted, the master thread would complete, circumventing the 48 hour limit. Is this possible with snakemake, and if so, how does it work? I've found the --immediate-submit command line option, but I'm not sure whether it has the behavior I'm looking for, or how to use it, because my cluster prints Submitted batch job [id] after a job is submitted to the queue, instead of just the job id.
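
For context, the kind of manual SLURM dependency chaining I have in mind looks roughly like this (the script names align.sh and call.sh are just placeholders):

# submit the first step and capture only its numeric job id
jid1=$(sbatch --parsable align.sh)

# submit the second step so it starts only after the first one finishes successfully
sbatch --dependency=afterok:$jid1 call.sh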

Immediate submit unfortunately does not work out of the box, but needs some tuning to work. This is because the way dependencies between jobs are passed along differs between cluster systems. A while ago I struggled with the same problem. As the immediate-submit docs say:

Immediately submit all jobs to the cluster instead of waiting for present input files. This will fail, unless you make the cluster aware of job dependencies, e.g. via: $ snakemake --cluster 'sbatch --dependency {dependencies}'. Assuming that your submit script (here sbatch) outputs the generated job id to the first stdout line, {dependencies} will be filled with space separated job ids this job depends on.
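
For example, submitting a script by hand prints a whole sentence on stdout rather than a bare id (the job number below is made up):

$ sbatch jobscript.sh
Submitted batch job 123456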

So the problem is that sbatch does not output the generated job id to the first stdout line. However, we can circumvent this with our own shell script:

parseJobID.sh:

#!/bin/bash
# helper script that parses slurm output for the job ID,
# and feeds it to back to snakemake/slurm for dependencies.
# This is required when you want to use the snakemake --immediate-submit option

if [[ "Submitted batch job" =~ "$@" ]]; then
  echo -n ""
else
  deplist=$(grep -Eo '[0-9]{1,10}' <<< "$@" | tr '\n' ',' | sed 's/.$//')
  echo -n "--dependency=aftercorr:$deplist"
fi;

And make sure to give the script execute permission with chmod +x parseJobID.sh.
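
You can check the script by hand: with no arguments it prints nothing, and when given sbatch-style output it extracts the ids and builds the dependency flag (the job ids below are made up):

$ ./parseJobID.sh
$ ./parseJobID.sh "Submitted batch job 123 Submitted batch job 456"
--dependency=aftercorr:123,456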

We can then call immediate submit like this:

snakemake --cluster 'sbatch $(./parseJobID.sh {dependencies})' --jobs 100 --notemp --immediate-submit

Note that this will submit at most 100 jobs at the same time. You can increase or decrease this to any number you like, but be aware that most cluster systems do not allow more than 1000 jobs per user at the same time.
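
With this set up, each downstream job is handed to SLURM with its dependencies already encoded, roughly like this (the job ids, rule name and jobscript path are only illustrative):

sbatch --dependency=aftercorr:123,456 .snakemake/tmp.abc123/snakejob.call_variants.7.sh

Once everything has been submitted this way, the master snakemake process exits, so it no longer has to stay alive for the full runtime of the pipeline.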
