How do I write a snakemake input when not all jobs successfully output files from previous rule?
Basically, I have three snakemake rules (other than rule all) and cannot figure this problem out, despite the checkpoint resources.

Rule one has my first and only file that I start with. It will have x outputs (the number varies depending on the input file). Each of those x outputs needs to be processed separately in rule 2, meaning that rule 2 will run x jobs. However, only some subset, y, of those jobs will produce outputs (the software only writes out files for inputs that pass a certain threshold). So, while I want each of those outputs to run as a separate job in rule 3, I don't know how many files will come out of rule 2. Rule 3 will also run y jobs, one for each successful output from rule 2. I have two questions. The first is: how do I write the input for rule 3, not knowing how many files will come out of rule 2? The second is: how can I "tell" rule 2 it is done, when the number of output files doesn't correspond to the number of input files? If I add a fourth rule, I imagine it would try to re-run rule 2 on the jobs that didn't get an output file, which would never produce an output. Maybe I am missing something with setting up the checkpoints?
Something like:
rule a:
    input: "file.vcf"
    output: "dummy.txt"
    shell: """
    ... make unknown number of output files (x): x_1, x_2, ..., x_n
    """

# run a separate job for each output of rule a
rule b:
    input: "x_1"   # not sure how many inputs there will be here
    output: "y_1"  # not sure how many output files will be here
    shell: """
    some of the x inputs will write their corresponding y, but others will have no output
    """

# run a separate job for each output of rule b
rule c:
    input: "y_1"   # not sure how many input files here
    output: "z_1"
You should change rule a to a checkpoint, as mentioned in the comments. Rule b will generate one output for each input and can be left as is, and the same goes for rule c in this example.
Eventually, you will have an aggregate-like rule that decides which outputs are required. It could be rule d, or it may end up being rule all. Either way, the aggregate rule needs an input function that invokes the checkpoint to determine which files are present. If you follow along with the example in the docs, you would have something like:
checkpoint a:
    input: "file.vcf"
    output: directory("output_dir")
    shell: """
    mkdir {output}  # then put all the output files here!
    ... make unknown number of output files (x): x_1, x_2, ..., x_n
    """

# run a separate job for each output of checkpoint a
rule b:
    input: "output_dir/x_{n}.txt"
    output: "y_{n}"
    shell: """
    some of the x inputs will write their corresponding y, but others will have no output
    """

# run a separate job for each output of rule b
rule c:
    input: "y_{n}"
    output: "z_{n}"
# input function for the rule aggregate
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    return expand("z_{i}",
                  i=glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i)

rule aggregate:  # what do you do with all the z files? could be rule all
    input: aggregate_input
If you think of your workflow like a tree, rule a is branching with a variable number of branches. Rules b and c are one-to-one mappings. Aggregate brings all the branches back together AND is responsible for checking how many branches are present. Rules b and c each only see one input/output and don't care how many other branches there are.
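Putting the pieces together, a complete minimal Snakefile could look like the sketch below. The x_{i}.txt naming, the placeholder shell commands (your_tool, process_b, process_c), and the aggregated.txt target are illustrative assumptions, not details from the original question:

```python
import os

# final target: the aggregate rule's output (name is an assumption)
rule all:
    input: "aggregated.txt"

# branching step: writes an unknown number of output_dir/x_{i}.txt files
checkpoint a:
    input: "file.vcf"
    output: directory("output_dir")
    shell: "mkdir -p {output} && your_tool {input} -o {output}"  # hypothetical command

# one-to-one step; only ever sees a single branch at a time
rule b:
    input: "output_dir/x_{i}.txt"
    output: "y_{i}"
    shell: "process_b {input} > {output}"  # hypothetical command

rule c:
    input: "y_{i}"
    output: "z_{i}"
    shell: "process_c {input} > {output}"  # hypothetical command

def aggregate_input(wildcards):
    # pauses DAG evaluation until checkpoint a has finished
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    return expand("z_{i}",
                  i=glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i)

rule aggregate:
    input: aggregate_input
    output: "aggregated.txt"
    shell: "cat {input} > {output}"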
EDIT to answer the question in the comments and to fix several bugs in my code:

I still get confused here though, because rule b will not have as many outputs as inputs, so won't rule aggregate never run until all of the wildcards from the output of checkpoint a are present in z_{n}, which they never would be?

This is confusing because it's not how snakemake usually works, and it leads to a lot of questions on SO. What you need to remember is that when checkpoints.<rule>.get is run, the evaluation of that step effectively pauses. Consider the simple case of three values, i == [1, 2, 3], where only i == 2 and 3 are created in checkpoint a. We know the DAG will look like this:
rule        file
input       file.vcf
           /  |  \
a        x_2     x_3
          |       |
b        y_2     y_3
          |       |
c        z_2     z_3
           \     /
aggregate   OUTPUT
Here x_1 is missing from checkpoint a. But snakemake doesn't know how checkpoint a will behave, just that it will make a directory as output and (because it is a checkpoint) that once it completes, the DAG will be reevaluated. So if you ran snakemake -nq, you would see that checkpoint a and aggregate will run, but there would be no mention of b or c. At that point, those are the only rules snakemake knows about and plans to run. Calling checkpoints.<rule>.get basically says "wait here; after this rule you will have to see what is made".
So when snakemake first starts running your workflow, the DAG looks like this:
rule        file
input       file.vcf
              |
a            ...
              |
????         ...
              |
aggregate   OUTPUT
Snakemake doesn't know what goes between rule a and aggregate, just that it needs to run a before it can tell.
rule        file
input       file.vcf
           /  |  \
a        x_2     x_3
????         ...
              |
aggregate   OUTPUT
Checkpoint a gets scheduled and run, and now the DAG is reevaluated. The rest of aggregate_input looks at the files that are present with glob_wildcards, then uses that information to decide which files it needs. Note that the expand is requesting outputs from rule c, which requires rule b, which requires x_{n}, which exist now that the checkpoint has run. Now snakemake can construct the DAG you expect.
Here's the input function with more comments to hopefully make it clear:
def aggregate_input(wildcards):
    # say this rule depends on a checkpoint. DAG evaluation pauses here
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    # at this point, checkpoint a has completed and the output (directory)
    # is in checkpoint_output. Some number of files are there.
    # use glob_wildcards to find the x_{i} files that actually exist
    found_files = glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i
    # now we know we need all the z files to be created *if* an x file exists
    return expand("z_{i}", i=found_files)
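To see what glob_wildcards and expand are doing inside that function, here is a stdlib-only Python sketch. The mimic_* names are invented for illustration; real snakemake has its own implementations of these helpers:

```python
import os
import re
import tempfile

def mimic_glob_wildcards(directory):
    """Collect the {i} values from filenames matching x_{i}.txt,
    like glob_wildcards(os.path.join(directory, "x_{i}.txt")).i"""
    pattern = re.compile(r"x_(?P<i>.+)\.txt")
    return sorted(m.group("i")
                  for f in os.listdir(directory)
                  if (m := pattern.fullmatch(f)))

def mimic_aggregate_input(checkpoint_output):
    """Build the list of required z files, like expand("z_{i}", i=...)."""
    return [f"z_{i}" for i in mimic_glob_wildcards(checkpoint_output)]

# Simulate checkpoint a having produced only x_2 and x_3
# (x_1 failed the threshold and was never written).
with tempfile.TemporaryDirectory() as outdir:
    for i in (2, 3):
        open(os.path.join(outdir, f"x_{i}.txt"), "w").close()
    result = mimic_aggregate_input(outdir)

print(result)  # -> ['z_2', 'z_3']
```

Only the files that actually exist drive the expand, which is why the missing x_1 never blocks the aggregate rule.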