How do I write a Snakemake input when not all jobs successfully output files?

Question

Basically, I have three Snakemake rules (besides rule all) and cannot figure this problem out, despite the checkpoint resources.

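Rule 1 takes the first and only file that I start with. It will have x outputs (the number varies depending on the input file). Each of those x outputs needs to be processed separately in rule 2, meaning that rule 2 will run x jobs. However, only some subset, y, of these jobs will produce outputs (the software only writes out files for inputs that pass a certain threshold). So, while I want each of those outputs to run as a separate job in rule 3, I don't know how many files will come out of rule 2. Rule 3 will also run y jobs, one for each successful output from rule 2.

I have two questions. The first is: how do I write the input for rule 3, not knowing how many files will come out of rule 2? The second is: how can I "tell" rule 2 that it is done when the number of output files does not match the number of input files? If I add a fourth rule, I imagine it would try to re-run rule 2 on the jobs that didn't produce an output file, which would never give an output. Maybe I am missing something in setting up the checkpoints?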

Something like:

rule a:
     input: "file.vcf"
     output: "dummy.txt"
     shell:"""
      .... make unknown number of output files (x) x_1 , x_2, ..., x_n 
           """ 
#run a separate job from each output of rule a
rule b:
     input: "x_1" #not sure how many are going to be inputs here
     output: "y_1" #not sure how many output files will be here
     shell:"""
           some of the x inputs will output their corresponding y, but others will have no output
           """
#run a separate job for each output of rule b
rule c:
     input: "y_1" #not sure how many input files here
     output: "z_1"

Answer 1

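You should change rule a to a checkpoint, as mentioned in the comments. Rule b will generate one output for each input and can be left as is; the same goes for rule c in this example.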

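Eventually, you will have an aggregate-like rule that decides which outputs are required. It may be rule d, or it may end up being rule all. Either way, the aggregate rule needs an input function that invokes the checkpoint to determine which files are present. If you follow along with the example, you would have something like: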

import os

checkpoint a:
     input: "file.vcf"
     output: directory("output_dir")
     shell:"""
           mkdir {output}  # then put all the output files here!
      .... make unknown number of output files (x) x_1.txt, x_2.txt, ..., x_n.txt
           """ 
#run a separate job from each output of checkpoint a
rule b:
     # x files are assumed to carry a .txt suffix so they match the glob below
     input: "output_dir/x_{n}.txt"
     output: "y_{n}"
     shell:"""
           some of the x inputs will output their corresponding y, but others will have no output
           """
#run a separate job for each output of rule b
rule c:
     input: "y_{n}"
     output: "z_{n}"

# input function for the rule aggregate
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    return expand("z_{i}",
           i=glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i)

rule aggregate:  # what do you do with all the z files?  could be all
    input: aggregate_input
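
To make it concrete what the input function computes once the checkpoint has finished, here is a minimal plain-Python sketch that imitates the two helpers with the standard library. `find_wildcard_values` and `build_targets` are hypothetical stand-ins for `glob_wildcards` and `expand`, not Snakemake's implementation, and the `x_{i}.txt` naming is assumed from the example above.

```python
import os
import re
import tempfile


def find_wildcard_values(directory, prefix="x_", suffix=".txt"):
    """Return the sorted values i for files named <prefix><i><suffix>.

    Hypothetical stand-in for glob_wildcards(...).i; not Snakemake code.
    """
    pattern = re.compile(re.escape(prefix) + r"(.+)" + re.escape(suffix) + r"$")
    values = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            values.append(match.group(1))
    return sorted(values)


def build_targets(values, template="z_{i}"):
    """Hypothetical stand-in for expand(template, i=values)."""
    return [template.format(i=value) for value in values]


def demo():
    # Simulate checkpoint a writing only x_2.txt and x_3.txt (no x_1.txt).
    with tempfile.TemporaryDirectory() as output_dir:
        for i in ("2", "3"):
            open(os.path.join(output_dir, f"x_{i}.txt"), "w").close()
        return build_targets(find_wildcard_values(output_dir))


if __name__ == "__main__":
    print(demo())
```

Only the files that actually exist in the directory drive the list of requested `z_` targets, which is why the input function has to wait until the checkpoint has produced them.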

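If you imagine your workflow as a tree, rule a branches into a variable number of branches. Rules b and c are one-to-one mappings. Aggregate brings all the branches back together and is responsible for checking how many branches are present. Rules b and c only see one input and one output, and don't care how many other branches there are.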

Edited to answer a question from the comments and to fix several bugs in my code:

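"I am still confused here, because rule b will not have as many outputs as inputs. So won't rule aggregate never run until all the wildcards from the output of checkpoint a are present in z_{n}, which they never would be?"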

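This is confusing because it is not how snakemake usually works, and it leads to a lot of questions on SO. What you need to remember is that when checkpoints.<rule>.get is run, the evaluation for that step effectively pauses. Consider the simple case of three values, i == [1, 2, 3], where only i == 2 and 3 are created in checkpoint a. We know the DAG will look like this: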

rule             file
input           file.vcf
             /     |     \
a                 x_2    x_3
                   |      |
b                 y_2    y_3
                   |      |
c                 z_2    z_3
                    \     /
aggregate           OUTPUT

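Here x_1 is missing from the output of checkpoint a. But snakemake doesn't know how checkpoint a will behave, only that it will create a directory as output and (because it is a checkpoint) that the DAG will be re-evaluated once it completes. So if you ran snakemake -nq you would see that checkpoint a and aggregate will run, with no mention of b or c. At that point, those are the only rules snakemake knows about and plans to run. Calling checkpoints.<rule>.get basically says "wait here; after this rule you will have to see what was made".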

So when snakemake first starts running your workflow, the DAG looks like this:

rule             file
input           file.vcf
                   |     
a                 ...
                   |     
????              ...
                   |     
aggregate        OUTPUT

Snakemake doesn't know what goes between rule a and aggregate, only that it needs to run a before it can tell.

rule             file
input           file.vcf
             /     |     \
a                 x_2    x_3
                        
????              ...
                   |     
aggregate        OUTPUT

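Checkpoint a gets scheduled and run, and now the DAG is re-evaluated. The rest of aggregate_input uses glob_wildcards to look at the files that are present, then uses that information to decide which files it needs. Note that the expand requests outputs from rule c, which requires rule b, which requires the x_{n} files, which exist now that the checkpoint has run. Now snakemake can construct the DAG you expect.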

Here is the input function with more comments, to hopefully make it clear:

def aggregate_input(wildcards):
    # say this rule depends on a checkpoint.  DAG evaluation pauses here
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    # at this point, checkpoint a has completed and the output (directory)
    # is in checkpoint_output.  Some number of files are there

    # use glob_wildcards to find the x_{i} files that actually exist
    found_files = glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i
    # now we know we need all the z files to be created *if* an x file exists.
    return expand("z_{i}", i=found_files)
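
The pause-and-re-evaluate behaviour described in this answer can be imitated with a small toy model in plain Python. Everything here (`ToyCheckpoint`, `IncompleteCheckpoint`, `plan`) is a hypothetical illustration, not Snakemake's internals: an exception plays the role of the pause, and calling `plan` a second time plays the role of the DAG re-evaluation.

```python
class IncompleteCheckpoint(Exception):
    """Raised when an input function asks for a checkpoint that has not run."""


class ToyCheckpoint:
    """Hypothetical model of a checkpoint; not Snakemake's internals."""

    def __init__(self, produce):
        self.produce = produce  # callable returning the files it creates
        self.files = None       # unknown until the checkpoint has run

    def get(self):
        if self.files is None:
            raise IncompleteCheckpoint()
        return self.files

    def run(self):
        self.files = self.produce()


def aggregate_input(checkpoint):
    # Evaluation stops here (via the exception) until the checkpoint has run.
    found = checkpoint.get()
    return ["z_" + name.split("_", 1)[1] for name in found]


def plan(checkpoint):
    """Return the jobs that are known at this point in time."""
    try:
        targets = aggregate_input(checkpoint)
    except IncompleteCheckpoint:
        # Only the checkpoint and the final rule are known so far.
        return ["a", "aggregate"]
    jobs = []
    for target in targets:
        i = target.split("_", 1)[1]
        jobs += [f"b:{i}", f"c:{i}"]
    return jobs + ["aggregate"]


checkpoint_a = ToyCheckpoint(lambda: ["x_2", "x_3"])
first_plan = plan(checkpoint_a)   # before the checkpoint has run
checkpoint_a.run()
second_plan = plan(checkpoint_a)  # after re-evaluation
print(first_plan)
print(second_plan)
```

The first plan contains only `a` and `aggregate`, matching the dry-run described above; the `b` and `c` jobs for branches 2 and 3 appear only in the second plan, and nothing is ever planned for the missing branch 1.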

Solution 1
1 accepted 2021-03-09 14:28:47
