
Snakemake doesn't infer wildcards with input function

I'm trying to write a Snakemake workflow in which the wildcards are 'swapped' using an input function. Essentially, there are two conditions ('A' and 'B'), and a number of files are generated for each (A/1.txt, A/2.txt, etc.; B/1.txt, B/2.txt, etc.). The number of files is always the same between the conditions, but unknown at the start of the workflow. There is an intermediate step, and then I want to use the intermediate files from one condition with the original files from the other condition. I wrote a simple Snakefile that illustrates what I want to do:

dirs = ("A", "B")

wildcard_constraints:
  DIR='|'.join(dirs),
  SAMPLE=r'\d+'

rule all:
    input:
        expand("{DIR}_done", DIR=dirs)

checkpoint create_files:
    output:
        directory("{DIR}/")
    shell:
        """
        mkdir {output};
          N=10;
        for D in $(seq $N); do
            touch {output}/$D.txt
        done
        """

rule intermediate:
  input:
    "{DIR}/{SAMPLE}.txt"
  output:
    intermediate = "{DIR}_intermediate/{SAMPLE}.txt"
  shell:
    "touch {output.intermediate}"

def swap_input(wildcards):
  
  if wildcards.DIR == 'A':
    return f'B_intermediate/{wildcards.SAMPLE}.txt'
    
  if wildcards.DIR == 'B':
    return f'A_intermediate/{wildcards.SAMPLE}.txt'

rule swap:
  input:
    original = "{DIR}/{SAMPLE}.txt",
    intermediate = swap_input
  output:
    swapped="{DIR}_swp/{SAMPLE}.txt"
  shell:
    "touch {output}"

def get_samples(wildcards):

    checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0]

    samples = glob_wildcards(os.path.join(checkpoint_dir, "{sample}.txt")).sample
    return samples

rule done:
  input:
    lambda wildcards: expand("{DIR}_swp/{SAMPLE}.txt", SAMPLE=get_samples(wildcards), allow_missing=True)
  output:
    "{DIR}_done"
  shell:
    "touch {output}"

However, Snakemake doesn't appear to be inferring the wildcards correctly. Perhaps I am misunderstanding something about the way Snakemake infers wildcards. I get the error:

MissingInputException in rule intermediate in line 38 of Snakefile:
Missing input files for rule intermediate:
    output: A_intermediate/1.txt
    wildcards: DIR=A, SAMPLE=1
    affected files:
        A/1.txt

But the file A/1.txt should be created by the first rule, create_files.

I thought this might have something to do with the checkpoint not being completed, but if I add checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0] at the start of the input function swap_input, the error is still the same.

Is there a way to get this workflow to work?

I managed to figure this out. The key was that the input to rule intermediate also had to be conditional on the checkpoint, so that it is only evaluated after the checkpoint completes (otherwise Snakemake doesn't know about the files A/1.txt, etc.).
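
To make that concrete, here is a rough plain-Python sketch (no Snakemake required) of what a checkpoint-dependent input function computes. Everything in it is a hypothetical stand-in: `finished`, `checkpoint_output` and `CheckpointNotFinished` only imitate the role of `checkpoints.create_files.get(...)`, which, as far as I understand it, signals an unfinished checkpoint with an exception so that the input can be re-evaluated later.

```python
import os
from types import SimpleNamespace


class CheckpointNotFinished(Exception):
    """Hypothetical stand-in for the signal that defers input evaluation."""


# Hypothetical record of finished checkpoint jobs: DIR value -> output directory.
finished = {}


def checkpoint_output(DIR):
    """Stand-in for checkpoints.create_files.get(DIR=DIR).output[0]."""
    if DIR not in finished:
        raise CheckpointNotFinished(DIR)
    return finished[DIR]


def intermediate_input(wildcards):
    """Same path construction as the lambda used as input of rule intermediate."""
    return os.path.join(checkpoint_output(wildcards.DIR), f"{wildcards.SAMPLE}.txt")


wc = SimpleNamespace(DIR="A", SAMPLE="1")
try:
    intermediate_input(wc)
except CheckpointNotFinished:
    print("input for DIR=A deferred until its checkpoint has run")

finished["A"] = "A"  # pretend the checkpoint for DIR=A has completed
print(intermediate_input(wc))
```

With a plain string input such as "{DIR}/{SAMPLE}.txt" there is no such deferral: the file has to exist, or be the output of some rule, at the moment the job graph is built.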

I also changed the shell directives for the rules so that we can check that the workflow behaves as expected, and added the -p flag to mkdir in the checkpoint, as suggested by @Wayne. The final workflow looks like this:

import os

dirs = ("A", "B")

wildcard_constraints:
  DIR='|'.join(dirs),
  SAMPLE=r'\d+'

rule all:
    input:
        expand("{DIR}_done", DIR=dirs)

checkpoint create_files:
  output:
    directory("{DIR}/")
  shell:
    """
    mkdir -p {output}
    N=10
    for D in $(seq $N); do
      let "NUM = $D + $RANDOM"
      echo $NUM > {output}/$D.txt
    done
    """

rule intermediate:
  input:
    lambda wildcards: os.path.join(checkpoints.create_files.get(**wildcards).output[0],
                                    f"{wildcards.SAMPLE}.txt")
  output:
    "{DIR}_intermediate/{SAMPLE}.txt"
  shell:
    "cp {input} {output}"


def swap_intermediate(wildcards):
  
  if wildcards.DIR == 'A':
    return f'B_intermediate/{wildcards.SAMPLE}.txt'
    
  if wildcards.DIR == 'B':
    return f'A_intermediate/{wildcards.SAMPLE}.txt'


rule swap:
  input:
    original = "{DIR}/{SAMPLE}.txt",
    intermediate = swap_intermediate
  output:
    swapped="{DIR}_swp/{SAMPLE}.txt"
  shell:
    "cat {input.original} {input.intermediate} > {output.swapped}"


def get_samples(wildcards):

    checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0]

    samples = glob_wildcards(os.path.join(checkpoint_dir, "{sample}.txt")).sample
    return samples
    

rule done:
  input:
    lambda wildcards: expand("{DIR}_swp/{SAMPLE}.txt", SAMPLE=get_samples(wildcards), allow_missing=True)
  output:
    "{DIR}_done"
  shell:
    "touch {output}"

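For anyone wondering what get_samples contributes here, this is a small plain-Python imitation of the sample discovery step. It is only a sketch: `discover_samples` is a made-up helper, it uses a regular expression instead of `glob_wildcards`, and unlike `glob_wildcards` it sorts the result. The temporary directory plays the role of the checkpoint's output directory.

```python
import os
import re
import tempfile


def discover_samples(checkpoint_dir):
    """Return the numeric stems of the <number>.txt files in checkpoint_dir."""
    stems = []
    for name in os.listdir(checkpoint_dir):
        match = re.fullmatch(r"(\d+)\.txt", name)
        if match:
            stems.append(match.group(1))
    # Sort numerically so that "10" comes after "2".
    return sorted(stems, key=int)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate the files written by the checkpoint.
    for n in (1, 2, 3):
        open(os.path.join(tmp, f"{n}.txt"), "w").close()
    samples = discover_samples(tmp)
    # Equivalent in spirit to expand("{DIR}_swp/{SAMPLE}.txt", ...) for DIR=A.
    targets = [f"A_swp/{s}.txt" for s in samples]

print(targets)  # ['A_swp/1.txt', 'A_swp/2.txt', 'A_swp/3.txt']
```

The list of targets can only be built once the directory has been filled, which is why rule done goes through the checkpoint before expanding.
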
The pattern in the output of create_files doesn't suggest that it makes files, just a directory, so when Snakemake gets to rule intermediate, it doesn't see the association between the input you gave for rule intermediate and the output of create_files.
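
A simplified illustration of that matching step, in plain Python. This is not Snakemake's actual code: `pattern_to_regex` and `infer_wildcards` are hypothetical helpers that only mimic the idea of turning an output pattern into a regular expression and matching a requested file against it.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Compile an output pattern like "{DIR}_swp/{SAMPLE}.txt" into a regex.

    Assumes each wildcard name appears at most once in the pattern.
    """
    constraints = constraints or {}
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>{constraints.get(part, '.+')})"
    return re.compile(regex)


def infer_wildcards(pattern, target, constraints=None):
    """Return the wildcard values if target matches pattern, else None."""
    match = pattern_to_regex(pattern, constraints).fullmatch(target)
    return match.groupdict() if match else None


constraints = {"DIR": "A|B", "SAMPLE": r"\d+"}

# The output of rule intermediate matches, so its wildcards can be inferred.
print(infer_wildcards("{DIR}_intermediate/{SAMPLE}.txt", "A_intermediate/1.txt", constraints))

# The directory output of create_files does not match a file inside it.
print(infer_wildcards("{DIR}/", "A/1.txt", constraints))
```

The second call returns None: no output pattern in the workflow produces A/1.txt, which is what the MissingInputException is reporting.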

Also, the rule running mkdir each time was causing an error (mkdir: cannot create directory 'B': File exists), so I separated that out. (Adding the -p flag to avoid that error led to an incorrect type call.) I feel like it could be combined correctly; however, I wasn't coming up with the solution.

Suggested:

dirs = ("A", "B")

wildcard_constraints:
  DIR='|'.join(dirs),
  SAMPLE=r'\d+'
 
SAMPLE_NUMS = list(range(1,11))
#print(SAMPLE_NUMS)
#print(expand("{DIR}/{SAMPLE}.txt", SAMPLE=SAMPLE_NUMS, allow_missing=True))

rule all:
    input:
        expand("{DIR}_done", DIR=dirs)

checkpoint create_dir:
    output:
        directory("{DIR}/")
    shell:
        """
        mkdir {output}
        """

checkpoint create_files:
    output:
        expand("{DIR}/{SAMPLE}.txt", SAMPLE=SAMPLE_NUMS, allow_missing=True)
    shell:
        """
        touch {output}
        """

rule intermediate:
  input:
    "{DIR}/{SAMPLE}.txt"
  output:
    intermediate = "{DIR}_intermediate/{SAMPLE}.txt"
  shell:
    "touch {output.intermediate}"

def swap_input(wildcards):
  
  if wildcards.DIR == 'A':
    return f'B_intermediate/{wildcards.SAMPLE}.txt'
    
  if wildcards.DIR == 'B':
    return f'A_intermediate/{wildcards.SAMPLE}.txt'

rule swap:
  input:
    original = "{DIR}/{SAMPLE}.txt",
    intermediate = swap_input
  output:
    swapped="{DIR}_swp/{SAMPLE}.txt"
  shell:
    "touch {output}"

def get_samples(wildcards):

    checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0]

    samples = glob_wildcards(os.path.join(checkpoint_dir, "{sample}.txt")).sample
    return samples

rule done:
  input:
    lambda wildcards: expand("{DIR}_swp/{SAMPLE}.txt", SAMPLE=get_samples(wildcards), allow_missing=True)
  output:
    "{DIR}_done"
  shell:
    "touch {output}"
