
Snakemake doesn't infer wildcards with input function

I'm trying to write a Snakemake workflow in which the wildcards are 'swapped' using an input function. Essentially, there are two conditions ('A' and 'B'), and a number of files are generated for each (A/1.txt, A/2.txt, etc.; B/1.txt, B/2.txt, etc.). The number of files is always the same between the conditions, but unknown at the start of the workflow. There is an intermediate step, and then I want to use the intermediate files from one condition together with the original files from the other condition. I wrote a simple Snakefile that illustrates what I want to do:

dirs = ("A", "B")

wildcard_constraints:
  DIR='|'.join(dirs),
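  # raw string avoids the invalid "\d" escape; the key matches the {SAMPLE} wildcard used in the rules
  SAMPLE=r'\d+'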

rule all:
    input:
        expand("{DIR}_done", DIR=dirs)

checkpoint create_files:
    output:
        directory("{DIR}/")
    shell:
        """
        mkdir {output};
          N=10;
        for D in $(seq $N); do
            touch {output}/$D.txt
        done
        """

rule intermediate:
  input:
    "{DIR}/{SAMPLE}.txt"
  output:
    intermediate = "{DIR}_intermediate/{SAMPLE}.txt"
  shell:
    # a named output must be referenced as {output.<name>} inside shell commands
    "touch {output.intermediate}"

def swap_input(wildcards):
  
  if wildcards.DIR == 'A':
    return f'B_intermediate/{wildcards.SAMPLE}.txt'
    
  if wildcards.DIR == 'B':
    return f'A_intermediate/{wildcards.SAMPLE}.txt'

rule swap:
  input:
    original = "{DIR}/{SAMPLE}.txt",
    intermediate = swap_input
  output:
    swapped="{DIR}_swp/{SAMPLE}.txt"
  shell:
    "touch {output}"

def get_samples(wildcards):

    checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0]

    samples = glob_wildcards(os.path.join(checkpoint_dir, "{sample}.txt")).sample
    return samples

rule done:
  input:
    lambda wildcards: expand("{DIR}_swp/{SAMPLE}.txt", SAMPLE=get_samples(wildcards), allow_missing=True)
  output:
    "{DIR}_done"
  shell:
    "touch {output}"

However, snakemake doesn't appear to be correctly inferring the wildcards. Perhaps I am misunderstanding something about the way snakemake infers wildcards. I get the error:

MissingInputException in rule intermediate in line 38 of Snakefile:
Missing input files for rule intermediate:
    output: A_intermediate/1.txt
    wildcards: DIR=A, SAMPLE=1
    affected files:
        A/1.txt

But the file A/1.txt should be created by the checkpoint create_files, which comes first in the workflow.

I thought perhaps this might be something to do with the checkpoint not being completed, but if I add checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0] at the start of the input function swap_input , the error is still the same.

Is there a way to get this workflow to work?

I managed to figure this out. The key was that the input to rule intermediate also had to depend on the checkpoint, so that it is only evaluated after the checkpoint has completed (otherwise Snakemake doesn't know about the files A/1.txt, etc.).
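
To illustrate why the timing matters, here is a plain-Python sketch (not Snakemake code) that mimics the situation: the sample names can only be listed once the step that creates the directory has run. The helper names `fake_checkpoint` and `discover_samples` are hypothetical stand-ins, and `discover_samples` only approximates what `glob_wildcards` yields for a `{sample}.txt` pattern.

```python
import os
import re
import tempfile


def fake_checkpoint(out_dir, n=10):
    """Hypothetical stand-in for the create_files checkpoint: writes 1.txt .. n.txt."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(1, n + 1):
        with open(os.path.join(out_dir, f"{i}.txt"), "w") as handle:
            handle.write(f"{i}\n")


def discover_samples(out_dir):
    """Rough approximation of listing the {sample} values of out_dir/{sample}.txt."""
    if not os.path.isdir(out_dir):
        return []  # nothing can be listed before the directory exists
    matches = (re.fullmatch(r"(\d+)\.txt", name) for name in os.listdir(out_dir))
    return sorted((m.group(1) for m in matches if m), key=int)


with tempfile.TemporaryDirectory() as tmp:
    dir_a = os.path.join(tmp, "A")
    before = discover_samples(dir_a)  # evaluated too early: no files are known yet
    fake_checkpoint(dir_a, n=3)
    after = discover_samples(dir_a)   # evaluated after the step has run
    print(before, after)
```

In the workflow itself, calling `checkpoints.create_files.get(**wildcards)` inside the input function is what defers the lookup until the checkpoint has finished.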

I also changed the shell directives for the rules so that we can check that the workflow is behaving as expected, and added the -p flag to mkdir in the first checkpoint as suggested by @Wayne. The final workflow looks like this:

dirs = ("A", "B")

wildcard_constraints:
  DIR='|'.join(dirs),
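  # raw string avoids the invalid "\d" escape; the key matches the {SAMPLE} wildcard used in the rules
  SAMPLE=r'\d+'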

rule all:
    input:
        expand("{DIR}_done", DIR=dirs)

checkpoint create_files:
  output:
    directory("{DIR}/")
  shell:
    """
    mkdir -p {output}
    N=10
    for D in $(seq $N); do
      let "NUM = $D + $RANDOM"
      echo $NUM > {output}/$D.txt
    done
    """

rule intermediate:
  input:
    lambda wildcards: os.path.join(checkpoints.create_files.get(**wildcards).output[0],
                                    f"{wildcards.SAMPLE}.txt")
  output:
    "{DIR}_intermediate/{SAMPLE}.txt"
  shell:
    "cp {input} {output}"


def swap_intermediate(wildcards):
  
  if wildcards.DIR == 'A':
    return f'B_intermediate/{wildcards.SAMPLE}.txt'
    
  if wildcards.DIR == 'B':
    return f'A_intermediate/{wildcards.SAMPLE}.txt'


rule swap:
  input:
    original = "{DIR}/{SAMPLE}.txt",
    intermediate = swap_intermediate
  output:
    swapped="{DIR}_swp/{SAMPLE}.txt"
  shell:
    "cat {input.original} {input.intermediate} > {output.swapped}"


def get_samples(wildcards):

    checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0]

    samples = glob_wildcards(os.path.join(checkpoint_dir, "{sample}.txt")).sample
    return samples
    

rule done:
  input:
    lambda wildcards: expand("{DIR}_swp/{SAMPLE}.txt", SAMPLE=get_samples(wildcards), allow_missing=True)
  output:
    "{DIR}_done"
  shell:
    "touch {output}"

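As a side note on the swap function: with two `if` statements and no final `return`, it implicitly returns `None` for any DIR other than 'A' or 'B' (the wildcard constraint on DIR prevents that in this workflow). A possible variant, sketched here in plain Python, uses a lookup table and fails loudly instead. The name `SWAP_PARTNER` is hypothetical, and `SimpleNamespace` merely stands in for Snakemake's wildcards object so the sketch can run outside a Snakefile.

```python
from types import SimpleNamespace

# Hypothetical lookup table: each condition maps to the one it is swapped with.
SWAP_PARTNER = {"A": "B", "B": "A"}


def swap_intermediate(wildcards):
    """Return the intermediate file of the *other* condition for this sample."""
    try:
        partner = SWAP_PARTNER[wildcards.DIR]
    except KeyError:
        raise ValueError(f"no swap partner defined for DIR={wildcards.DIR!r}") from None
    return f"{partner}_intermediate/{wildcards.SAMPLE}.txt"


# SimpleNamespace stands in for the wildcards object in this sketch.
example = swap_intermediate(SimpleNamespace(DIR="A", SAMPLE="3"))
print(example)
```

The table also makes it easier to add further conditions later, at the cost of one more name to keep in sync with `dirs`.
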
The output pattern of create_files only declares a directory, not the files inside it, so when Snakemake gets to rule intermediate it sees no connection between that rule's input and the output of create_files.

Also, the rule running mkdir each time was causing an error (mkdir: cannot create directory 'B': File exists), so I separated that step out into its own rule. (Adding the -p flag to avoid that error led to an incorrect-type error instead.) I feel the two steps could be combined correctly, but I couldn't come up with the solution.
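
For reference, the difference between plain `mkdir` and `mkdir -p` on an existing directory can be reproduced in plain Python. This is only an analogy using the standard library (`os.mkdir` versus `os.makedirs(..., exist_ok=True)`), not part of the workflow; the directory name "B" is taken from the error message above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "B")
    os.mkdir(target)  # first creation succeeds

    try:
        os.mkdir(target)  # like plain `mkdir`: fails because the directory exists
        plain_failed = False
    except FileExistsError:
        plain_failed = True

    os.makedirs(target, exist_ok=True)  # like `mkdir -p`: no error if it exists
    still_there = os.path.isdir(target)

print(plain_failed, still_there)
```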

Suggested:

dirs = ("A", "B")

wildcard_constraints:
  DIR='|'.join(dirs),
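  # raw string avoids the invalid "\d" escape; the key matches the {SAMPLE} wildcard used in the rules
  SAMPLE=r'\d+'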
 
SAMPLE_NUMS = list(range(1,11))
#print(SAMPLE_NUMS)
#print(expand("{DIR}/{SAMPLE}.txt", SAMPLE=SAMPLE_NUMS, allow_missing=True))

rule all:
    input:
        expand("{DIR}_done", DIR=dirs)

checkpoint create_dir:
    output:
        directory("{DIR}/")
    shell:
        """
        mkdir {output}
        """

checkpoint create_files:
    output:
        expand("{DIR}/{SAMPLE}.txt", SAMPLE=SAMPLE_NUMS, allow_missing=True)
    shell:
        """
        touch {output}
        """

rule intermediate:
  input:
    "{DIR}/{SAMPLE}.txt"
  output:
    intermediate = "{DIR}_intermediate/{SAMPLE}.txt"
  shell:
    # a named output must be referenced as {output.<name>} inside shell commands
    "touch {output.intermediate}"

def swap_input(wildcards):
  
  if wildcards.DIR == 'A':
    return f'B_intermediate/{wildcards.SAMPLE}.txt'
    
  if wildcards.DIR == 'B':
    return f'A_intermediate/{wildcards.SAMPLE}.txt'

rule swap:
  input:
    original = "{DIR}/{SAMPLE}.txt",
    intermediate = swap_input
  output:
    swapped="{DIR}_swp/{SAMPLE}.txt"
  shell:
    "touch {output}"

def get_samples(wildcards):

    checkpoint_dir = checkpoints.create_files.get(**wildcards).output[0]

    samples = glob_wildcards(os.path.join(checkpoint_dir, "{sample}.txt")).sample
    return samples

rule done:
  input:
    lambda wildcards: expand("{DIR}_swp/{SAMPLE}.txt", SAMPLE=get_samples(wildcards), allow_missing=True)
  output:
    "{DIR}_done"
  shell:
    "touch {output}"
