简体   繁体   中英

Is it possible to use wildcards in config files for a Snakemake pipeline?

I'm new to building Snakefiles and for my bioinformatics research, I'm trying to loop my rules over multiple samples. I looked for similar questions and answers, but I can't seem to fix this problem. It may be because I still don't really understand how Snakemake works exactly. If you guys can help me out that would be great.

At the moment I have multiple rules, which currently works for one sample:

# variables for every species
SAMPLE = "SRR8528338"
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

configfile: "./envs/contigs/" + SAMPLE + ".yaml"

var_variables = expand("results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var", nr = config["contig_nrs"])
#make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
         var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

The config file differs for every sample (the numbers of contigs). For SRR8528338, it looks like this:

contig_nrs:
    1: ./results/4_mapped_contigs/SRR8528338/var/Contig1_AT_sort.var
    2: ./results/4_mapped_contigs/SRR8528338/var/Contig2_AT_sort.var
    3: ./results/4_mapped_contigs/SRR8528338/var/Contig3_AT_sort.var
    ...
    2146: ./results/4_mapped_contigs/SRR8528338/var/Contig2146_AT_sort.var 

However, I want to loop all these rules over multiple samples as referred to in the "SAMPLES" variable. Now I tried using double braces before, which worked for multiple samples. (Changing all 'SAMPLES' to {{sample}} and adding: , sample = SAMPLES). Then my code should be looking like this:

# variables for every species
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

for sample in SAMPLES:
    configfile: "./envs/contigs/" + sample + ".yaml"

var_variables = expand("results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var", sample = SAMPLES, nr = config["contig_nrs"])
make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
         var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

However, when I run this I get an error. I'm not exactly sure, but I think it is because of the for loop (sample in SAMPLES):

Missing input files for rule all:
results/4_mapped_contigs/SRR8528338/var/Contig1266_AT_sort.var
results/4_mapped_contigs/SRR8528338/var/Contig1299_AT_sort.var
...

Now I was wondering: is there a way to expand the config file by using wildcards? Something like:

configfile: expand("./envs/contigs/{sample}.yaml", sample = SAMPLES)

Doing this will give me the error:

TypeError in line 4
expected str, bytes or os.PathLike object, not list

or do you have other solutions for this problem?

Thank you!


Update:

I've been trying some things out and I think it would be useful to change the config file into a nested dictionary instead of separate ones. It should look something like this:

    contigs:
         SRR8528336: - 1
                     - 2
                     - ...
                     - 2113
         SRR8528337: - 1
                      ...
          ...
    exons:
         SRR8528336: - 1
                      ...
                     - 1827
         SRR8528337: - 1
                       ...
                     - 1826
          ...

So for example, if I want to run for the samples: SRR8528338 until SRR8528340 I give this as input:

SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

and call the contigs by sample name:

var_variables = expand("results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var", nr = config["contigs"][wildcards.sample])

or exons by:

expand("results/7_exons/{{sample}}/var/exon{nr}_AT_sort.var", nr = config["exons"][wildcards.sample])

How does the 'wildcards.sample' works exactly if I only want to obtain the value?


Solution (and next problem) 31/7/2020

I made my changes according to bli, which is now working now:

# variables for every species
SAMPLES = "SRR8528347 SRR8528355 SRR8528356".split()

configfile: "./envs/config_contigs.yaml"

# bam = []
# sort_bam = []
# fpileup = []
var_variables = []
make_contig_consensus = []
blat_variables = []
extract_hits_psl = []
for sample in SAMPLES:
    contig_nrs = config[sample]
    for nr in contig_nrs:
        # bam.append("results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam".format(sample=sample, nr=nr))
        # sort_bam.append("results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam".format(sample=sample, nr=nr))
        # fpileup.append("results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup".format(sample=sample, nr=nr))
        var_variables.append("results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var".format(sample=sample, nr=nr))
        make_contig_consensus.append("results/A05_consensus_contigs/{sample}/Contig{nr}.fasta".format(sample=sample, nr=nr))
        blat_variables.append("results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl".format(sample=sample, nr=nr))
        extract_hits_psl.append("results/A07_mapped_exons/{sample}/".format(sample=sample, nr=nr))

rule all:
    input:
        # bam,
        # sort_bam,
        # fpileup,
        var_variables,
        make_contig_consensus,
        blat_variables,
        extract_hits_psl

rule convert_to_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/sam/Contig{nr}_AT.sam"
    output:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    output:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        script = "src/read_var.py",
        file = "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    output:
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    params:
        "{sample}"
    shell:
        "python3 {input.script} {params}"

rule BLAT_assembled:
    input:
        "data/exons/exons_AT.fasta",
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    output:
        "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    shell:
        "blat "
        "-t=dnax "
        "-q=dnax "
        "-stepSize=5 "
        "-repMatch=2253 "
        "-minScore=0 "
        "-minIdentity=0 "
        "{input} {output}"

rule extract_hits_psl:
    input:
        script = "src/extract_hits_psl.py"
        # file = "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    output:
        "results/A07_mapped_exons/{sample}/"
    params:
        "{sample}"
    shell:
        "python {input.script} {params}"

config_contigs.yaml:

SRR8528347:
    - 1
    - ...
    - 5
SRR8528348:
    - 1
    - ...
    - 5
...

Now calling them from the.yaml is working, but the rules should be run in the same order as written (from top to bottom). When running this, the rules are run in a different order and therefore gives an error because the files don't exist yet. I read that the output of the order before should be the same as the input after, but it is not working.

Here is an attempt at answering the question raised in your last edit (23/07/2020).

If your configuration file is as you show, you should have a config Python dict that contains a "contigs" entry, which is in turn a dict that can associate lists of numbers to sample identifiers.

You also generate a list of sample identifiers, and store it in the list SAMPLES .

To "drive" the pipeline, we decide which files will be the final goal, and that will be given as input of the all rule. Here, you want to generate a list of output files that follow this pattern:

"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"

where the {sample} placeholders will take the values in the SAMPLE list, and where the {nr} placeholders will take values depending on the value of the {sample} placeholder.

(Note: I use "placeholder" instead of "wildcard" to help making it clear that there is no "magic" involved with the snakemake wildcards object here. The wildcards objects are defined in each rule instance, but here we are just preparing lists of files, in "plain" Python, outside of rules.)

Let's do it in plain Python, not using expand .

Start with an empty list:

var_variables = []

Now make a main loop for the samples:

for sample in SAMPLES:
    # Note: No `wildcards` here. We are not in a rule.
    # `sample` is just a (loop) variable in Python,
    # referring to one of the character strings in `SAMPLES`
    contig_nrs = config["contigs"][sample]
    for nr in contig_nrs:
        # I'm using python 3.6+ "formatted strings" to have the placeholders
        # replaced with the corresponding variable's values
        var_variables.append(f"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var")

And now, after this double loop is done, we can provide this list of files to the driving rule:

rule all:
    input:
         var_variables

There could be ways to use nested expand s and double curly braces syntax somewhere ( {{}} ) to do the same thing, but if you know some basics of Python (lists, dicts, strings...), you can do without it, and it may be simpler to understand. ( expand is an utility provided by Snakemake, it doesn't exist in standard Python.)

As for the wildcards objects, these will exist only inside rules, and the existing attributes will be determined by matching the pattern of the output section with the name of the actual file that the rule instance will have to generate.

For more on this, you can have a look at this proposal I had made to improve Snakemake's documentation: https://bitbucket.org/snakemake/snakemake/pull-requests/307/documentation-to-help-understand-expand/diff

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM