Trouble with wildcards in snakemake

I am having trouble with wildcards not being converted to the expected values. This is the Snakefile:

import pandas as pd

configfile: "config.json"
experiments = pd.read_csv(config["experiments"], sep = '\t')
experiments['Name'] = [filename.split('/')[-1].split('_R' if ',' in filename else '.fa')[0] for filename in experiments['Files']]
name2sample = {experiments.iloc[i]['Name'] : experiments.iloc[i]['Sample'] for i in range(len(experiments))}
mg_experiments = experiments[experiments["Data type"] == 'dna']

def preprocess_input(wildcards):
    # get files with matching names
    df = experiments.loc[experiments['Name'] == wildcards.name, 'Files']
    # get first value (in case multiple) and split on commas
    return df.iloc[0].split(',')

def join_reads_input(wildcards):
    df = mg_experiments.loc[mg_experiments['Sample'] == wildcards.sample, 'Files']
    names = [filename.split('/')[-1].split('_R' if ',' in filename else '.fa')[0] for filename in df]
    return ['{}/Preprocess/Trimmomatic/quality_trimmed_{}{}.fq'.format(config["output"], name, fr) for name in names
        for files in df for fr in (['_forward_paired', '_reverse_paired'] if ',' in files else [''])]

rule all:
    input:
        expand("{output}/Annotation/uniprotinfo.tsv", output = config["output"], sample = experiments["Sample"]),
        expand("{output}/Annotation/{sample}/protein2cog.tsv", output = config["output"], sample = experiments["Sample"]),
        expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq", output = config["output"],
            fr = (['_forward_paired', '_reverse_paired'] if experiments["Files"].str.contains(',').tolist() else ''),
               name = experiments['Name'])

rule preprocess:
    input:
        preprocess_input
    output:
        expand("{{output}}/Preprocess/Trimmomatic/quality_trimmed_{{name}}{fr}.fq",
            fr = (['_forward_paired', '_reverse_paired'] if experiments["Files"].str.contains(',').tolist() else ''))
    threads:
        config["threads"]
    run:
        shell("python preprocess.py -i {reads} -t {threads} -o {output}/Preprocess -adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases -d {data_type}",
            output = config["output"], data_type = experiments.loc[experiments['Name'] == wildcards.name]["Data type"].iloc[0], reads = ",".join(input))

rule join_reads:
    input:
        join_reads_input
    output:
        expand("{output}/Assembly/{{sample}}/{{sample}}{fr}.fastq", output = config["output"],
            fr = (['_forward', '_reverse'] if experiments["Files"].str.contains(',').tolist() else ''))
    run:
        for file in input:
            print(file)
            if 'forward' in file:
                shell("touch {output}/Assembly/{wildcards.sample}/{wildcards.sample}_forward.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}/{wildcards.sample}_forward.fastq", output = config["output"])
            elif 'reverse' in file:
                shell("touch {output}/Assembly/{wildcards.sample}/{wildcards.sample}_reverse.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}/{wildcards.sample}_reverse.fastq", output = config["output"])
            else:
                shell("touch {output}/Assembly/{wildcards.sample}/{wildcards.sample}.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}/{wildcards.sample}.fastq", output = config["output"])

rule assembly:
    input:
        expand("{output}/Assembly/{{sample}}/{{sample}}{fr}.fastq", output = config["output"],
            fr = (['_forward', '_reverse'] if experiments["Files"].str.contains(',').tolist() else ''))
    output:
        expand("{output}/Assembly/{{sample}}/contigs.fasta", output = config["output"])
    threads:
        config["threads"]
    run:
        reads = ",".join(input)
        shell("python assembly.py -r {reads} -t {threads} -o {output}/Assembly/{{sample}} -a {assembler}",
            output = config["output"], assembler = config["assembler"])

The Snakefile might be very confusing because of my inexperience. rule preprocess runs the preprocess script. rule join_reads concatenates the resulting reads (the Preprocess/Trimmomatic/quality_trimmed part) by sample (defined in the experiments file below), so that they can be submitted together to assembly. This is the config file:

{
  "output": "output",
  "threads": 14,
  "experiments": "experiments.tsv",
  "assembler": "metaspades"
}

and this is the experiments.tsv file:

Files   Sample  Data type   Condition
path/to/mg_R1.fastq,path/to/mg_R2.fastq Sample  dna
path/to/a/0.01/mt_0.01a_R1.fastq,path/to/a/0.01/mt_0.01a_R2.fastq   Sample  mrna    c1
path/to/b/0.01/mt_0.01b_R1.fastq,path/to/b/0.01/mt_0.01b_R2.fastq   Sample  mrna    c1
path/to/c/0.01/mt_0.01c_R1.fastq,path/to/c/0.01/mt_0.01c_R2.fastq   Sample  mrna    c1
path/to/a/1/mt_1a_R1.fastq,path/to/a/1/mt_1a_R2.fastq   Sample  mrna    c2
path/to/b/1/mt_1b_R1.fastq,path/to/b/1/mt_1b_R2.fastq   Sample  mrna    c2
path/to/c/1/mt_1c_R1.fastq,path/to/c/1/mt_1c_R2.fastq   Sample  mrna    c2
path/to/a/100/mt_100a_R1.fastq,path/to/a/100/mt_100a_R2.fastq   Sample  mrna    c3
path/to/b/100/mt_100b_R1.fastq,path/to/b/100/mt_100b_R2.fastq   Sample  mrna    c3
path/to/c/100/mt_100c_R1.fastq,path/to/c/100/mt_100c_R2.fastq   Sample  mrna    c3

The problem is that the cat step (rule join_reads) reports a MissingOutputException because it can't find the file output/Assembly/{wildcards.sample}_forward.fastq (and the reverse one). This means wildcards.sample was not converted to "Sample", and I don't understand why. The rule still produces the files correctly, but it stops the workflow, which then has to be executed again. From there it goes well, because the assembly rule already has its input files.

Why is wildcards.sample not converted to "Sample"?

There's a lot here. For your particular problem, I think that passing keyword arguments to shell prevents snakemake from formatting the remaining wildcards. Change {output}/Assembly/{wildcards.sample}/{wildcards.sample}_reverse.fastq to {output}/Assembly/{sample}/{sample}_reverse.fastq and pass sample as an argument to shell.
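To make the intended substitution concrete, here is a minimal plain-Python sketch. It uses str.format rather than snakemake's shell(), so it only shows the string you should end up with and says nothing about snakemake's internals. The helper name build_cat_command and the example input path are assumptions made for illustration.

```python
def build_cat_command(file, output, sample, suffix=""):
    """Return the command that appends `file` to the per-sample FASTQ.

    Hypothetical helper for illustration; it is not part of snakemake.
    """
    # Every placeholder is supplied explicitly, so none is left unformatted.
    target = "{output}/Assembly/{sample}/{sample}{suffix}.fastq".format(
        output=output, sample=sample, suffix=suffix
    )
    return "cat {file} >> {target}".format(file=file, target=target)


# Example values taken from the config and experiments.tsv in the question;
# the trimmed file name is an assumed example.
cmd = build_cat_command(
    "output/Preprocess/Trimmomatic/quality_trimmed_mg_reverse_paired.fq",
    output="output",
    sample="Sample",
    suffix="_reverse",
)
print(cmd)
```

Inside the run block, the equivalent change would be to write {sample} in the command string and call shell(..., output = config["output"], sample = wildcards.sample).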

Other suggestions:

  • Place input functions above the rules they apply to.
  • Use multiple rules instead of complex expands in your inputs/outputs. You can have two rules with the same output file, one that takes paired inputs and one that takes unpaired inputs.
  • If you have a run directive that just invokes shell, replace it with a shell directive. You can move the reads=','.join(input) logic into a params directive. You can place config[assembler] directly into a shell format token, e.g. shell: python assembly.py ... -a {config[assembler]}
  • Use allow_missing in expand instead of escaping the wildcard formatting with {{}}.
  • cat {file} >> {output} will append the file to output even if the output doesn't exist (you don't need the touch).
  • Try to keep your lines under 100 characters so they display properly on stackoverflow or github.
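As a rough sketch of how some of these suggestions could look, the helpers below decide paired versus unpaired per Files entry (the same test join_reads_input already uses) and join the reads in a function that could back a params directive. Note that the expression experiments["Files"].str.contains(',').tolist() in the question yields a list, and any non-empty list is truthy in Python, so that condition does not distinguish paired from unpaired rows. The names read_suffixes, trimmed_outputs and joined_reads are hypothetical, and path/to/mg.fastq is an invented unpaired example.

```python
def read_suffixes(files):
    """Return the read suffixes for one comma-separated 'Files' entry."""
    # A comma means a forward/reverse pair; otherwise a single unpaired file.
    return ["_forward_paired", "_reverse_paired"] if "," in files else [""]


def trimmed_outputs(output_dir, name, files):
    """List the trimmed read files expected for one experiment row."""
    return [
        f"{output_dir}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq"
        for fr in read_suffixes(files)
    ]


def joined_reads(wildcards, input):
    """Comma-join the input files; intended for use as a params function."""
    return ",".join(input)


# Paired row from experiments.tsv, plus an assumed unpaired entry.
paired = trimmed_outputs("output", "mg", "path/to/mg_R1.fastq,path/to/mg_R2.fastq")
single = trimmed_outputs("output", "mg", "path/to/mg.fastq")
reads = joined_reads(None, paired)
```

In a Snakefile, joined_reads could be attached as params: reads = joined_reads and referenced as {params.reads} in a shell directive, which removes the need for a run block in rules such as assembly.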

I think there is a lot of simplification you can do in the logic, but I don't know enough about your tools to recommend anything more specific.
