Trouble with wildcards in snakemake
I am having trouble with wildcards not being converted to their expected values. This is the Snakefile:
import pandas as pd
configfile: "config.json"
experiments = pd.read_csv(config["experiments"], sep = '\t')
experiments['Name'] = [filename.split('/')[-1].split('_R' if ',' in filename else '.fa')[0] for filename in experiments['Files']]
name2sample = {experiments.iloc[i]['Name'] : experiments.iloc[i]['Sample'] for i in range(len(experiments))}
mg_experiments = experiments[experiments["Data type"] == 'dna']
def preprocess_input(wildcards):
    # get files with matching names
    df = experiments.loc[experiments['Name'] == wildcards.name, 'Files']
    # get first value (in case multiple) and split on commas
    return df.iloc[0].split(',')

def join_reads_input(wildcards):
    df = mg_experiments.loc[mg_experiments['Sample'] == wildcards.sample, 'Files']
    names = [filename.split('/')[-1].split('_R' if ',' in filename else '.fa')[0] for filename in df]
    return ['{}/Preprocess/Trimmomatic/quality_trimmed_{}{}.fq'.format(config["output"], name, fr) for name in names
        for files in df for fr in (['_forward_paired', '_reverse_paired'] if ',' in files else [''])]
rule all:
    input:
        expand("{output}/Annotation/uniprotinfo.tsv", output = config["output"], sample = experiments["Sample"]),
        expand("{output}/Annotation/{sample}/protein2cog.tsv", output = config["output"], sample = experiments["Sample"]),
        expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq", output = config["output"],
            fr = (['_forward_paired', '_reverse_paired'] if experiments["Files"].str.contains(',').tolist() else ''),
            name = experiments['Name'])
rule preprocess:
    input:
        preprocess_input
    output:
        expand("{{output}}/Preprocess/Trimmomatic/quality_trimmed_{{name}}{fr}.fq",
            fr = (['_forward_paired', '_reverse_paired'] if experiments["Files"].str.contains(',').tolist() else ''))
    threads:
        config["threads"]
    run:
        shell("python preprocess.py -i {reads} -t {threads} -o {output}/Preprocess -adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases -d {data_type}",
            output = config["output"], data_type = experiments.loc[experiments['Name'] == wildcards.name]["Data type"].iloc[0], reads = ",".join(input))
rule join_reads:
    input:
        join_reads_input
    output:
        expand("{output}/Assembly/{{sample}}/{{sample}}{fr}.fastq", output = config["output"],
            fr = (['_forward', '_reverse'] if experiments["Files"].str.contains(',').tolist() else ''))
    run:
        for file in input:
            print(file)
            if 'forward' in file:
                shell("touch {output}/Assembly/{wildcards.sample}/{wildcards.sample}_forward.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}/{wildcards.sample}_forward.fastq", output = config["output"])
            elif 'reverse' in file:
                shell("touch {output}/Assembly/{wildcards.sample}/{wildcards.sample}_reverse.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}/{wildcards.sample}_reverse.fastq", output = config["output"])
            else:
                shell("touch {output}/Assembly/{wildcards.sample}/{wildcards.sample}.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}/{wildcards.sample}.fastq", output = config["output"])
rule assembly:
    input:
        expand("{output}/Assembly/{{sample}}/{{sample}}{fr}.fastq", output = config["output"],
            fr = (['_forward', '_reverse'] if experiments["Files"].str.contains(',').tolist() else ''))
    output:
        expand("{output}/Assembly/{{sample}}/contigs.fasta", output = config["output"])
    threads:
        config["threads"]
    run:
        reads = ",".join(input)
        shell("python assembly.py -r {reads} -t {threads} -o {output}/Assembly/{{sample}} -a {assembler}",
            output = config["output"], assembler = config["assembler"])
The Snakefile might be very confusing because of my inexperience. rule preprocess runs the preprocess script, and rule join_reads concatenates the resulting reads (the Preprocess/Trimmomatic/quality_trimmed part) by sample (defined in the experiments file below), so they can be submitted together to assembly. This is the config file:
{
"output": "output",
"threads": 14,
"experiments": "experiments.tsv",
"assembler": "metaspades"
}
and this is the experiments.tsv file:
Files	Sample	Data type	Condition
path/to/mg_R1.fastq,path/to/mg_R2.fastq	Sample	dna
path/to/a/0.01/mt_0.01a_R1.fastq,path/to/a/0.01/mt_0.01a_R2.fastq	Sample	mrna	c1
path/to/b/0.01/mt_0.01b_R1.fastq,path/to/b/0.01/mt_0.01b_R2.fastq	Sample	mrna	c1
path/to/c/0.01/mt_0.01c_R1.fastq,path/to/c/0.01/mt_0.01c_R2.fastq	Sample	mrna	c1
path/to/a/1/mt_1a_R1.fastq,path/to/a/1/mt_1a_R2.fastq	Sample	mrna	c2
path/to/b/1/mt_1b_R1.fastq,path/to/b/1/mt_1b_R2.fastq	Sample	mrna	c2
path/to/c/1/mt_1c_R1.fastq,path/to/c/1/mt_1c_R2.fastq	Sample	mrna	c2
path/to/a/100/mt_100a_R1.fastq,path/to/a/100/mt_100a_R2.fastq	Sample	mrna	c3
path/to/b/100/mt_100b_R1.fastq,path/to/b/100/mt_100b_R2.fastq	Sample	mrna	c3
path/to/c/100/mt_100c_R1.fastq,path/to/c/100/mt_100c_R2.fastq	Sample	mrna	c3
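To make the naming logic concrete, here is a minimal plain-Python sketch (no pandas or snakemake) of how the Name column is derived from a Files entry, and which Trimmomatic outputs would be requested for one row. The helper names derive_name and trimmed_outputs are made up for illustration; the split rule and the path template are copied from the Snakefile, and the sketch handles a single row rather than the nested loop in join_reads_input.

```python
def derive_name(files: str) -> str:
    """Mirror the Snakefile's list comprehension for the 'Name' column."""
    # Paired-end entries hold two comma-separated paths and are cut at '_R';
    # single-file entries are cut at '.fa'.
    marker = '_R' if ',' in files else '.fa'
    return files.split('/')[-1].split(marker)[0]


def trimmed_outputs(files: str, output_dir: str = "output") -> list:
    """List the quality-trimmed files expected for one 'Files' entry."""
    name = derive_name(files)
    suffixes = ['_forward_paired', '_reverse_paired'] if ',' in files else ['']
    return ['{}/Preprocess/Trimmomatic/quality_trimmed_{}{}.fq'.format(output_dir, name, fr)
            for fr in suffixes]


# The dna row from experiments.tsv
dna_row = "path/to/mg_R1.fastq,path/to/mg_R2.fastq"
for path in trimmed_outputs(dna_row):
    print(path)
```

For the dna row the derived name is mg, so the two requested files are the forward and reverse quality_trimmed_mg files under output/Preprocess/Trimmomatic.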
The problem is that the cat step (rule join_reads) reports a MissingOutputException, because it can't find the file output/Assembly/{wildcards.sample}_forward.fastq (and the reverse one). That means wildcards.sample was not converted to "Sample", and I don't understand why. However, the rule still manages to produce the files correctly, although it stops the workflow, which then has to be executed again. From there it goes well, because the assembly rule already has its input files.
Why is wildcards.sample not converted to "Sample"?
There's a lot here. I think that, for your particular problem, passing keyword arguments to shell prevents snakemake from formatting the remaining wildcards. Change
{output}/Assembly/{wildcards.sample}/{wildcards.sample}_reverse.fastq
to
{output}/Assembly/{sample}/{sample}_reverse.fastq
and pass sample as an argument to shell.
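As a rough illustration of that change, the sketch below uses plain str.format as a stand-in for the formatting that shell performs; it is not snakemake's implementation. Every placeholder in the command is supplied explicitly as a keyword, which is what passing sample to shell amounts to. The function name build_cat_command and the example values are hypothetical.

```python
def build_cat_command(file: str, output: str, sample: str) -> str:
    """Fill a command template in which every placeholder is an explicit keyword."""
    # {sample} replaces {wildcards.sample}, so nothing is left for an
    # implicit lookup to resolve.
    template = "cat {file} >> {output}/Assembly/{sample}/{sample}_reverse.fastq"
    return template.format(file=file, output=output, sample=sample)


cmd = build_cat_command(
    file="output/Preprocess/Trimmomatic/quality_trimmed_mg_reverse_paired.fq",
    output="output",
    sample="Sample",
)
print(cmd)
```

Inside the rule, the equivalent call would be along the lines of shell("cat {file} >> {output}/Assembly/{sample}/{sample}_reverse.fastq", output = config["output"], sample = wildcards.sample).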
Other suggestions:
- Capture the reads=','.join(input) logic in a params directive.
- You can directly place config[assembler] into a shell format token, e.g. shell: python assembly.py ... -a {config[assembler]}.
- Use allow_missing in expand instead of escaping the wildcard formatting with {{}}.
- cat {file} >> {output} will append the file to the output even if the output doesn't exist yet, so the touch is not needed.
I think there is a lot of simplification you can do on the logic, but I don't know enough about your tools to recommend more specifics.
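The point about touch can be checked outside snakemake. This sketch, assuming a POSIX shell with cat available, appends a file twice to a target that does not exist beforehand; the file names and the helper append_with_cat are placeholders.

```python
import os
import shlex
import subprocess
import tempfile


def append_with_cat(src: str, dst: str) -> None:
    """Append src to dst through the shell; '>>' creates dst if it is missing."""
    command = "cat {} >> {}".format(shlex.quote(src), shlex.quote(dst))
    subprocess.run(command, shell=True, check=True)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "reads.fq")
    dst = os.path.join(tmp, "joined.fastq")
    with open(src, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")

    existed_before = os.path.exists(dst)  # no touch was run on dst
    append_with_cat(src, dst)
    append_with_cat(src, dst)

    with open(dst) as handle:
        joined = handle.read()

print(existed_before, joined.count("@read1"))
```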