How to use expand in snakemake when some combinations of wildcards are not desired (missing input files), with a “merge” rule?
This is a slightly more complicated case than the one reported here. My input files are the following:
ont_Hv1_2.5+.fastq
ont_Hv2_2.5+.fastq
pacBio_Hv1_1-1.5.fastq
pacBio_Hv1_1.5-2.5.fastq
pacBio_Hv1_2.5+.fastq
pacBio_Hv2_1-1.5.fastq
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq
I would like to process only existing input files, i.e. automatically skip those wildcard combinations that correspond to non-existing input files.
My Snakefile looks like this:
import glob
import os.path
from itertools import product
#make wildcards regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+",
# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)
# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
    if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
        tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)
# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            for ac in AUTHORIZEDCOMBINATIONS:
                if(wc_comb[0:3] == ac):
                    print ("SUCCESS")
                    yield(wc_comb)
                    break
    return filtered_combinator
filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)
rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)
#get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field
#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"
Rule getReadLength collects read lengths for each input FASTQ (i.e. for each techname, capDesign, sizeFrac combo).
Rule aggReadLength merges the read length statistics generated by getReadLength, for each techname, capDesign combo.
The workflow fails with the following message:
Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq
So it seems that the wildcard combination filtering step applied to the target is not propagated to all the upstream rules it depends on. Does anyone know how to make it so?
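To illustrate the suspected cause, here is a minimal sketch in plain Python (no Snakemake involved). It assumes that the unfiltered expand in aggReadLength simply substitutes every value of SIZEFRACS for a given techname/capDesign pair; the helper names unfiltered_inputs and missing_fastqs are made up for this illustration.

```python
# File names copied from the listing above, without the .fastq suffix.
EXISTING = {
    "ont_Hv1_2.5+", "ont_Hv2_2.5+",
    "pacBio_Hv1_1-1.5", "pacBio_Hv1_1.5-2.5", "pacBio_Hv1_2.5+",
    "pacBio_Hv2_1-1.5", "pacBio_Hv2_1.5-2.5", "pacBio_Hv2_2.5+",
    "pacBio_Mv1_1-1.5", "pacBio_Mv1_1.5-2.5", "pacBio_Mv1_2.5+",
}

# All size fractions seen in any file name, as the Snakefile's SIZEFRACS set.
SIZEFRACS = sorted({name.split("_")[2] for name in EXISTING})


def unfiltered_inputs(techname, capDesign):
    """Hypothetical helper: inputs requested when every sizeFrac is expanded."""
    return [f"{techname}_{capDesign}_{s}.readlength.tsv" for s in SIZEFRACS]


def missing_fastqs(techname, capDesign):
    """Hypothetical helper: FASTQ files those inputs would need but that do not exist."""
    return [
        f"{techname}_{capDesign}_{s}.fastq"
        for s in SIZEFRACS
        if f"{techname}_{capDesign}_{s}" not in EXISTING
    ]


print(unfiltered_inputs("ont", "Hv1"))
print(missing_fastqs("ont", "Hv1"))
```

For ont/Hv1 the second list contains ont_Hv1_1-1.5.fastq, which is the file named in the error message, while for pacBio/Hv1 it is empty.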
(Using Snakemake version 4.4.0.)
Thanks a lot in advance
I think I solved the problem, hopefully this will be useful to someone else.
In the aggReadLength rule, I replaced
input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
with
input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)
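For anyone who wants to check what the filtered expand returns without running the workflow, here is a rough sketch in plain Python. The function expand_like is a hypothetical, simplified stand-in that I wrote for illustration; it is not Snakemake's actual expand, and it assumes the combinator receives one list of (wildcard name, value) pairs per keyword, in keyword order. The whitelist only covers a few of the files, for brevity.

```python
from itertools import product

# Same shape as AUTHORIZEDCOMBINATIONS in the Snakefile: one tuple of
# (wildcard name, value) pairs per existing FASTQ file (subset only).
AUTHORIZED = [
    (("techname", "ont"), ("capDesign", "Hv1"), ("sizeFrac", "2.5+")),
    (("techname", "pacBio"), ("capDesign", "Hv1"), ("sizeFrac", "1-1.5")),
    (("techname", "pacBio"), ("capDesign", "Hv1"), ("sizeFrac", "1.5-2.5")),
    (("techname", "pacBio"), ("capDesign", "Hv1"), ("sizeFrac", "2.5+")),
]

SIZEFRACS = ["1-1.5", "1.5-2.5", "2.5+"]
PATTERN = "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"


def filtered_product(*args):
    """Yield only the combinations that appear in the whitelist."""
    for comb in product(*args):
        if comb[0:3] in AUTHORIZED:
            yield comb


def expand_like(pattern, combinator, **wildcards):
    """Hypothetical stand-in for expand(): format pattern for each combination."""
    pairs = [
        [(name, v) for v in ([values] if isinstance(values, str) else values)]
        for name, values in wildcards.items()
    ]
    return [pattern.format(**dict(comb)) for comb in combinator(*pairs)]


ont_inputs = expand_like(PATTERN, filtered_product,
                         techname="ont", capDesign="Hv1", sizeFrac=SIZEFRACS)
pacbio_inputs = expand_like(PATTERN, filtered_product,
                            techname="pacBio", capDesign="Hv1", sizeFrac=SIZEFRACS)
print(ont_inputs)
print(pacbio_inputs)
```

With the filter in place, ont/Hv1 only asks for the 2.5+ file, so getReadLength is never asked for a FASTQ that does not exist. The lambda is needed because techname and capDesign are only known once the output of aggReadLength has been matched.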