How to use expand in snakemake when some combinations of wildcards are not desired (missing input files) and a "merging" rule is used?

Question

This is a slightly more complicated case than the one reported here. My input files are the following:

ont_Hv1_2.5+.fastq
ont_Hv2_2.5+.fastq
pacBio_Hv1_1-1.5.fastq
pacBio_Hv1_1.5-2.5.fastq
pacBio_Hv1_2.5+.fastq
pacBio_Hv2_1-1.5.fastq
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq

I would like to process only existing input files, i.e. automatically skip those wildcard combinations that correspond to non-existing input files.

My Snakefile looks like this:

import glob
import os.path
from itertools import product

#make wildcards regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+",

# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)

# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
    if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
        tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)

# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            # keep only combinations whose (name, value) pairs are in the whitelist
            if wc_comb[0:3] in whitelist:
                yield wc_comb
    return filtered_combinator

filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)

rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)

#get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field

#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"

Rule getReadLength collects read lengths for each input FASTQ (i.e. for each techname, capDesign, sizeFrac combo).

Rule aggReadLength merges the read length statistics generated by getReadLength, for each techname, capDesign combo.

The workflow fails with the following message:

Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq

So it seems that the wildcard combination filtering applied to the target is not propagated to the upstream rules it depends on. Does anyone know how to make that happen?
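To see where the missing file comes from, here is a minimal plain-Python sketch (no Snakemake involved) of what the unfiltered expand in aggReadLength asks for. The file list is copied from the question; the helper names `requested_inputs` and `missing_fastq` are made up for illustration.

```python
# File names from the question, standing in for the files on disk.
FASTQ_FILES = [
    "ont_Hv1_2.5+.fastq", "ont_Hv2_2.5+.fastq",
    "pacBio_Hv1_1-1.5.fastq", "pacBio_Hv1_1.5-2.5.fastq", "pacBio_Hv1_2.5+.fastq",
    "pacBio_Hv2_1-1.5.fastq", "pacBio_Hv2_1.5-2.5.fastq", "pacBio_Hv2_2.5+.fastq",
    "pacBio_Mv1_1-1.5.fastq", "pacBio_Mv1_1.5-2.5.fastq", "pacBio_Mv1_2.5+.fastq",
]

# (techname, capDesign, sizeFrac) triples that really exist.
EXISTING = {tuple(name[: -len(".fastq")].split("_")) for name in FASTQ_FILES}
# All size fractions seen across all files, like SIZEFRACS in the Snakefile.
SIZEFRACS = sorted({combo[2] for combo in EXISTING})


def requested_inputs(techname, cap_design):
    """Hypothetical helper: what an unfiltered expand over SIZEFRACS requests."""
    return [f"{techname}_{cap_design}_{sf}.readlength.tsv" for sf in SIZEFRACS]


def missing_fastq(techname, cap_design):
    """Hypothetical helper: FASTQ files getReadLength would need but cannot find."""
    return [
        f"{techname}_{cap_design}_{sf}.fastq"
        for sf in SIZEFRACS
        if (techname, cap_design, sf) not in EXISTING
    ]


print(requested_inputs("ont", "Hv1"))
print(missing_fastq("ont", "Hv1"))
```

The filter in rule all only decides which `_all.readlength.tsv` targets are requested. The input of aggReadLength is expanded separately, over every size fraction, which is why `ont_Hv1_1-1.5.fastq` ends up being required.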

(Using Snakemake version 4.4.0.)

Thanks a lot in advance.

Answer 1

I think I solved the problem; hopefully this will be useful to someone else.

In the aggReadLength rule, I replaced

input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)

with

input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)
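The effect of this change can be sketched in plain Python. This is a simplified stand-in, not Snakemake's real expand: `expand_sketch` and `agg_inputs` are hypothetical names, and the sketch assumes, as the Snakefile above does, that the combinator receives one list of (name, value) pairs per wildcard.

```python
from itertools import product

# File names from the question, standing in for the files on disk.
FASTQ_FILES = [
    "ont_Hv1_2.5+.fastq", "ont_Hv2_2.5+.fastq",
    "pacBio_Hv1_1-1.5.fastq", "pacBio_Hv1_1.5-2.5.fastq", "pacBio_Hv1_2.5+.fastq",
    "pacBio_Hv2_1-1.5.fastq", "pacBio_Hv2_1.5-2.5.fastq", "pacBio_Hv2_2.5+.fastq",
    "pacBio_Mv1_1-1.5.fastq", "pacBio_Mv1_1.5-2.5.fastq", "pacBio_Mv1_2.5+.fastq",
]
KEYS = ("techname", "capDesign", "sizeFrac")

# Whitelist shaped like AUTHORIZEDCOMBINATIONS: tuples of (name, value) pairs.
AUTHORIZED = [
    tuple(zip(KEYS, name[: -len(".fastq")].split("_"))) for name in FASTQ_FILES
]
SIZEFRACS = sorted({dict(combo)["sizeFrac"] for combo in AUTHORIZED})


def filtered_product(*args):
    """Yield only the combinations that correspond to an existing FASTQ file."""
    for combo in product(*args):
        if combo in AUTHORIZED:
            yield combo


def expand_sketch(pattern, combinator, **wildcards):
    """Simplified stand-in for expand(): a single string counts as one value."""
    pairs = [
        [(key, v) for v in ([values] if isinstance(values, str) else values)]
        for key, values in wildcards.items()
    ]
    return [pattern.format(**dict(combo)) for combo in combinator(*pairs)]


def agg_inputs(techname, cap_design):
    """Plays the role of the lambda: techname and capDesign are fixed by the output."""
    return expand_sketch(
        "{techname}_{capDesign}_{sizeFrac}.readlength.tsv",
        filtered_product,
        techname=techname, capDesign=cap_design, sizeFrac=SIZEFRACS,
    )


print(agg_inputs("ont", "Hv1"))
print(agg_inputs("pacBio", "Mv1"))
```

Because the input is now a function of the wildcards, techname and capDesign are already known when the expansion runs, so the whitelist can drop the size fractions that have no FASTQ file for that particular combination.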

Answer 1: accepted, 2018-01-18 14:25:08