
How to use expand in snakemake when some combinations of wildcards are not desired (missing input files), with a "merge" rule?

This is a slightly more complicated case than the one reported here. My input files are the following:

ont_Hv1_2.5+.fastq
ont_Hv2_2.5+.fastq
pacBio_Hv1_1-1.5.fastq
pacBio_Hv1_1.5-2.5.fastq
pacBio_Hv1_2.5+.fastq
pacBio_Hv2_1-1.5.fastq
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq

I would like to process only existing input files, i.e. automatically skip those wildcard combinations that correspond to non-existing input files.
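To make the gap concrete, here is a rough plain-Python sketch (not part of the Snakefile) that parses the file names listed above and compares them with the full Cartesian product of the three wildcards. The regular expression and the variable names are assumptions chosen for illustration; the pattern only mirrors the `{techname}_{capDesign}_{sizeFrac}.fastq` layout.

```python
import re
from itertools import product

# File names copied from the listing above.
FASTQ_FILES = [
    "ont_Hv1_2.5+.fastq", "ont_Hv2_2.5+.fastq",
    "pacBio_Hv1_1-1.5.fastq", "pacBio_Hv1_1.5-2.5.fastq", "pacBio_Hv1_2.5+.fastq",
    "pacBio_Hv2_1-1.5.fastq", "pacBio_Hv2_1.5-2.5.fastq", "pacBio_Hv2_2.5+.fastq",
    "pacBio_Mv1_1-1.5.fastq", "pacBio_Mv1_1.5-2.5.fastq", "pacBio_Mv1_2.5+.fastq",
]

# Three fields separated by "_", none of which may contain "_" or "/".
PATTERN = re.compile(r"^([^_/]+)_([^_/]+)_([^_/]+)\.fastq$")

# Combinations that really exist on disk (here: in the list above).
existing = {PATTERN.match(name).groups() for name in FASTQ_FILES}

technames = sorted({t for t, _, _ in existing})
cap_designs = sorted({c for _, c, _ in existing})
size_fracs = sorted({s for _, _, s in existing})

# Every combination a plain product would generate, and the ones with no file.
all_combinations = set(product(technames, cap_designs, size_fracs))
missing = sorted(all_combinations - existing)

print(len(all_combinations), "possible,", len(existing), "existing,", len(missing), "missing")
for combo in missing:
    print("no input file for", "_".join(combo) + ".fastq")
```

All of the missing combinations belong to `ont`, which only has the `2.5+` size fraction and no `Mv1` capture design.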

My Snakefile looks like this:

import glob
import os.path
from itertools import product

#make wildcards regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+",

# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)

# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
    if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
        tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)

# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    # Return a combinator for expand() that keeps only the combinations listed in whitelist.
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            # The first three (wildcard, value) pairs must match an authorized combination.
            if wc_comb[0:3] in whitelist:
                yield wc_comb
    return filtered_combinator

filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)

rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)

#get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field

#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"

Rule getReadLength collects read lengths for each input FASTQ (i.e. for each techname, capDesign, sizeFrac combo).

Rule aggReadLength merges the read length statistics generated by getReadLength for each techname, capDesign combo.

The workflow fails with the following message:

Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq

So it seems that the wildcard combination filtering step applied to the target is not propagated to all the upstream rules it depends on. Does anyone know how to make it so?
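As a rough illustration of where the missing file comes from, the plain-Python sketch below mimics what the unfiltered `expand(...)` in aggReadLength asks for once `techname` is `ont` and `capDesign` is `Hv1`. The helper names are made up for this example, and `SIZEFRACS` is written as a sorted list here (it is a set in the Snakefile, so the real order may differ).

```python
# Size fractions found across all input files.
SIZEFRACS = ["1-1.5", "1.5-2.5", "2.5+"]

# Only the ont files are needed for this illustration.
EXISTING_FASTQ = {"ont_Hv1_2.5+.fastq", "ont_Hv2_2.5+.fastq"}


def unfiltered_agg_inputs(techname, cap_design):
    """Mimic the aggReadLength input: one .readlength.tsv per size fraction, no filtering."""
    return [f"{techname}_{cap_design}_{size_frac}.readlength.tsv" for size_frac in SIZEFRACS]


def required_fastq(tsv_name):
    """getReadLength builds X.readlength.tsv from X.fastq, so map the name back."""
    return tsv_name[: -len(".readlength.tsv")] + ".fastq"


needed = [required_fastq(tsv) for tsv in unfiltered_agg_inputs("ont", "Hv1")]
missing = [fastq for fastq in needed if fastq not in EXISTING_FASTQ]

print("needed :", needed)
print("missing:", missing)
```

The target `ont_Hv1_all.readlength.tsv` is legitimate, but building it requests every size fraction, including `ont_Hv1_1-1.5.fastq`, which is the file named in the error message.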

(Using Snakemake version 4.4.0.)

Thanks a lot in advance.

I think I solved the problem; hopefully this will be useful to someone else.

In the aggReadLength rule, I replaced

input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)

with

input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)
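The effect of this change can be checked outside Snakemake with a small stand-in. The sketch below is only an approximation: `expand_sketch` is a hypothetical, simplified imitation of how `expand` hands lists of (wildcard, value) pairs to the combinator, and `AUTHORIZED` holds just the Hv1 entries for brevity.

```python
from itertools import product


def expand_sketch(pattern, combinator=product, **wildcards):
    """Simplified stand-in for expand(): format the pattern once per combination."""
    def as_list(values):
        return [values] if isinstance(values, str) else list(values)

    # One pool of (wildcard, value) pairs per keyword, in keyword order.
    pools = [[(name, value) for value in as_list(values)]
             for name, values in wildcards.items()]
    return [pattern.format(**dict(comb)) for comb in combinator(*pools)]


# Authorized combinations, in the same (techname, capDesign, sizeFrac) order
# as the keywords passed below. Only Hv1 entries are listed here.
AUTHORIZED = [
    (("techname", "ont"), ("capDesign", "Hv1"), ("sizeFrac", "2.5+")),
    (("techname", "pacBio"), ("capDesign", "Hv1"), ("sizeFrac", "1-1.5")),
    (("techname", "pacBio"), ("capDesign", "Hv1"), ("sizeFrac", "1.5-2.5")),
    (("techname", "pacBio"), ("capDesign", "Hv1"), ("sizeFrac", "2.5+")),
]


def filtered_product(*pools):
    """Yield only the combinations whose first three pairs are authorized."""
    for comb in product(*pools):
        if comb[0:3] in AUTHORIZED:
            yield comb


SIZEFRACS = ["1-1.5", "1.5-2.5", "2.5+"]
PATTERN = "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"

ont_inputs = expand_sketch(PATTERN, filtered_product,
                           techname="ont", capDesign="Hv1", sizeFrac=SIZEFRACS)
pacbio_inputs = expand_sketch(PATTERN, filtered_product,
                              techname="pacBio", capDesign="Hv1", sizeFrac=SIZEFRACS)

print(ont_inputs)
print(pacbio_inputs)
```

With the filter in place, the `ont`/`Hv1` merge only asks for the `2.5+` file, while `pacBio`/`Hv1` still gets all three size fractions. Note that the comparison depends on the keywords being passed in the same order as the pairs stored in the authorized list.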
