
Snakemake: Is it possible to use directories as wildcards?

Hi, I'm new to Snakemake and have a question. I want to run a tool on multiple data sets. Each data set represents one tissue, and the fastq files for each tissue are stored in that tissue's directory. The rough command for the tool is:

  python TEcount.py -rosette rosettefile -TE te_references -count result/tissue/output.csv -RNA <LIST OF FASTQ FILE FOR THE RESPECTIVE SAMPLE>          

The tissues should be the wildcards. How can I do this? Below is my first attempt, which did not work.

import os                                                                        

#collect data sets                                                               
SAMPLES=os.listdir("data/rnaseq/")                                               


rule all:                                                                        
    input:                                                                       
        expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)                   

rule run_TEtools:                                                                
    input:                                                                       
        TEcount='scripts/TEtools/TEcount.py',                                    
        rosette='data/prepared_data/rosette/rosette',                            
        te_references='data/prepared_data/references/all_TE_instances.fa'        
    params:
        #collect the fastq files in the tissue directory                                                              
        fastq_files = os.listdir("data/rnaseq/{sample}")                         
    output:                                                                      
        'results/{sample}/TEtools.{sample}.output.csv'                           
    shell:                                                                       
        'python {input.TEcount} -rosette {input.rosette} -TE {input.te_references} -count {output} -RNA {params.fastq_files}'

In the rule run_TEtools, Snakemake does not know what {sample} is.

A Snakemake wildcard can be anything. It is basically just a string.
There are some issues with the way you are trying to achieve what you want.

OK, here's how I would do it. Explanations follow:

import os                                                                        

#collect data sets
# Beware no other directories or files (than those containing fastqs) should be in that folder                                                        
SAMPLES=os.listdir("data/rnaseq/")                                               

def getFastqFilesForTissu(wildcards):
    fastqs = list()
    # Beware no other files than fastqs should be there
    for s in os.listdir("data/rnaseq/"+wildcards.sample):
        fastqs.append(os.path.join("data/rnaseq",wildcards.sample,s))
    return fastqs

rule all:                                                                        
    input:                                                                       
        expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)                   

rule run_TEtools:                                                                
    input:                                                                       
        TEcount='scripts/TEtools/TEcount.py',                                    
        rosette='data/prepared_data/rosette/rosette',                            
        te_references='data/prepared_data/references/all_TE_instances.fa',
        fastq_files = getFastqFilesForTissu        
    output:                                                                      
        'results/{sample}/TEtools.{sample}.output.csv'                           
    shell:                                                                       
        'python {input.TEcount} -rosette {input.rosette} -TE {input.te_references} -count {output} -RNA {input.fastq_files}'

First of all, your fastq files should be defined as inputs so that Snakemake knows they are files and that the rule must be rerun if they change. Defining input files as params is bad practice: params are meant for parameters, usually not for files.
Second, your script file is defined as an input. Be aware that every time you modify it, the rule will be rerun. Maybe that's what you want.

I would use a function to get the fastq files in each directory. If you want to call a function (like os.listdir()), you can't use wildcards directly in its argument: the string "data/rnaseq/{sample}" is passed to the function as-is, with no substitution. You have to pass the wildcards into the function as a Python object. You can either define a function that takes one argument, a wildcards object containing all your wildcards, or use the lambda keyword (e.g. input = lambda wildcards: myFunction(wildcards.sample)).
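To make the difference concrete, here is a minimal sketch in plain Python, runnable outside Snakemake. It assumes only that Snakemake calls an input function with an object whose attributes are the wildcard values; SimpleNamespace stands in for that object, and the tissue name "liver" and the helper names are made up for illustration.

```python
from types import SimpleNamespace

RNASEQ_DIR = "data/rnaseq"  # directory layout taken from the question


def fastq_dir_for(wildcards):
    """Build the tissue directory path when the function is called."""
    return f"{RNASEQ_DIR}/{wildcards.sample}"


# Equivalent lambda form; either can be assigned to an input in a rule.
fastq_dir_lambda = lambda wildcards: f"{RNASEQ_DIR}/{wildcards.sample}"

# Stand-in for the wildcards object (hypothetical sample name).
fake_wildcards = SimpleNamespace(sample="liver")

# What os.listdir() received in the question: the braces are still there,
# because nothing substitutes {sample} in an ordinary Python string.
literal = "data/rnaseq/{sample}"

resolved = fastq_dir_for(fake_wildcards)
print(literal)
print(resolved)
```

The function form defers the lookup until Snakemake knows which sample it is building, which is why it works where the direct call does not.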
Another problem is that os.listdir() returns bare file names, without the directory path. Also beware that the order in which os.listdir() returns your fastq files is arbitrary. Maybe that doesn't matter for your command.
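If the order or stray files do matter, one possible way to harden the listing is sketched below. The helper names and the list of fastq suffixes are assumptions for illustration, not part of the original answer; adjust the suffixes to your own file naming.

```python
import os

# Assumed fastq naming conventions; change to match your data.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def list_sample_dirs(root):
    """Return the sorted names of the subdirectories of root (one per tissue)."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def list_fastqs(root, sample):
    """Return the sorted paths, including root, of the fastq files of one sample."""
    sample_dir = os.path.join(root, sample)
    return sorted(
        os.path.join(sample_dir, name)
        for name in os.listdir(sample_dir)
        if name.endswith(FASTQ_SUFFIXES)  # skip anything that is not a fastq
    )
```

In the Snakefile, SAMPLES could then be set to list_sample_dirs("data/rnaseq"), and the input function could return list_fastqs("data/rnaseq", wildcards.sample). Sorting gives the same command line on every run, and the filters ignore files that are not samples or fastqs.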
