
Snakemake multiple wildcards and argparse arguments

I am new to Snakemake and am finding it very difficult to do even the simplest things with it. For illustration, I have written a program, adding_text.py, that takes three arguments (via argparse): an input directory, an output directory, and an index into the os.listdir result for the input directory. It uses them to process some text files.

This is my file structure:

identity_category1  
|----A.txt -> text A identity  
|----B.txt -> text B identity  
|----C.txt -> text C identity  
identity_category2  
|----P.txt -> text P identity  
|----Q.txt -> text Q identity  
|----R.txt -> text R identity  
identity_category3  
|----X.txt -> text X identity  
|----Y.txt -> text Y identity  
|----Z.txt -> text Z identity  

And this is my code, adding_text.py:

import argparse
import os
my_parser = argparse.ArgumentParser(usage='python %(prog)s [-h] input_dir output_dir file_index')
my_parser.add_argument('input_dir', type=str)
my_parser.add_argument('output_dir', type=str)
my_parser.add_argument('file_index', type=int)
args = my_parser.parse_args()

input_dir = args.input_dir
output_dir = args.output_dir
file_index = args.file_index
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

filelist = os.listdir(input_dir)
input_file = open(os.path.join(input_dir, filelist[file_index]), 'r')
output_file = open(os.path.join(output_dir, filelist[file_index].split('.')[0] + '_added.txt'), 'w')
output_file.write(input_file.read() + ' has been added\n')

All I am doing is running the following commands at the console:

python adding_text.py identity_category1 1_added 0
python adding_text.py identity_category1 1_added 1
python adding_text.py identity_category1 1_added 2
python adding_text.py identity_category2 2_added 0
python adding_text.py identity_category2 2_added 1
python adding_text.py identity_category2 2_added 2
python adding_text.py identity_category3 3_added 0
python adding_text.py identity_category3 3_added 1
python adding_text.py identity_category3 3_added 2

And I get the following output (structure):

1_added
|----A_added.txt -> text A identity has been added
|----B_added.txt -> text B identity has been added
|----C_added.txt -> text C identity has been added
2_added
|----P_added.txt -> text P identity has been added
|----Q_added.txt -> text Q identity has been added
|----R_added.txt -> text R identity has been added
3_added
|----X_added.txt -> text X identity has been added
|----Y_added.txt -> text Y identity has been added
|----Z_added.txt -> text Z identity has been added

So the Python code isn't the problem. The problem arises when I try to design a Snakemake workflow around it, involving multiple wildcards, dependencies, etc. My possible_snakefile looks like this:

NUM = ["1", "2", "3"]
SAMPLE = ["A", "B", "C"]

rule add_text:
    input: 
        expand("identity_category{num}/{sample}.txt", num=NUM, sample=SAMPLE)
    output: 
        expand("{num}_added/{sample}_added.txt", num=NUM, sample=SAMPLE)
    run:
        for index in range(0,3):
            shell("python adding_text.py identity_category{num} {num}_added index")

When I try to specify a target and perform a dry run via snakemake --cores 1 -n -s possible_snakefile 1_added/A_added.txt, it maps the input directories and their files incorrectly and throws this error:

MissingInputException in line 4 possible_snakefile:
Missing input files for rule add_text:
identity_category3/C.txt
identity_category2/A.txt
identity_category3/B.txt
identity_category2/B.txt
identity_category2/C.txt
identity_category3/A.txt

I am sure it's very simple, but I just cannot get my head around it, i.e. whether I need a different wildcard specification in possible_snakefile or different target files at the command line. I would appreciate help here. Thank you.

First of all, your design is fragile because it relies on the order of the filenames returned by os.listdir. That means that if you add one more file to the identity_category{num} directory, the result may change. That complicates the pipeline and makes it less predictable, so I'd advise you to rework the script and make the dependencies explicit. Anyway, in the rest of my answer I will assume that the script is something that you cannot change.
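As one possible illustration of what "making the dependencies explicit" could look like, here is a minimal sketch of a reworked script that receives the input file and the output file directly, instead of a directory and an index. The names add_text, main, input_file and output_file are assumptions made for this sketch; they are not part of the original script.

```python
import argparse
import os


def add_text(input_file, output_file):
    """Copy input_file to output_file, appending the same suffix as adding_text.py."""
    out_dir = os.path.dirname(output_file)
    if out_dir:
        # Create the output directory if needed (no error if it already exists).
        os.makedirs(out_dir, exist_ok=True)
    with open(input_file) as src, open(output_file, "w") as dst:
        dst.write(src.read() + " has been added\n")


def main(argv=None):
    # argv=None makes argparse read sys.argv, as a normal command-line script would.
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    args = parser.parse_args(argv)
    add_text(args.input_file, args.output_file)

# When used as a script, call main() under the usual
# `if __name__ == "__main__":` guard.
```

With a script of this shape, each output file depends on exactly one named input file, so adding a file to a directory no longer changes which file a given call processes.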

You need to specify a target: the file (or group of files or directories) that the pipeline shall produce. The target must contain no wildcards, because it has to be explicit. With your script it is not so obvious what the target is, but you can specify the group of {num}_added directories that you plan to get from the pipeline:

rule target:
    input:
        expand("{num}_added", num=NUM)

Note that {num} here is not a wildcard, because it is fully resolved by the expand function. This call returns a list of three elements, ["1_added", "2_added", "3_added"], so Snakemake knows what to produce. The rule above is equivalent to:

rule target:
    input:
        ["1_added", "2_added", "3_added"]
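To make the behaviour of expand more concrete, here is a rough pure-Python stand-in. The helper expand_sketch is hypothetical and is not Snakemake's implementation; it only mimics the default behaviour of combining every value of one keyword with every value of the others. It also suggests why the rule in the question asks for files such as identity_category2/A.txt: with two keywords, the result is the full cross product.

```python
from itertools import product


def expand_sketch(pattern, **values):
    """Fill `pattern` with every combination of the given keyword values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


NUM = ["1", "2", "3"]
SAMPLE = ["A", "B", "C"]

# One entry per value of num: 1_added, 2_added, 3_added.
targets = expand_sketch("{num}_added", num=NUM)

# 3 x 3 = 9 paths, including identity_category2/A.txt, which is not
# present in the file structure shown in the question.
inputs = expand_sketch("identity_category{num}/{sample}.txt", num=NUM, sample=SAMPLE)
```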

In addition, note that the name target is arbitrary, but it has to be the topmost rule in your Snakefile, because Snakemake uses the first rule as the default target when none is given on the command line.

OK, now Snakemake knows that it needs to produce 3 objects, and you can instruct it how to produce each of them:

rule make_added:
    input:
        "identity_category{num}"
    output:
        "{num}_added"
    ...
    # some magic would come here later

This rule tells Snakemake that to produce a single {num}_added directory it needs another directory, identity_category{num}, where {num} has to match. Here {num} is a wildcard: Snakemake substitutes its value automatically and runs this rule 3 times (more precisely, len(NUM) times).
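The matching step can be pictured roughly as follows. This is only a sketch of the idea using Python's re module, not Snakemake's actual code, and the helper name infer_input is hypothetical. Snakemake treats each wildcard as a regular-expression group (matching .+ by default), matches the requested output against the output pattern, and then fills the captured value into the input pattern.

```python
import re


def infer_input(requested_output):
    """Return the input directory a rule like make_added would need, or None."""
    # "{num}_added" viewed as a regex with a named group for the wildcard.
    match = re.fullmatch(r"(?P<num>.+)_added", requested_output)
    if match is None:
        return None  # The rule cannot produce this path.
    # Fill the captured wildcard value into the input pattern.
    return "identity_category{num}".format(**match.groupdict())


for target in ["1_added", "2_added", "3_added"]:
    print(target, "<-", infer_input(target))
```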

Now let's call your script:

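rule make_added:
    input:
        "identity_category{num}"
    output:
        # Snakemake expects directory outputs to be flagged with directory().
        directory("{num}_added")
    run:
        # The script selects a file by its position in os.listdir,
        # so it is called once per index.
        for index in range(0, 3):
            shell("python adding_text.py identity_category{wildcards.num} {wildcards.num}_added {index}")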

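Inside the shell command you cannot refer to the wildcard by its bare name; it has to be written as {wildcards.num}. Moreover, the variable index has to be put in braces so that its value, rather than the literal word index, is passed to the script.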
