Snakemake - dynamically deriving targets from input files

Question

I have a large number of input files organized like this:

data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
|   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
|   :
│   └── asdfX_R2.fq.gz
:   
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz

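If you are familiar with bioinformatics, these are of course fastq files from paired-end sequencing runs. I am trying to write a Snakemake workflow that reads all of them, but I already fail at the first rule. This is my attempt: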

configfile: "config.yaml"

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])

rule clip_and_trim_reads:
    input:
        read1=expand("{data}/{set}/{{sample}}_R1.fq.gz", data=config["raw_data"], set=config["sets"]),
        read2=expand("{data}/{set}/{{sample}}_R2.fq.gz", data=config["raw_data"], set=config["sets"])
    output:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])
    threads: 16
    shell:
        """
        someTool -o {output.read1} -p {output.read2} \
        {input.read1} {input.read2}
        """

I cannot specify clip_and_trim_reads as a target, because "Target rules may not contain wildcards". I tried adding an all rule, but that gives me a different error:

$ snakemake -np
Building DAG of jobs...
WildcardError in line 3 of /work/project/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'

I also tried using the dynamic() function for the all rule. Oddly enough it found the files, but it still gave me this error:

$ snakemake -np
Dynamic output is deprecated in favor of checkpoints (see docs). It will be removed in Snakemake 6.0.
Building DAG of jobs...
MissingInputException in line 7 of /work/project/ladsie_002/analyses/scripts/2019-08-02_splice_leader_HiC/Snakefile:
Missing input files for rule clip_and_trim_reads:
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
[...]

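I have more than a hundred different files, so I would very much like to avoid creating a list with all the file names. Any ideas how to achieve this?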

Answer 1

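I think you misunderstand how Snakemake works. When you run Snakemake, you either name the desired output on the command line, or Snakemake generates the input of the first rule in the Snakefile (your rule all). Since you did not specify any output (snakemake -np), Snakemake will try to generate the input of rule all.

The input of your rule all is basically: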

"somepath/clipped_and_trimmed_reads/{sample}_R1.fq.gz"
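As a rough illustration of why {sample} is still there: expand() fills in the named fields it is given and treats doubled braces as an escape, much like Python's str.format. The sketch below uses str.format as a stand-in (an assumption for illustration, not Snakemake's actual implementation), with "somepath" as a placeholder for config["output"].

```python
# Stand-in for expand(): str.format also turns "{{sample}}" into a literal "{sample}".
pattern = "{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz"

# "somepath" is a placeholder for config["output"].
target = pattern.format(output="somepath")

# The result still contains the unresolved {sample} placeholder.
print(target)
```

Nothing in rule all supplies a value for sample, so the requested target is not a concrete file name.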

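Unfortunately, Snakemake does not know how to generate the output from this, because {sample} is still an unresolved wildcard. We need to tell Snakemake which files to use. We could do this manually: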

rule all:
    input:
        "somepath/clipped_and_trimmed_reads/file1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/asdf1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/zxcv1_R1.fq.gz"

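But this becomes very cumbersome as we get more files, and as you said in the question, it is not what you want. We need a small piece of code that collects all the file names for us.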

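import glob
import re

data = config["raw_data"]
samples = []
locations = {}  # maps each sample name to the set (directory) that contains it
# "**" with recursive=True walks every sub-directory below the raw data folder
for path in glob.glob(data + "/**", recursive=True):
    if '_R1.fq.gz' in path:
        # e.g. "data/set1/file1_R1.fq.gz" -> ["data", "set1", "file1", ".fq.gz"]
        split = re.split('/|_R1', path)
        filename, directory = split[-2], split[-3]
        locations[filename] = directory  # we will need this one later
        samples.append(filename)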
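To see what this discovery step produces, here is a self-contained sketch that can be run outside Snakemake. It wraps the same logic in a hypothetical helper, discover_samples (not part of Snakemake), and runs it on a throwaway directory tree that mimics the layout from the question. It assumes POSIX-style "/" path separators and sample names that are unique across sets.

```python
import glob
import os
import re
import tempfile


def discover_samples(data):
    """Return (samples, locations) for every *_R1.fq.gz file found below `data`."""
    samples, locations = [], {}
    # sorted() only makes the order reproducible for this demonstration.
    for path in sorted(glob.glob(data + "/**", recursive=True)):
        if "_R1.fq.gz" in path:
            parts = re.split("/|_R1", path)
            filename, directory = parts[-2], parts[-3]
            locations[filename] = directory
            samples.append(filename)
    return samples, locations


# Build a tiny temporary tree with two sets and one read pair each.
with tempfile.TemporaryDirectory() as root:
    for set_name, sample in [("set1", "file1"), ("another_set", "asdf1")]:
        os.makedirs(os.path.join(root, set_name), exist_ok=True)
        for read in ("R1", "R2"):
            open(os.path.join(root, set_name, f"{sample}_{read}.fq.gz"), "w").close()
    samples, locations = discover_samples(root)

print(samples)
print(locations)
```

Snakemake also ships a glob_wildcards helper for pattern-based discovery of this kind, which may be worth a look as an alternative to hand-written globbing.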

We can now feed this to our rule all:

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz", output=config["output"], sample=samples),
        read2=expand("{output}/clipped_and_trimmed_reads/{sample}_R2.fq.gz", output=config["output"], sample=samples)

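Note that sample is no longer a wildcard here. Instead we "expand" it into read1 and read2, which produces every possible combination of output and sample.

However, we are only halfway done. If we call Snakemake like this, it knows exactly which output we want and which rule can generate it (rule clip_and_trim_reads). But it still does not know where to look for the input files. Luckily, we already have a dictionary that stores this (locations).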
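The "all combinations" behaviour can be sketched in plain Python. The function expand_sketch below is a simplified, hypothetical stand-in for Snakemake's expand(), and the values "results", "file1" and "asdf1" are made-up placeholders for config["output"] and the discovered samples.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Simplified stand-in for expand(): fill `pattern` with every combination of values."""
    # Wrap single strings in a list so that every keyword maps to a list of values.
    values = {k: [v] if isinstance(v, str) else list(v) for k, v in wildcards.items()}
    keys = list(values)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*(values[k] for k in keys))
    ]


targets = expand_sketch(
    "{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz",
    output="results",            # placeholder for config["output"]
    sample=["file1", "asdf1"],   # placeholder for the discovered samples
)
print(targets)
```

With one output directory and N samples this yields N concrete file names, which is exactly the list of targets rule all needs.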

rule clip_and_trim_reads:
    input:
        read1=lambda wildcards: expand("{data}/{set}/{sample}_R1.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample),
        read2=lambda wildcards: expand("{data}/{set}/{sample}_R2.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample)
    output:
        ...
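The lambda in the input section is an input function: Snakemake calls it with the wildcard values matched from the requested output file. A minimal plain-Python sketch of that lookup follows. The names read1_input and raw_data and the dictionary contents are hypothetical, and SimpleNamespace merely mimics the wildcards object.

```python
from types import SimpleNamespace

# Hypothetical values standing in for config["raw_data"] and the locations dict.
raw_data = "data"
locations = {"file1": "set1", "asdf1": "another_set"}


def read1_input(wildcards):
    """Build the raw R1 path for the sample named in wildcards.sample."""
    return f"{raw_data}/{locations[wildcards.sample]}/{wildcards.sample}_R1.fq.gz"


# Mimic the call Snakemake would make for an output file of sample "asdf1".
path = read1_input(SimpleNamespace(sample="asdf1"))
print(path)
```

Returning a single string, as here, or a one-element list, as the expand() call in the rule does, should both work for an input function.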

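Now everything should work! Even better: since all results of rule clip_and_trim_reads are written to a single folder, continuing from here should be much easier!

P.S. I have not tested any of this code, so it may not all work on the first try. The message still stands, though.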

Solution 1
7 (accepted) 2019-08-02 19:00:22
