How do I use wildcards in the arguments to Snakemake's expand function?

Question

I have a JSON file that looks like this:

{
    "foo": {
        "bar1": 
            {"A1": {"name": "A1", "path": "/path/to/A1"}, 
             "B1": {"name": "B1", "path": "/path/to/B1"},
             "C1": {"name": "C1", "path": "/path/to/C1"},
             "D1": {"name": "D1", "path": "/path/to/D1"}},
        "bar2": 
            {"A2": {"name": "A2", "path": "/path/to/A2"}, 
             "B2": {"name": "B2", "path": "/path/to/B2"},
             "C2": {"name": "C2", "path": "/path/to/C2"},
             "D2": {"name": "D2", "path": "/path/to/D2"}}}
}

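I am trying to run my Snakemake pipeline separately on the samples in the sample sets "bar1" and "bar2", putting the results into their own folders. When I expand my wildcards, I don't want every combination of sample set and sample. I only want each sample paired with its own group, like this: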

tmp/bar1/A1.bam
tmp/bar1/B1.bam
tmp/bar1/C1.bam
tmp/bar1/D1.bam
tmp/bar2/A2.bam
tmp/bar2/B2.bam
tmp/bar2/C2.bam
tmp/bar2/D2.bam

Hopefully my Snakefile helps explain. I tried setting it up like this:

sample_sets = [ i for i in config['foo'] ]

samples_dict = config['foo'] #cleans it up

def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = get_samples), sample_set = sample_sets),

This doesn't work: my file names end up containing "<function get_samples at 0x7f6e00544320>". I also tried:

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = list(samples_dict["{{sample_set}}"].keys())), sample_set = sample_sets),

But this raises a KeyError. I also tried this:

rule all:
    input:
        [ ["tmp/{{sample_set}}/{sample}.aligned_bam.core.bam".format( sample = sample ) for sample in list(samples_dict[sample_set].keys())] for sample_set in sample_sets ]

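This fails with the error "Wildcards in input files cannot be determined from output files: 'sample_set'".

I feel like there must be a simple way to do this, and maybe I'm just being an idiot.

Any help would be greatly appreciated. Let me know if I've left out any details.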

Answer 1

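It is possible to pass a custom combinatoric function to expand. Most of the time that function is zip. In your case, however, the nested dictionary shape would require designing a custom function. A simpler solution is to use plain Python to construct the list of desired files.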

d = {
    "foo": {
        "bar1": {
            "A1": {"name": "A1", "path": "/path/to/A1"},
            "B1": {"name": "B1", "path": "/path/to/B1"},
            "C1": {"name": "C1", "path": "/path/to/C1"},
            "D1": {"name": "D1", "path": "/path/to/D1"},
        },
        "bar2": {
            "A2": {"name": "A2", "path": "/path/to/A2"},
            "B2": {"name": "B2", "path": "/path/to/B2"},
            "C2": {"name": "C2", "path": "/path/to/C2"},
            "D2": {"name": "D2", "path": "/path/to/D2"},
        },
    }
}

list_files = []

for key in d["foo"]:
    for nested_key in d["foo"][key]:
        _tmp = f"tmp/{key}/{nested_key}.bam"
        list_files.append(_tmp)

print(*list_files, sep="\n")
#tmp/bar1/A1.bam
#tmp/bar1/B1.bam
#tmp/bar1/C1.bam
#tmp/bar1/D1.bam
#tmp/bar2/A2.bam
#tmp/bar2/B2.bam
#tmp/bar2/C2.bam
#tmp/bar2/D2.bam
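
As a side note on why the three attempts in the question fail: expand is evaluated as soon as the Snakefile is parsed, before any wildcard values exist, and as far as I know it fills the pattern with ordinary string formatting. Under that assumption, a minimal plain-Python sketch (no Snakemake needed; the trimmed samples_dict below is only a stand-in for config['foo']) reproduces each symptom:

```python
# Stand-in for config['foo'], trimmed for brevity (illustrative data only).
samples_dict = {
    "bar1": {"A1": {}, "B1": {}},
    "bar2": {"A2": {}, "B2": {}},
}


def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())


# Attempt 1: the function object itself is formatted into the path;
# nothing ever calls it, because no wildcards exist at this point.
attempt_1 = "tmp/bar1/{sample}.bam".format(sample=get_samples)

# Attempt 2: the placeholder text is used as a literal dictionary key.
try:
    samples_dict["{{sample_set}}"]
    attempt_2 = "no error"
except KeyError as err:
    attempt_2 = f"KeyError: {err}"

# Attempt 3: doubled braces are an escape in str.format, so the result
# still contains an unfilled {sample_set} placeholder.
attempt_3 = "tmp/{{sample_set}}/{sample}.bam".format(sample="A1")

print(attempt_1)
print(attempt_2)
print(attempt_3)
```

In the third case the leftover {sample_set} reaches rule all as a wildcard, and since rule all has no output to infer it from, Snakemake reports that the wildcard cannot be determined. Formatting both names in one pass, as in the loop above, avoids all three problems.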

Answer 2

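@SultanOrazbayev has the right of it, but here are a couple of alternatives.

If you like those loops, the Pythonic way to write them is a list comprehension. With a huge list of files you might notice a performance improvement.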

list_files = [
    f"tmp/{key}/{nested_key}.bam"
    for key in d["foo"]
    for nested_key in d["foo"][key]
]
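
The comprehension assumes the dictionary d is already in memory. In a Snakefile the same structure would normally come from the JSON config. The sketch below assumes that setup and uses json.loads on an inline string so it runs on its own; CONFIG_TEXT and target_files are hypothetical names, and the JSON is a trimmed copy of the one in the question.

```python
import json

# Trimmed copy of the JSON from the question (illustrative only).
CONFIG_TEXT = """
{
    "foo": {
        "bar1": {"A1": {"name": "A1", "path": "/path/to/A1"},
                 "B1": {"name": "B1", "path": "/path/to/B1"}},
        "bar2": {"A2": {"name": "A2", "path": "/path/to/A2"}}
    }
}
"""


def target_files(config, pattern="tmp/{sample_set}/{sample}.bam"):
    """Return one path per (sample_set, sample) pair under config["foo"]."""
    return [
        pattern.format(sample_set=sample_set, sample=sample)
        for sample_set, samples in config["foo"].items()
        for sample in samples
    ]


# In a Snakefile, `config` would already be populated by the configfile directive.
config = json.loads(CONFIG_TEXT)
list_files = target_files(config)
print(*list_files, sep="\n")
```

The resulting list_files can then be given directly as the input of rule all.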

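I think the only way to use expand here is basically to build the same list. I pass the values in as dictionaries, which also keeps the wildcard names, although tuples would be more efficient. expand does have advantages if the file name pattern lives in a config variable and can't easily be formatted, if you want to keep meaningful wildcard names, or if you want to use allow_missing for other wildcards: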

wcs = [{'sample_set': sample_set, 'sample': sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
    ]


list_files = expand("tmp/{sample_set}/{sample}.bam", zip, 
        sample_set=[wc['sample_set'] for wc in wcs],
        sample=[wc['sample'] for wc in wcs],
        )
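
For the tuple variant mentioned above, here is a minimal sketch. It also emulates in plain Python what the two combinators do, to show why zip gives the grouped paths while the default (a Cartesian product, as far as I know) does not. This is an illustration under that assumption, not Snakemake's actual implementation, and the data is a trimmed stand-in for the question's config.

```python
from itertools import product

# Trimmed stand-in for the config in the question (illustrative only).
d = {"foo": {"bar1": {"A1": {}, "B1": {}}, "bar2": {"A2": {}, "B2": {}}}}

# (sample_set, sample) pairs stored as tuples instead of dictionaries.
pairs = [
    (sample_set, sample)
    for sample_set, samples in d["foo"].items()
    for sample in samples
]

# Unzip the pairs into two aligned sequences, one per wildcard.
sample_set_values, sample_values = zip(*pairs)

pattern = "tmp/{sample_set}/{sample}.bam"

# zip-style pairing: the i-th sample_set goes with the i-th sample.
zipped = [
    pattern.format(sample_set=ss, sample=s)
    for ss, s in zip(sample_set_values, sample_values)
]

# product-style pairing: every sample_set is combined with every sample,
# which produces unwanted paths such as tmp/bar1/A2.bam.
crossed = [
    pattern.format(sample_set=ss, sample=s)
    for ss, s in product(d["foo"], sample_values)
]

print(*zipped, sep="\n")
```

In the Snakefile itself, sample_set_values and sample_values could be passed to expand with zip in place of the two list comprehensions over wcs.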

Sometimes the Snakemake way is not Pythonic!
