How to define params for a snakemake rule with expand input

Question

I have input files in this format:

dataset1/file1.bam
dataset1/file1_rmd.bam
dataset1/file2.bam
dataset1/file2_rmd.bam

I want to run a command on each file and merge the results into a single csv file. The command returns an integer for the given file name.

$ samtools view -c -F1 dataset1/file1.bam
200

I want to run the command and merge the output for each file into the following csv file:

file1,200,100
file2,400,300

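I can do this without expand in the input by using the append operator >>, but to avoid the file corruption that appending might cause, I would like to use > instead.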

I tried something like this, but it does not work because of the wildcards.param2 part:

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
            in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        RCT=$(samtools view -c -F1  {input.in1})
        RCD=$(samtools view -c -F1  {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
        """

I know the input is no longer a single file but a list of files created by expand. So I defined a function to handle the list input, but it is still not quite right:

def get_read_count(infiles):
        return [ os.popen("samtools view -c -F1 "+infile).read() for infile in infiles ]

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
            in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    params: rc1=get_read_count("{param1}/{param2}.bam"), rc2=get_read_count("{param1}/{param2}_rmd.bam")
    shell:
        """
        printf "{wildcards.param2},{params.rc1},{params.rc2}\n" > {output}
        """

When the list of input files is defined with expand, what is the best practice for using wildcards in the input file IDs?

Edit: If I use an external bash script, e.g. script.sh, I can get the expected result with expand

for INF in "${@}";do
        IN1=${INF}
        IN2=${IN1%.bam}_rmd.bam
        LIB=$(basename ${IN1%.*}|cut -d_ -f1)
        RCT=$(samtools view -c -F1 ${IN1} )
        RCD=$(samtools view -c -F1 ${IN2} )
        printf "${LIB},${RCT},${RCD}\n"
done

and

        params: script="script.sh"
        shell:
                """
                bash {params.script} {input} > {output} 
                """

But I am interested in whether there is a simpler way to get the same output using only snakemake.

Edit 2:

I can also do it in the shell directive instead of a separate script,

rule collect_rc_results:
    input:
        in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
        in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        # Loop over the non-_rmd files only; the _rmd name is derived from each one
        for INF in {input.in1};do
            IN1=${{INF}}
            IN2=${{IN1%.bam}}_rmd.bam
            LIB=$(basename ${{IN1%.*}}|cut -d_ -f1)
            RCT=$(samtools view -c -F1 ${{IN1}} )
            RCD=$(samtools view -c -F1 ${{IN2}} )
            printf "${{LIB}},${{RCT}},${{RCD}}\n"
        done > {output}
        """

This produces the expected file. However, if anyone has a more elegant or "best practice" solution, I would love to hear it.

Answer 1

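I don't think there is anything wrong with your current solution, but I would lean toward using a run directive with the shell function to perform the loop.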

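@bli's suggestion of using temporary files is also good, especially if the intermediate step (samtools in this case) is long-running; you can get large wall-clock gains by parallelizing those computations. The downside is that you will create a lot of small files.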
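
As a rough sketch of the temporary-file idea: if a separate rule wrote each read count to its own small file, the merge step would reduce to reading those files back in order. The file naming below (<sample>.count and <sample>_rmd.count) and the merge_counts helper are assumptions for illustration, not something taken from the question.

```python
from pathlib import Path


def merge_counts(count_dir, samples, out_csv):
    """Merge per-sample count files into one CSV, one row per sample.

    Assumes (hypothetically) that an upstream step wrote
    <sample>.count and <sample>_rmd.count, each holding a single integer.
    """
    count_dir = Path(count_dir)
    rows = []
    for sample in samples:
        # strip() drops the trailing newline a command's output would carry
        total = (count_dir / f"{sample}.count").read_text().strip()
        dedup = (count_dir / f"{sample}_rmd.count").read_text().strip()
        rows.append(f"{sample},{total},{dedup}")
    # The output is written once in full, so no append operator is needed
    Path(out_csv).write_text("\n".join(rows) + "\n")
    return rows
```

In a workflow, the per-sample files could be marked as temporary so they are cleaned up after the merge, which softens the "lots of small files" downside.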

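I notice that your input is fully qualified by expand, but based on your example I think you want to keep param1 as a wildcard. Assuming PARS2 is a list, it should be safe to zip in1, in2, and PARS2 together. Here is my take (written but not tested).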

rule collect_rc_results:
    input: 
        in1=expand("{param1}/{param2}.bam", param2=PARS2, allow_missing=True),
        in2=expand("{param1}/{param2}_rmd.bam", param2=PARS2, allow_missing=True)
    output: "{param1}_merged.csv"
    run:
        with open(output[0], 'w') as outfile:
            for infile1, infile2, parameter in zip(input.in1, input.in2, PARS2):
                # read=True makes shell() return stdout as text; strip the trailing newline
                RCT = shell(f'samtools view -c -F1 {infile1}', read=True).strip()
                RCD = shell(f'samtools view -c -F1 {infile2}', read=True).strip()
                outfile.write(f'{parameter},{RCT},{RCD}\n')
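
The zip should be safe because expand over a single list yields paths in the same order as that list, so the i-th entries of in1, in2, and PARS2 all refer to the same sample. The snippet below mimics that pairing in plain Python so it can be checked without snakemake or BAM files. Here expand_one is a hypothetical stand-in for expand with param1 already resolved, and the fake_counts table is made-up data replacing samtools.

```python
def expand_one(pattern, param1, pars2):
    # Hypothetical stand-in for snakemake's expand(): one path per param2,
    # in the same order as pars2, with param1 already filled in.
    return [pattern.format(param1=param1, param2=p) for p in pars2]


def build_rows(param1, pars2, count):
    in1 = expand_one("{param1}/{param2}.bam", param1, pars2)
    in2 = expand_one("{param1}/{param2}_rmd.bam", param1, pars2)
    # All three lists share the order of pars2, so zip pairs each sample correctly
    return [f"{p},{count(a)},{count(b)}" for a, b, p in zip(in1, in2, pars2)]


# Made-up counts standing in for `samtools view -c -F1 <file>`
fake_counts = {
    "dataset1/file1.bam": 200,
    "dataset1/file1_rmd.bam": 100,
    "dataset1/file2.bam": 400,
    "dataset1/file2_rmd.bam": 300,
}
rows = build_rows("dataset1", ["file1", "file2"], fake_counts.__getitem__)
print("\n".join(rows))
```

With the sample data from the question this prints the two rows file1,200,100 and file2,400,300.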

Solution 1
1 2022-05-02 13:00:10
