
How to limit usage of disk space in Snakemake?

I work with 8 paired-end fastq files of 150 GB each, which need to be processed by a pipeline with space-demanding sub-tasks. I tried several options, but I am still running out of disk space:

  • used temp() to delete output files when they are no longer needed
  • used disk_mb resources to limit the number of parallel jobs

I use the following invocation to limit my disk-space usage to 500 GB, but apparently this is not guaranteed, and usage exceeds 500 GB. How can I limit disk usage to a fixed value to avoid running out of disk space?

snakemake --resources disk_mb=500000 --use-conda --cores 16  -p
rule merge:
  input:
    fw="{sample}_1.fq.gz",
    rv="{sample}_2.fq.gz",
  output:
    temp("{sample}.assembled.fastq")
  resources:
    disk_mb=100000
  threads: 16
  shell:
    """
    merger-tool -f {input.fw} -r {input.rv} -o {output}
    """


rule filter:
  input:
    "{sample}.assembled.fastq"
  output:
    temp("{sample}.assembled.filtered.fastq")
  resources:
    disk_mb=100000
  shell:
    """
    filter-tool {input} {output}
    """


rule mapping:
  input:
    "{sample}.assembled.filtered.fastq"
  output:
    "{sample}_mapping_table.txt"
  resources:
    disk_mb=100000
  shell:
    """
    mapping-tool {input} {output}
    """

Snakemake does not have the functionality to constrain resources; it can only schedule jobs in a way that respects resource constraints.
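To make the distinction concrete: `--resources disk_mb=500000` only caps the sum of the disk_mb claims of jobs running at the same moment; it does not track files already written. A minimal sketch of the scheduler's arithmetic (this is an illustration, not Snakemake's actual scheduler code; the numbers come from the question):

```python
# Sketch of how a global resource pool gates concurrency in Snakemake.
# A job is admitted only if its claim fits in what remains of the pool;
# temp() outputs already sitting on disk are NOT counted against it.
def max_concurrent_jobs(pool_mb: int, claim_mb: int) -> int:
    """How many jobs with identical disk_mb claims fit in the pool at once."""
    return pool_mb // claim_mb

# With the invocation from the question:
print(max_concurrent_jobs(500_000, 100_000))  # -> 5 jobs at once
```

So five 100 GB claims can be active simultaneously, and each finished merge releases its claim while its 150 GB+ temp output remains on disk until filter consumes it; that accumulation is what the pool cannot see.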

Now, snakemake uses resources to limit concurrent jobs, while your problem has a cumulative aspect to it. Taking a look at this answer, one way to resolve this is to introduce priority, so that downstream tasks have the highest priority.

In your particular file, it seems that adding priority to the mapping rule should be sufficient:

rule mapping:
    input:
        "{sample}.assembled.filtered.fastq"
    output:
        "{sample}_mapping_table.txt"
    resources:
        disk_mb=100_000
    priority: 100
    shell:
        """
        mapping-tool {input} {output}
        """

You might also want to be careful about how many merge jobs are launched initially (to avoid filling up the disk with results of merge before the downstream rules consume them).
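One way to do that is a user-defined resource that throttles merge jobs independently of disk_mb. Snakemake accepts arbitrary resource names both in rules and on the command line; this is a sketch, and `merge_slots` is an arbitrary name chosen here, not a built-in resource:

```
rule merge:
    input:
        fw="{sample}_1.fq.gz",
        rv="{sample}_2.fq.gz",
    output:
        temp("{sample}.assembled.fastq")
    resources:
        disk_mb=100_000,
        merge_slots=1  # user-defined resource, not built in
    threads: 16
    shell:
        """
        merger-tool -f {input.fw} -r {input.rv} -o {output}
        """
```

Invoked as, e.g., `snakemake --resources disk_mb=500000 merge_slots=2 --use-conda --cores 16 -p`, at most two merge jobs can hold a slot at once, so no more than two assembled temp files pile up before filtering and mapping (which the priority directive pushes forward) catch up.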
