
How to efficiently write to a large number of files

I am trying to write a program to split a large collection of gene sequences into many files based on the values inside a certain segment of each sequence. For example, the sequences might look like

AGCATGAGAG...
GATCAGGTAA...
GATGCGATAG...
... 100 million more

The goal is then to split the reads into individual files based on the sequence from position 2 to 7 (6 bases). So we get something like

AAAAAA.txt.gz
AAAAAC.txt.gz
AAAAAG.txt.gz
...4093 more (4^6 = 4096 files in total)

Naively, I have implemented a C++ program that

  • reads in each sequence
  • opens the relevant file
  • writes the sequence
  • closes the file

Something like

#include <zlib.h>
#include <string>

using std::string;

int main() {
    SeqFile seq_file("input.txt.gz");  // the poster's sequence reader class
    string read;

    while (!(read = seq_file.get_read()).empty()) {
        string tag = read.substr(1, 6);           // bases 2 to 7
        string output_path = tag + ".txt.gz";

        gzFile output = gzopen(output_path.c_str(), "a");  // append mode
        gzprintf(output, "%s\n", read.c_str());
        gzclose(output);
    }
    return 0;
}

This is unbearably slow compared to just writing the whole contents into a single file.

What is the bottleneck in this situation, and how might I improve performance given that I can't keep all the files open simultaneously due to system limits?

Since opening a file is slow, you need to reduce the number of files you open. One way to accomplish this is to make multiple passes over your input. Open a subset of your output files, make a pass over the input, and only write data to those files. When you're done, close all those files, reset the input, open a new subset, and repeat.
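A minimal sketch of this multi-pass scheme, with illustrative names: plain `std::ofstream` stands in for the gzip calls, and the input is modeled as an in-memory vector (a real program would re-read the input file on each pass instead):

```cpp
#include <algorithm>
#include <fstream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Split reads into per-tag files while never holding more than
// `max_open` output files at once, at the cost of re-reading the input.
void split_multi_pass(const std::vector<std::string>& reads, std::size_t max_open) {
    // First pass: collect the distinct tags so they can be batched.
    std::set<std::string> seen;
    for (const auto& r : reads) seen.insert(r.substr(1, 6));
    std::vector<std::string> tags(seen.begin(), seen.end());

    // One extra pass over the input per batch of `max_open` tags.
    for (std::size_t start = 0; start < tags.size(); start += max_open) {
        std::size_t end = std::min(start + max_open, tags.size());

        std::map<std::string, std::ofstream> out;
        for (std::size_t i = start; i < end; ++i)
            out[tags[i]].open(tags[i] + ".txt");

        // Write only the reads whose tag belongs to the current batch.
        for (const auto& r : reads) {
            auto it = out.find(r.substr(1, 6));
            if (it != out.end()) it->second << r << '\n';
        }
    }  // this batch's files close when `out` goes out of scope
}
```

With 4096 tags and, say, 512 files open per batch, this trades 8 sequential passes over the input for the elimination of all per-read open/close calls.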

The bottleneck is the opening and closing of the output file. If you can move this out of the loop somehow, e.g. by keeping multiple output files open simultaneously, your program should speed up significantly. In the best case it is possible to keep all 4096 files open at the same time, but if you hit a system limit, even keeping a smaller number of files open and making multiple passes through the input should be faster than opening and closing files in the tight loop.
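The best-case version of this, sketched with plain `std::ofstream` in place of gzip handles and hypothetical names (it assumes the process's open-file limit allows one stream per tag):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Keep every output file open for the whole run: each file is opened
// exactly once, the first time its tag is seen.
void split_keep_open(const std::vector<std::string>& reads) {
    std::unordered_map<std::string, std::ofstream> out;  // one stream per tag
    for (const auto& r : reads) {
        std::string tag = r.substr(1, 6);
        auto it = out.find(tag);
        if (it == out.end())
            it = out.emplace(tag, std::ofstream(tag + ".txt")).first;
        it->second << r << '\n';
    }
}  // all streams are flushed and closed here
```

On Linux the per-process limit can be checked with `ulimit -n`; 4096 open files is often within reach after raising the soft limit.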

The compression might be slowing the writing down; writing to plain text files and compressing them afterwards could be worth a try.

Opening the file is a bottleneck. Some of the data could be stored in a container, and when it reaches a certain size, the largest set could be written to the corresponding file.
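One way to sketch this buffering idea (class and threshold are illustrative, and plain `std::ofstream` in append mode stands in for the gzip calls):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Accumulate reads per tag in memory; once the total buffered size
// crosses a threshold, append the largest bucket to its file and drop it.
class BufferedSplitter {
    std::unordered_map<std::string, std::vector<std::string>> buf_;
    std::size_t buffered_ = 0;
    std::size_t limit_;

public:
    explicit BufferedSplitter(std::size_t limit) : limit_(limit) {}

    void add(const std::string& read) {
        buf_[read.substr(1, 6)].push_back(read);
        buffered_ += read.size();
        if (buffered_ >= limit_) flush_largest();
    }

    void flush_largest() {
        // Find the tag holding the most buffered data.
        auto best = buf_.begin();
        std::size_t best_size = 0;
        for (auto it = buf_.begin(); it != buf_.end(); ++it) {
            std::size_t s = 0;
            for (const auto& r : it->second) s += r.size();
            if (s > best_size) { best_size = s; best = it; }
        }
        if (best == buf_.end()) return;  // nothing buffered

        // One open/append/close, amortized over many reads.
        std::ofstream f(best->first + ".txt", std::ios::app);
        for (const auto& r : best->second) f << r << '\n';
        buffered_ -= best_size;
        buf_.erase(best);
    }

    void flush_all() { while (!buf_.empty()) flush_largest(); }
};
```

The key effect is that each file open now pays for a large batch of reads instead of a single one.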

I can't actually answer the question, because to do that I would need access to YOUR system (or a reasonably precise replica). The type of disk and how it is connected, the amount and type of memory, and the model/number of CPUs will all matter.

However, there are a few different things to consider that may well help (or at least tell you "you can't do better than this").

First, find out what takes up the time: CPU or disk I/O?

Use top, a system monitor, or some such tool to measure your application's CPU usage.

Write a simple program that writes a single value (zero?) to a file, without zipping it, producing a similar size to what you get in your files. Compare this to the time it takes to write your gzip file. If the time is about the same, then you are I/O-bound, and it probably doesn't matter much what you do.
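The uncompressed half of that baseline test could look like this (path and sizes are illustrative; time the gzip version the same way and compare):

```cpp
#include <chrono>
#include <fstream>
#include <string>

// Write `bytes` of zeros to `path` uncompressed and return the elapsed
// seconds, as a rough measure of raw disk throughput.
double time_plain_write(const std::string& path, std::size_t bytes) {
    std::string chunk(1 << 20, '0');  // 1 MiB buffer
    auto t0 = std::chrono::steady_clock::now();
    std::ofstream f(path, std::ios::binary);
    for (std::size_t written = 0; written < bytes; written += chunk.size())
        f.write(chunk.data(), chunk.size());
    f.flush();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Note that the OS page cache can make small writes look instantaneous; use a data size well beyond RAM, or compare relative times rather than absolute ones.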

If you have high CPU usage, you may want to split the writing work into multiple threads. You obviously can't really do that with the reading, as it has to be sequential (reading gzip in multiple threads is not easy, if possible at all, so let's not try that). Use one thread per CPU core: if you have 4 cores, use one to read and three to write. You may not get 4 times the performance, but you should get a good improvement.
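One possible shape for this, as a sketch rather than the answerer's actual design: the single reader hashes each read's tag to one of N writer threads, so every output file is touched by exactly one thread and the streams themselves need no locking. Names are illustrative, and plain `std::ofstream` again stands in for the compressing writer:

```cpp
#include <condition_variable>
#include <fstream>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// One work queue per writer thread.
struct WriterQueue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

// Each writer owns the files for the tags hashed to it.
void writer_loop(WriterQueue& wq) {
    std::unordered_map<std::string, std::ofstream> out;
    for (;;) {
        std::unique_lock<std::mutex> lk(wq.m);
        wq.cv.wait(lk, [&] { return !wq.q.empty() || wq.done; });
        if (wq.q.empty()) return;  // done and fully drained
        std::string r = std::move(wq.q.front());
        wq.q.pop();
        lk.unlock();

        std::string tag = r.substr(1, 6);
        auto it = out.find(tag);
        if (it == out.end())
            it = out.emplace(tag, std::ofstream(tag + ".txt", std::ios::app)).first;
        it->second << r << '\n';
    }
}

void split_threaded(const std::vector<std::string>& reads, unsigned n_writers) {
    std::vector<WriterQueue> queues(n_writers);
    std::vector<std::thread> writers;
    for (auto& wq : queues) writers.emplace_back(writer_loop, std::ref(wq));

    std::hash<std::string> h;
    for (const auto& r : reads) {  // the single sequential reader
        WriterQueue& wq = queues[h(r.substr(1, 6)) % n_writers];
        { std::lock_guard<std::mutex> lk(wq.m); wq.q.push(r); }
        wq.cv.notify_one();
    }
    for (auto& wq : queues) {  // signal completion and wait
        { std::lock_guard<std::mutex> lk(wq.m); wq.done = true; }
        wq.cv.notify_one();
    }
    for (auto& t : writers) t.join();
}
```

Hashing by tag keeps all writes for a given file on one thread, which also preserves the input order within each file.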

Quite certainly, at some point you will be bound by the speed of the disk. Then the only option is to buy a better disk (if you haven't already got one)!
