
Writing a single large data file, or multiple smaller files: Which is faster?

I am developing a C++ program that writes a large amount of data to disk. The following function gzips the data and writes it out to a file. The compressed data is on the order of 100 GB. The function to compress and write out the data is as follows:

 void constructSNVFastqData(string const& fname) {
   ofstream fastq_gz(fname.c_str());
   stringstream ss;
   for (int64_t i = 0; i < snvId->size(); i++) {
     consensus_pair &cns_pair = snvId->getPair(i);
     string qual(cns_pair.non_mutated.size(), '!');
     ss << "@" + cns_pair.mutated + "[" + to_string(cns_pair.left_ohang) +
           ";" + to_string(cns_pair.right_ohang) + "]\n" +
           cns_pair.non_mutated + "\n+\n" + qual + "\n";
   }
   boost::iostreams::filtering_streambuf<boost::iostreams::input> out;
   out.push(boost::iostreams::gzip_compressor());
   out.push(ss);
   boost::iostreams::copy(out, fastq_gz);
   fastq_gz.close();
 }

The function writes the data to a string stream, which I then write out to a file (fastq_gz) using boost's filtering_streambuf. The file is not a log file. After the file has been written it will be read by a child process. The file does not need to be viewed by humans.

Currently, I am writing the data out to a single, large file (fastq_gz). This is taking a while, and the file system, according to our system manager, is very busy. I wonder whether, instead of writing out a single large file, I should write out a number of smaller files. Would this approach be faster, or reduce the load on the file system?

Please note that it is not the compression that is slow; I have benchmarked it.

I am running on a Linux system and do not need to consider generalising the implementation to a Windows filesystem.

So what your code is probably doing is (a) generating your file into memory swap space, (b) loading it back from swap space and compressing on the fly, and (c) writing the compressed data to the output file as you get it.

(b) and (c) are great; (a) is going to kill you. It is two round trips of the uncompressed data, one of which competes with your output file generation.

I cannot find one in Boost.Iostreams, but you need an istream (source) or a device that gets data from you on demand. Someone must have written one (it seems so useful), but I don't see it after 5 minutes of looking at the Boost.Iostreams docs.
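For what it's worth, writing such a device by hand is not much code. Below is a minimal sketch, not tested against the asker's codebase: GeneratorSource is a hypothetical name, and it models Boost.Iostreams' Source concept (a char_type typedef, a source_tag category, and a read(char*, std::streamsize) member). It hands the compressor one record at a time, so nothing close to the full 100 GB is ever buffered:

 #include <boost/iostreams/categories.hpp>
 #include <boost/iostreams/copy.hpp>
 #include <boost/iostreams/filter/gzip.hpp>
 #include <boost/iostreams/filtering_streambuf.hpp>
 #include <algorithm>
 #include <cstring>
 #include <fstream>
 #include <functional>
 #include <string>

 // On-demand Source: `next` fills `rec` with the next record and
 // returns false when there is nothing left to produce.
 class GeneratorSource {
  public:
   typedef char char_type;
   typedef boost::iostreams::source_tag category;

   explicit GeneratorSource(std::function<bool(std::string&)> next)
       : next_(next), pos_(0) {}

   std::streamsize read(char* s, std::streamsize n) {
     std::streamsize written = 0;
     while (written < n) {
       if (pos_ == buf_.size()) {            // current record fully consumed
         buf_.clear();
         pos_ = 0;
         if (!next_(buf_) || buf_.empty()) break;
       }
       std::streamsize chunk = std::min<std::streamsize>(
           n - written, static_cast<std::streamsize>(buf_.size() - pos_));
       std::memcpy(s + written, buf_.data() + pos_,
                   static_cast<std::size_t>(chunk));
       pos_ += static_cast<std::size_t>(chunk);
       written += chunk;
     }
     return written == 0 ? -1 : written;     // -1 signals end of stream
   }

  private:
   std::function<bool(std::string&)> next_;
   std::string buf_;   // holds one record at a time, never the whole data set
   std::size_t pos_;
 };

The original function would then build each record inside the callback instead of in the stringstream (assuming snvId, getPair, consensus_pair, and the question's using-declarations behave as shown there):

 void constructSNVFastqData(string const& fname) {
   ofstream fastq_gz(fname.c_str(), ios_base::out | ios_base::binary);
   int64_t i = 0;
   GeneratorSource src([&](string& rec) {
     if (i >= snvId->size()) return false;   // no more pairs: end of stream
     consensus_pair& cns_pair = snvId->getPair(i++);
     string qual(cns_pair.non_mutated.size(), '!');
     rec = "@" + cns_pair.mutated + "[" + to_string(cns_pair.left_ohang) +
           ";" + to_string(cns_pair.right_ohang) + "]\n" +
           cns_pair.non_mutated + "\n+\n" + qual + "\n";
     return true;
   });
   boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
   in.push(boost::iostreams::gzip_compressor());
   in.push(src);                             // source goes last in the chain
   boost::iostreams::copy(in, fastq_gz);
 }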

0.) Devise an algorithm to divide the data into multiple files so that it can be recombined later.
1.) Write the data to multiple files on separate threads in the background, perhaps with shared threads (maybe start n = 10 threads at a time or so); see the sketch after this list.
2.) Query the future attribute of the shared objects to check whether writing is done (size > 1 GB).
3.) Once that is the case, recombine the data when it is queried by the child process.
4.) I would recommend starting a new file after every 1 GB.
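A minimal sketch of steps 1 and 2, assuming each roughly 1 GB chunk of records fits comfortably in memory: writeChunk, writeParts, and the ".partN.gz" naming are hypothetical, and std::async futures stand in for the "future attribute of the shared objects":

 #include <boost/iostreams/copy.hpp>
 #include <boost/iostreams/filter/gzip.hpp>
 #include <boost/iostreams/filtering_streambuf.hpp>
 #include <fstream>
 #include <future>
 #include <sstream>
 #include <string>
 #include <vector>

 // Compress one chunk of records and write it to its own part file.
 static void writeChunk(std::string fname, std::string data) {
   std::ofstream out(fname.c_str(), std::ios_base::out | std::ios_base::binary);
   std::stringstream ss(data);
   boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
   in.push(boost::iostreams::gzip_compressor());
   in.push(ss);
   boost::iostreams::copy(in, out);
 }

 void writeParts(const std::vector<std::string>& chunks, const std::string& stem) {
   std::vector<std::future<void>> pending;
   // A real version would cap the number of in-flight tasks (e.g. n = 10).
   for (std::size_t i = 0; i < chunks.size(); ++i) {
     std::string part = stem + ".part" + std::to_string(i) + ".gz";
     pending.push_back(std::async(std::launch::async, writeChunk, part, chunks[i]));
   }
   for (auto& f : pending) f.get();  // block until every part is written
 }

Recombination is cheap here because concatenated gzip members form a valid gzip stream, so the child process can either read the parts in order or the parts can be joined with a plain `cat stem.part*.gz`.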
