
Writing a single large data file, or multiple smaller files: Which is faster?

I am developing a C++ program that writes a large amount of data to disk. The compressed data is on the order of 100 GB. The function that gzips the data and writes it out to a file is as follows:

void constructSNVFastqData(string const& fname) {
  ofstream fastq_gz(fname.c_str());
  stringstream ss;
  for (int64_t i = 0; i < snvId->size(); i++) {
    consensus_pair &cns_pair = snvId->getPair(i);
    string qual(cns_pair.non_mutated.size(), '!');
    ss << "@" + cns_pair.mutated + "[" + to_string(cns_pair.left_ohang) +
          ";" + to_string(cns_pair.right_ohang) + "]\n"
          + cns_pair.non_mutated + "\n+\n" + qual + "\n";
  }
  boost::iostreams::filtering_streambuf<boost::iostreams::input> out;
  out.push(boost::iostreams::gzip_compressor());
  out.push(ss);
  boost::iostreams::copy(out, fastq_gz);
  fastq_gz.close();
}

The function writes the data to a string stream, which I then write out to a file (fastq_gz) using Boost's filtering_streambuf. The file is not a log file; after it has been written it will be read by a child process, and it does not need to be viewed by humans.

Currently, I am writing the data out to a single, large file (fastq_gz). This is taking a while, and the file system - according to our system manager - is very busy. I wonder whether, instead of writing out a single large file, I should write out a number of smaller files. Would this approach be faster, or would it reduce the load on the file system?

Please note that it is not the compression that is slow - I have benchmarked it.

I am running on a Linux system and do not need to consider generalising the implementation to a Windows filesystem.

So what your code is probably doing is (a) generating the entire file contents in memory, pushing them into swap space, (b) reading them back from swap and compressing on the fly, (c) writing the compressed data to the output file as it is produced.

(b) and (c) are fine; (a) is going to kill you. It means two round trips of the uncompressed data through memory, one of which happens while competing with the generation of your output file.

I cannot find one in Boost.Iostreams, but what you need is an istream (a source) or a device that pulls data from you on demand. Someone must have written one (it seems so useful), but I don't see it after five minutes of looking through the Boost.Iostreams docs.
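
A minimal sketch of one way to get that on-demand behaviour without a hand-written source, assuming the snvId and consensus_pair types from the question: drop the stringstream and push each record straight into a boost::iostreams::filtering_ostream that has a gzip_compressor in front of the output file. This is the push-based equivalent of the pull-based source described above, not necessarily what the asker ended up with.

#include <cstdint>
#include <fstream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

// Sketch only: snvId, consensus_pair and getPair() are assumed to be the
// same types used in the question; they are not defined here.
void constructSNVFastqDataStreaming(std::string const& fname) {
  std::ofstream fastq_gz(fname.c_str(),
                         std::ios_base::out | std::ios_base::binary);

  boost::iostreams::filtering_ostream out;
  out.push(boost::iostreams::gzip_compressor());
  out.push(fastq_gz);   // compressed bytes go straight to the file

  for (int64_t i = 0; i < snvId->size(); i++) {
    consensus_pair &cns_pair = snvId->getPair(i);
    std::string qual(cns_pair.non_mutated.size(), '!');
    // Each record is compressed and written as it is generated, so the
    // full uncompressed data set is never held in memory.
    out << "@" << cns_pair.mutated
        << "[" << cns_pair.left_ohang << ";" << cns_pair.right_ohang << "]\n"
        << cns_pair.non_mutated << "\n+\n" << qual << "\n";
  }
  // The filtering_ostream flushes the gzip stream when it is destroyed,
  // after which the ofstream closes.
}

With this shape only a small internal buffer of uncompressed data is in flight at any moment, which removes the swap traffic described in (a).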

0.) Devise a scheme for dividing the data into multiple files so that it can be recombined later.
1.) Write the data to the files on separate background threads, perhaps from a shared pool of threads (starting around n = 10 threads at a time or so); see the sketch below.
2.) Query the futures of those tasks to check whether writing is done (size > 1 GB).
3.) Once it is, recombine the data when it is queried by the child process.
4.) I would recommend starting a new file after every 1 GB.
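
A minimal sketch of that scheme, with hypothetical names (makeChunk() stands in for whatever produces one slice of the data; it is not from the question): each part file is written by a std::async task, and the futures are used to find out when writing has finished. For brevity it launches all tasks at once rather than in batches of ten.

#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Placeholder generator for chunk `index`; in the real program this would
// build roughly 1 GB of records for that slice of the data.
std::string makeChunk(std::size_t index) {
  return "chunk " + std::to_string(index) + "\n";
}

void writeChunksAsync(std::string const& prefix, std::size_t num_chunks) {
  std::vector<std::future<void>> pending;
  for (std::size_t i = 0; i < num_chunks; ++i) {
    // Step 1: each part file is written by a background task.
    pending.push_back(std::async(std::launch::async, [prefix, i]() {
      std::ofstream out(prefix + "." + std::to_string(i), std::ios_base::binary);
      out << makeChunk(i);
    }));
  }
  // Step 2: query the futures to find out when writing is done.
  for (auto& f : pending)
    f.get();   // blocks until that part file is complete
  // Steps 3-4: the part files prefix.0, prefix.1, ... can now be
  // recombined (e.g. concatenated in index order) by the child process.
}

Calling writeChunksAsync("snv_fastq.part", 100) would produce snv_fastq.part.0 through snv_fastq.part.99. Whether this is actually faster than one sequential stream depends on the file system; on a single disk the extra seeking can outweigh the parallelism.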
