
Fastest and most efficient way of parsing raw data from a file

I'm working on a project and I'm wondering which way is the most efficient to read a huge amount of data off a file (I'm speaking of files from 100 lines up to roughly 3 billion lines, possibly more). Once read, the data will be stored in a structured data set (a vector<entry>, where "entry" represents one structured line).

A structured line of this file may look like: string int int int string string. Each line is TAB-delimited and ends with the appropriate platform EOL.

What I wish to accomplish is :

  1. Read the file into memory (std::string or std::vector<char>)
  2. Read raw data from my buffer and format it into my data set.

I need to keep the memory footprint in mind and maintain a fast parsing rate. I'm already avoiding stringstream, as it seems too slow.
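For illustration, here is a minimal sketch of what a stringstream-free line parser might look like, assuming a hypothetical entry struct matching the string int int int string string layout and using std::from_chars (C++17) for the integer fields; the field names are made up for the example:

#include <charconv>   // std::from_chars
#include <string>
#include <string_view>
#include <vector>

// Hypothetical layout: string int int int string string, TAB-delimited.
struct entry {
    std::string a;
    int x, y, z;
    std::string b, c;
};

// Split one TAB-delimited line into fields without any stream machinery.
inline std::vector<std::string_view> split_tabs(std::string_view line) {
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t tab = line.find('\t', start);
        fields.push_back(line.substr(start, tab - start));
        if (tab == std::string_view::npos) break;
        start = tab + 1;
    }
    return fields;
}

inline int to_int(std::string_view s) {
    int value = 0;
    std::from_chars(s.data(), s.data() + s.size(), value);  // no locale, no allocation
    return value;
}

// Parse one well-formed line into an entry.
inline entry parse_line(std::string_view line) {
    auto f = split_tabs(line);
    return entry{ std::string(f[0]), to_int(f[1]), to_int(f[2]), to_int(f[3]),
                  std::string(f[4]), std::string(f[5]) };
}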

I'm also avoiding multiple I/O calls to my file by using:

// open the stream
std::ifstream is(filename);

// determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size = is.tellg();
is.seekg(0, std::ios_base::beg);

// "out" can be a std::string or vector<char>
out.reserve(size / sizeof (char));
out.resize(size / sizeof (char), 0);

// load the data
is.read((char *) &out[0], size);

// close the file
is.close();

I've thought of taking this huge std::string and looping over it line by line, extracting the line information (string and integer parts) into a row of my data set. Is there a better way of doing this?
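For what it's worth, a minimal sketch of that line-by-line pass might look like this (reusing the hypothetical parse_line helper from the earlier sketch; the buffer is assumed to already hold the whole file):

#include <string_view>
#include <vector>

// Walk the in-memory buffer, cutting it into lines at '\n' and parsing each one.
std::vector<entry> parse_buffer(std::string_view buffer) {
    std::vector<entry> rows;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t eol = buffer.find('\n', start);
        std::string_view line = buffer.substr(start, eol - start);
        if (!line.empty() && line.back() == '\r')   // tolerate Windows EOL
            line.remove_suffix(1);
        if (!line.empty())
            rows.push_back(parse_line(line));
        if (eol == std::string_view::npos) break;
        start = eol + 1;
    }
    return rows;
}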

EDIT: This application may run on a 32-bit or 64-bit computer, or on a supercomputer for bigger files.

Any suggestions are very welcome.

Thank you

Some random thoughts:

  • Use vector::resize() at the beginning (you did that)
  • Read large blocks of file data at a time, at least 4k, better still 256k. Read them into a memory buffer and parse that buffer into your vector (see the sketch after this list).
  • Don't read the whole file at once; this might needlessly lead to swapping.
  • sizeof(char) is always 1 :)
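A minimal sketch of that block-wise approach, assuming a 256 KiB chunk size and carrying any partial trailing line over to the next chunk (parse_line is the same hypothetical helper as in the sketches above):

#include <fstream>
#include <string>
#include <string_view>
#include <vector>

// Read the file in fixed-size chunks, parse complete lines from each chunk,
// and carry the trailing partial line over into the next read.
std::vector<entry> parse_file_in_blocks(const std::string& filename) {
    constexpr std::size_t kChunk = 256 * 1024;    // 256 KiB per read
    std::vector<entry> rows;
    std::ifstream is(filename, std::ios::binary);
    std::string carry;                            // unfinished line from previous chunk
    std::vector<char> block(kChunk);

    while (is) {
        is.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::size_t got = static_cast<std::size_t>(is.gcount());
        if (got == 0) break;

        carry.append(block.data(), got);
        std::size_t start = 0, eol;
        while ((eol = carry.find('\n', start)) != std::string::npos) {
            std::string_view line(carry.data() + start, eol - start);
            if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
            if (!line.empty()) rows.push_back(parse_line(line));
            start = eol + 1;
        }
        carry.erase(0, start);                    // keep only the partial last line
    }
    if (!carry.empty()) rows.push_back(parse_line(carry));  // final line without EOL
    return rows;
}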

While I cannot speak for supercomputers, with 3 billion lines you will not get anywhere holding everything in memory on a desktop machine.

I think you should first try to figure out all the operations on that data. Try to design all algorithms to operate sequentially; if you need random access, you will be swapping all the time. This algorithm design will have a big impact on your data model.

So do not start with reading all the data just because that is the easy part; design the whole system with a clear view of what data is in memory during the whole processing.


Update
If you do all the processing in a single run over the stream and separate the data processing into stages (read - preprocess - ... - write), you can utilize multithreading effectively.
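A minimal sketch of such a staged pipeline, under the assumption of one reader thread and one parser thread handing raw chunks over through a simple mutex-protected queue (the ChunkQueue type and run_pipeline function are made up for the example):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// Hand-rolled queue: the reader pushes raw chunks, the parser pops them.
class ChunkQueue {
public:
    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;      // closed and drained
        std::string chunk = std::move(q_.front());
        q_.pop();
        return chunk;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

void run_pipeline(const std::string& filename) {
    ChunkQueue queue;

    std::thread reader([&] {                      // stage 1: read
        std::ifstream is(filename, std::ios::binary);
        std::string chunk(256 * 1024, '\0');
        while (is.read(&chunk[0], static_cast<std::streamsize>(chunk.size())) || is.gcount() > 0)
            queue.push(chunk.substr(0, static_cast<std::size_t>(is.gcount())));
        queue.close();
    });

    std::thread parser([&] {                      // stage 2: parse / preprocess
        while (auto chunk = queue.pop()) {
            // parse *chunk here (e.g. with the block-wise logic shown earlier),
            // remembering to carry partial lines between chunks
        }
    });

    reader.join();
    parser.join();
}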


Finally:

  • Whatever you want to do in a loop over the data, try to keep the number of loops to a minimum; averaging, for example, you can certainly do in the read loop.
  • Immediately make up a test file of the worst-case size you expect, and time the two different approaches below:

time
loop
    read line from disk
time
loop
    process line (counting words per line)
time
loop
    write data (word count) from line to disk
time

versus.

time
loop
    read line from disk
    process line (counting words per line)
    write data (word count) from line to disk
time

If you already have the algorithms, use yours; otherwise make one up (like counting words per line). If the write stage does not apply to your problem, skip it. This test takes less than an hour to write but can save you a lot.
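A minimal sketch of such a timing test in C++, using std::chrono and the word-counting placeholder suggested above (file names and the count_words helper are made up for the example):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using clock_type = std::chrono::steady_clock;

static long count_words(const std::string& line) {   // placeholder "process" stage
    long n = 0;
    bool in_word = false;
    for (char c : line) {
        bool space = (c == ' ' || c == '\t');
        if (!space && !in_word) ++n;
        in_word = !space;
    }
    return n;
}

int main(int argc, char** argv) {
    const std::string in_name = argc > 1 ? argv[1] : "test.txt";

    // Approach A: three separate timed loops (read all, process all, write all).
    auto t0 = clock_type::now();
    std::vector<std::string> lines;
    { std::ifstream is(in_name); for (std::string l; std::getline(is, l); ) lines.push_back(l); }
    auto t1 = clock_type::now();
    std::vector<long> counts;
    for (const auto& l : lines) counts.push_back(count_words(l));
    auto t2 = clock_type::now();
    { std::ofstream os("counts_a.txt"); for (long c : counts) os << c << '\n'; }
    auto t3 = clock_type::now();

    // Approach B: one fused loop (read, process, write per line).
    auto t4 = clock_type::now();
    { std::ifstream is(in_name); std::ofstream os("counts_b.txt");
      for (std::string l; std::getline(is, l); ) os << count_words(l) << '\n'; }
    auto t5 = clock_type::now();

    using ms = std::chrono::milliseconds;
    std::cout << "A: read "     << std::chrono::duration_cast<ms>(t1 - t0).count()
              << "ms, process " << std::chrono::duration_cast<ms>(t2 - t1).count()
              << "ms, write "   << std::chrono::duration_cast<ms>(t3 - t2).count() << "ms\n";
    std::cout << "B: fused "    << std::chrono::duration_cast<ms>(t5 - t4).count() << "ms\n";
}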
