Parsing a mmap()-ed file

Question

What would be the best (fastest) way to parse an mmap-ed file? It contains pairs of data (string int), but I cannot presume the number of spaces/tabs/newlines between them.

Answer 1

Assuming you've mmapped the whole file in (rather than in chunks, as that would make life awfully complicated), I'd do something like the following...

#include <sstream>
#include <string>

// Effectively this wraps the mmapped block.
// data: char* to the start of the mmapped block, size: its length in bytes.
std::istringstream str;
str.rdbuf()->pubsetbuf(data, size);

std::string sv;
int iv;  // the second field of each pair is an integer

while (str >> sv >> iv)
{
  // do stuff...
}

I think that should work...

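WARNING: What pubsetbuf does on a string stream's buffer is implementation-defined (some standard libraries simply ignore the call), so this may not work everywhere; see this answer for an altogether better approach.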
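One portable way to get the same effect without relying on pubsetbuf is a small std::streambuf subclass that points its get area at the mapped block. This is only a sketch and not necessarily what the linked answer describes; membuf and parse_pairs are names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Read-only stream buffer over existing memory; nothing is copied.
class membuf : public std::streambuf {
public:
    membuf(const char* begin, std::size_t size) {
        assert(begin != nullptr || size == 0);
        // setg() wants char*, but the get area is only ever read from,
        // so casting away const does not lead to writes into the mapping.
        char* p = const_cast<char*>(begin);
        setg(p, p, p + size);
    }
};

// Hypothetical helper: extract every (string, int) pair from the block.
std::vector<std::pair<std::string, int>> parse_pairs(const char* data,
                                                     std::size_t size) {
    membuf buf(data, size);
    std::istream in(&buf);
    std::vector<std::pair<std::string, int>> out;
    std::string key;
    int value;
    // operator>> skips any run of spaces, tabs and newlines between fields.
    while (in >> key >> value) out.emplace_back(key, value);
    return out;
}
```

You would call it as parse_pairs(ptr, len) with the pointer and length of the mapping. The loop stops at the first pair that fails to parse, just like the snippet above.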

Answer 2

If by best/fastest you mean easiest to code, then this is one of those rare occasions where the deprecated std::istrstream fits the bill perfectly: call the istrstream::istrstream(char const*, std::streamsize) constructor overload, then extract the data from the stream as you would from any other std::istream. (This won't duplicate the underlying memory like std::istringstream will.)

If by best/fastest you mean best/fastest runtime performance, I don't think you'll be able to beat boost.spirit.qi or a handwritten parser, though the former would be much easier to write and maintain in my opinion (library learning curve aside, if you've never used boost.spirit before).

Answer 3

Parsing string/integer pairs (i.e. foo 50 bar 20 baz 123) separated by whitespace should be lightning fast either way. By far the more important factors are whether
a) the pages are actually in RAM, which mmap alone does not guarantee
b) the cache lines are in the L1 cache

While mmap does already read ahead by default on sequential access, disk access is in the tens of milliseconds, and parsing over a 4k page of memory is (ideally) in the tens of microseconds.
So, you cannot expect the prefetcher to keep pace, especially since it will only prefetch whenever it looks like you will need more (which, even assuming seek time is zero, practically guarantees an upfront cost due to rotational delay on a mechanical disk).
Therefore, unless your total data is only a dozen kilobytes (in which case the question about how to do it as fast as possible would be pointless anyway), it makes sense to call madvise(MADV_WILLNEED) before you start your scan, so the operating system won't wait to see its heuristics triggered by your access pattern, but reads in sequentially what it can without pause. Sequential disk bandwidth is huge once you're past the access time. You will still probably catch up, but much later. If your dataset is large enough that it will probably not fit into RAM, calling madvise(MADV_DONTNEED) every now and then on data you've already seen is a good idea.
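To make the madvise advice concrete, here is a rough POSIX-only sketch. Mapping, map_for_scan, release_consumed and unmap are hypothetical names invented for this example, and error handling is kept to a minimum.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// A read-only mapping of a whole file; data is nullptr on failure.
struct Mapping {
    const char* data;
    std::size_t size;
};

// Map `path` read-only and ask the kernel to start reading it in right away.
Mapping map_for_scan(const char* path) {
    Mapping m = {nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const char*>(p);
            m.size = len;
            // Advisory only: a failure here does not affect correctness.
            (void)madvise(p, len, MADV_WILLNEED);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return m;
}

// Tell the kernel that the first `consumed` bytes are no longer needed.
// madvise works on whole pages, so only complete pages are released.
void release_consumed(const Mapping& m, std::size_t consumed) {
    assert(consumed <= m.size);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t whole = consumed - consumed % page;
    if (whole > 0)
        (void)madvise(const_cast<char*>(m.data), whole, MADV_DONTNEED);
}

void unmap(Mapping& m) {
    if (m.data != nullptr) munmap(const_cast<char*>(m.data), m.size);
    m.data = nullptr;
    m.size = 0;
}
```

For a read-only file mapping like this one, pages dropped with MADV_DONTNEED are simply read from the file again if you touch them later, so calling release_consumed too eagerly costs time but not correctness.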

What is true for page faults is also true for cache misses. A load that hits the L1 cache costs a few cycles; a load from main memory costs something around 200-500 cycles.
CPUs have automatic prefetching for sequential access patterns, but it is limited.
First, prefetching never occurs across a page boundary, because otherwise automatic prefetching would regularly trigger page faults, which would be very unwelcome.
Second, prefetching happens only after two consecutive misses. This ensures that prefetching only kicks in when it probably makes sense. Prefetching the adjacent cache lines for every random read would be stupid, as it would needlessly evict valuable cache lines.
Third, prefetching takes time, and once the heuristics in the CPU trigger, you're already racing it for the data, so sooner is better than later.
Luckily, you know what data you will be wanting, and you know it a long time ahead. Therefore, you can give prefetch hints, which will give the CPU a valuable head start (prefetch e.g. half a kilobyte ahead).
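As an illustration of such hints, the sketch below counts whitespace-separated tokens while prefetching 512 bytes ahead of the cursor. It assumes GCC or Clang (__builtin_prefetch is a compiler builtin, not standard C++) and a 64-byte cache line; count_tokens is a made-up name, and whether the hint helps at all is something you would have to measure on your own hardware.

```cpp
#include <cassert>
#include <cstddef>

// Count whitespace-separated tokens, hinting the CPU about data further ahead.
std::size_t count_tokens(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    const std::size_t kLine = 64;    // assumed cache line size
    const std::size_t kAhead = 512;  // "half a kilobyte ahead"
    std::size_t tokens = 0;
    bool in_token = false;
    for (std::size_t i = 0; i < size; ++i) {
        // One hint per cache line, and never for an address past the end.
        if (i % kLine == 0 && i + kAhead < size)
            __builtin_prefetch(data + i + kAhead, 0 /* read */,
                               0 /* data is used once */);
        const char c = data[i];
        const bool space = c == ' ' || c == '\t' || c == '\n' || c == '\r';
        if (!space && !in_token) ++tokens;  // first character of a new token
        in_token = !space;
    }
    return tokens;
}
```

The i % kLine check lines up with real cache lines only if the buffer itself is cache-line aligned, which holds for the page-aligned pointer that mmap returns.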

Answer 4

As it stands, your question is too vague to be answered.

Nonetheless, if all you need to do is to get some data out of the file, what you don't want to do is to use a method that would modify memory in the mmap-ed region.

Edit: It's much clearer now that you've edited the question. As a starting point, I'd use a single char pointer to iterate over the entire mmap-ed file. Extracting strings is very straightforward (the exact method depends on what you need to do with the result) and the integers can be extracted with atoi et al.
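A minimal sketch of the single-pointer approach might look like the following. Note that it swaps atoi for hand-rolled digit parsing: atoi and strtol expect NUL-terminated text, which a mapped file does not guarantee, so the sketch checks against the end pointer instead. scan_pairs is a hypothetical name, and overflow checking is deliberately left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

static bool is_space(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

static bool is_digit(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Walk the block once with a single pointer and collect (string, int) pairs.
std::vector<std::pair<std::string, int>> scan_pairs(const char* data,
                                                    std::size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<std::pair<std::string, int>> out;
    const char* p = data;
    const char* const end = data + size;
    while (true) {
        while (p != end && is_space(*p)) ++p;   // skip separators
        if (p == end) break;
        const char* name = p;                   // the string field
        while (p != end && !is_space(*p)) ++p;
        std::string key(name, p);
        while (p != end && is_space(*p)) ++p;   // skip separators
        bool negative = false;                  // optional sign
        if (p != end && (*p == '-' || *p == '+')) {
            negative = (*p == '-');
            ++p;
        }
        if (p == end || !is_digit(*p)) break;   // malformed or truncated pair
        int value = 0;                          // no overflow check (sketch)
        while (p != end && is_digit(*p)) {
            value = value * 10 + (*p - '0');
            ++p;
        }
        out.emplace_back(key, negative ? -value : value);
    }
    return out;
}
```

Copying each key into a std::string is just the simplest option here; if you only need to look at the text, keeping the begin/end pointers into the mapping avoids the copy.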

Answer 5

You could access it via a std::string and use std::istringstream to read from it sequentially. Or use some more convenient library, e.g. in Qt you could use a QTextStream on a QByteArray constructed from the mmapped memory.

5 answers

solution1
4 ACCEPTED 2011-03-01 12:00:06

solution2
2 2011-03-01 11:47:39

solution3
2 2011-03-01 13:05:29

solution4
0 2011-03-01 11:35:47

solution5
0 2011-03-01 11:37:45