
High performance reading - linux/pthreads

I have a moderately large binary file consisting of independent blocks like this:

header1
data1
header2
data2
header3
data3
...

The number of blocks, the size of each block and the total size of the file vary quite a lot, but typical numbers are ~1000 blocks with an average block size of 100 kB. The files are generated by an external application which I have no control over, but I want to read them as fast as possible. In many cases I am only interested in a fraction (e.g. 10%) of the blocks, and this is the case I want to optimize for.

My current implementation is like this:

  1. Open the file and read all the headers - using information in the header to fseek() to the next header location; retain an open FILE * pointer.
  2. When data is requested use fseek() to locate the data block, read all the data and return it.
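For concreteness, the two steps above might look roughly like the sketch below. The header layout (`block_header`), the `index_entry` type and the function names `build_index` and `read_block` are all hypothetical, made up for illustration; the real file format is different.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* HYPOTHETICAL header layout: the real format is not shown here. */
typedef struct {
    uint32_t id;        /* block identifier */
    uint32_t data_len;  /* number of data bytes following the header */
} block_header;

typedef struct {
    uint32_t id;
    uint32_t data_len;
    long     data_off;  /* file offset of the first data byte */
} index_entry;

/* Step 1: read only the headers, skipping each data block with fseek().
   Returns the number of entries stored in idx (at most max_entries). */
static size_t build_index(FILE *fp, index_entry *idx, size_t max_entries)
{
    size_t n = 0;
    block_header h;

    assert(fp != NULL && idx != NULL);
    rewind(fp);
    while (n < max_entries && fread(&h, sizeof h, 1, fp) == 1) {
        long off = ftell(fp);
        if (off < 0 || fseek(fp, (long)h.data_len, SEEK_CUR) != 0)
            break;
        idx[n].id = h.id;
        idx[n].data_len = h.data_len;
        idx[n].data_off = off;
        n++;
    }
    return n;
}

/* Step 2: fetch one block's data on demand. The caller frees the result. */
static void *read_block(FILE *fp, const index_entry *e)
{
    void *buf = malloc(e->data_len ? e->data_len : 1);
    if (buf == NULL)
        return NULL;
    if (fseek(fp, e->data_off, SEEK_SET) != 0 ||
        fread(buf, 1, e->data_len, fp) != e->data_len) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Note that fseek() past the end of the file does not fail, so this sketch does not detect a truncated final block until read_block() is called on it.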

This works fine, but I was wondering whether it might be possible to speed things up using e.g. aio, mmap or other techniques I have only heard of.

Any thoughts?

Joakim

The speed difference between mmap and read is not that big, since both need to read the data from disk. The biggest advantage of mmap is that it avoids double buffering.

If you are only interested in 10% of the contents, your biggest saving will be to not read the other 90%. This can be done by reading only the headers, and seeking either to the next header or to the data block you want. But it all depends on the file format, which the OP did not show in detail.

Most of the time is probably spent in accessing the disk. So perhaps buying an SSD is sensible. (Whatever you do, your application is I/O bound).

Apparently, your file is only about 100 MB. You could get it into the kernel's disk (file) cache just by reading it, e.g. with cat yourfile > /dev/null, before running your program. For such a small file (on a reasonable machine it fits in RAM), I wouldn't worry that much.

You could pre-process the file, e.g. to build a database (for sqlite, or a real RDBMS like PostgreSQL) or just a gdbm indexed file.

If using <stdio.h>, you might set a bigger buffer with setbuffer, or call fopen with a "rmt" mode (the m is a GNU glibc extension that asks it to mmap the file).
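A minimal sketch of the bigger-buffer idea, using the standard setvbuf rather than setbuffer (both install a caller-supplied buffer). The helper name open_with_buffer and the buffer size are just assumptions for illustration; whether a larger buffer actually helps a seek-heavy access pattern is something you would have to measure.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Open path for reading with a caller-chosen, fully buffered stdio buffer.
   On success *buf_out receives the buffer; it must stay alive until
   fclose() and be freed afterwards. (Hypothetical helper.) */
static FILE *open_with_buffer(const char *path, size_t buf_size, char **buf_out)
{
    assert(path != NULL && buf_out != NULL && buf_size > 0);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* setvbuf() must be called before any other operation on the stream. */
    char *buf = malloc(buf_size);
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, buf_size) != 0) {
        fclose(fp);
        free(buf);
        return NULL;
    }
    *buf_out = buf;
    return fp;
}
```

For example, open_with_buffer("yourfile", 1 << 20, &buf) would ask for a 1 MiB buffer.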

You could use mmap with madvise.
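Something along these lines, as a rough sketch only. The function names map_file and prefetch_range are made up, and the choice of MADV_RANDOM for the whole mapping plus MADV_WILLNEED for a block you are about to use is my assumption about what fits a "10% of the blocks" access pattern, not a measured result.

```c
#define _DEFAULT_SOURCE   /* exposes madvise() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure or for an empty file.
   (Hypothetical helper.) */
static const unsigned char *map_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    /* Access is scattered, so ask the kernel not to read ahead aggressively.
       This is only a hint; its return value is ignored here. */
    madvise(p, (size_t)st.st_size, MADV_RANDOM);

    *len_out = (size_t)st.st_size;
    return p;
}

/* Hint that bytes [off, off + len) of the mapping will be needed soon.
   madvise() requires a page-aligned start address, so round down. */
static int prefetch_range(const unsigned char *base, size_t off, size_t len)
{
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off & ~(page - 1);
    return madvise((void *)(base + start), len + (off - start), MADV_WILLNEED);
}
```

With the file mapped, a data block is just a pointer into the mapping (base + offset), so no copy into a separate buffer is needed; release the mapping with munmap when done.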

You could (perhaps in a separate thread) use the readahead syscall.
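A sketch of that idea with pthreads. The prefetch_job struct and the function names are hypothetical; readahead() itself is Linux-specific and blocks until the requested range has been read, which is why running it in its own thread can make sense. Build with -pthread.

```c
#define _GNU_SOURCE   /* readahead() is a Linux-specific call */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One prefetch request. (Hypothetical type.) */
typedef struct {
    int    fd;       /* open file descriptor */
    off_t  offset;   /* start of the range to pull into the page cache */
    size_t length;   /* number of bytes to prefetch */
    int    result;   /* 0 on success, -1 on failure; valid after join */
} prefetch_job;

/* Thread body: ask the kernel to populate the page cache for the range. */
static void *prefetch_main(void *arg)
{
    prefetch_job *job = arg;
    job->result = (int)readahead(job->fd, job->offset, job->length);
    return NULL;
}

/* Start the prefetch in the background. The job must outlive the thread.
   Returns 0 on success, like pthread_create(). */
static int start_prefetch(pthread_t *tid, prefetch_job *job)
{
    assert(tid != NULL && job != NULL);
    return pthread_create(tid, NULL, prefetch_main, job);
}
```

The main thread would call start_prefetch() for the next block it expects to need, keep working on the current block, and pthread_join() the thread before (or instead of) reading the prefetched range.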

But your file seems small enough that you should not bother that much. Are you sure it is really a performance issue? Do you read that file many thousands of times per day, or do you have many hundreds of such files?
