
Multi-threaded reading from a file in C++?

My application stores its data in a text file. I was testing for the fastest way to read it by multi-threading the operation. I used the following 2 techniques:

  1. Use as many streams as the NUMBER_OF_PROCESSORS environment variable indicates, each stream on a different thread. Divide the total number of lines in the file equally among the streams. Parse the text.

  2. Only one stream parses the entire file and loads the data into memory. Create threads (= NUMBER_OF_PROCESSORS - 1) to parse the data from memory.

The test was run on various file sizes from 100 kB to 800 MB. The data in the file looks like:

100.23123 -42343.342555 ...(and so on)
4928340 -93240.2 349 ...
...

The data is stored in a 2D array of double.
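For reference, a minimal sketch of how each such line could be parsed into one row of that 2D array with standard iostreams; the function and variable names are illustrative, not taken from the actual application:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Parse one whitespace-separated line of doubles into a row.
    std::vector<double> parseLine(const std::string& line) {
        std::vector<double> row;
        std::istringstream iss(line);
        double value;
        while (iss >> value)  // stops at the first token that is not a double
            row.push_back(value);
        return row;
    }

    int main() {
        std::vector<std::vector<double>> data;  // the 2D array of double
        std::string line;
        while (std::getline(std::cin, line))
            data.push_back(parseLine(line));
        std::cout << "rows: " << data.size() << '\n';
    }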

Result: Both methods take approximately the same time to parse the file.

Question: Which method should I choose?

Method 1 is bad for the hard disk, as multiple read accesses are performed at random locations simultaneously.

Method 2 is bad because the memory required is proportional to the file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content, and filling it again from the reader, but that increases the processing time.

Method 2 has a sequential bottleneck (the single-threaded reading and handing out of work items). According to Amdahl's Law, this will not scale indefinitely. It is a very fair and reliable method, though.

Method 1 has no such bottleneck and will scale. Just be sure not to cause random IO on the disk: I'd use a mutex so that only one thread reads at a time, in big sequential blocks of maybe 4-16 MB. In the time the disk takes to do a single head seek it could have read about 1 MB of data.
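Here is a minimal sketch of that scheme, assuming a shared std::ifstream and an 8 MB block size (all names and numbers are illustrative, not from the original code). Each block is extended to the next newline so that no line is split between two threads:

    #include <fstream>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    constexpr std::size_t kBlockSize = 8 * 1024 * 1024;  // 8 MB, inside the 4-16 MB range

    std::mutex ioMutex;  // serializes disk access across threads

    // Reads one big sequential block, extended to the next newline so that
    // no line is split between two threads. An empty result means EOF.
    std::string readBlock(std::ifstream& in) {
        std::lock_guard<std::mutex> lock(ioMutex);  // only one reader at a time
        std::string block(kBlockSize, '\0');
        in.read(&block[0], kBlockSize);
        block.resize(static_cast<std::size_t>(in.gcount()));
        std::string tail;
        if (in && std::getline(in, tail)) {  // complete the line cut off mid-block
            block += tail;
            block += '\n';
        }
        return block;
    }

    void worker(std::ifstream& in) {
        for (std::string block = readBlock(in); !block.empty();
             block = readBlock(in)) {
            // parse the block's lines here, outside the lock
        }
    }

    int main() {
        std::ifstream in("data.txt", std::ios::binary);
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 4;  // fallback when the count cannot be determined
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i)
            pool.emplace_back(worker, std::ref(in));
        for (auto& t : pool) t.join();
    }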

If parsing the lines takes a considerable amount of time, you can't use method 2 because of its big sequential part; it would not scale. If parsing is fast, though, use method 2, because it is easier to get right.

To illustrate the concept of a bottleneck: imagine 1,000,000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to hand out lines as quickly as they are demanded; you would not get 1e6 times the throughput, so this would not scale. But if 1e6 threads read independently from a very fast IO device, you would get 1e6 times the throughput, because there is no bottleneck. (I have used extreme numbers to make the point; the same idea applies at smaller scales.)

I'd prefer a slightly modified method 2: read the data sequentially in a single thread, in big chunks. When a chunk is ready, pass it to a thread pool where the data is processed. That way you get concurrent reading and processing.
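A minimal sketch of this producer-consumer setup, assuming a hand-rolled bounded queue (the class name and size constants are illustrative, not a reference implementation); the bound on the queue keeps memory use fixed even when the reader runs ahead of the parsers:

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // Bounded queue of text chunks handed from the reader to the workers.
    class ChunkQueue {
        std::queue<std::string> q_;
        std::mutex m_;
        std::condition_variable notEmpty_, notFull_;
        bool done_ = false;
        static constexpr std::size_t kMaxChunks = 4;  // bounds memory use
    public:
        void push(std::string chunk) {
            std::unique_lock<std::mutex> lock(m_);
            notFull_.wait(lock, [this] { return q_.size() < kMaxChunks; });
            q_.push(std::move(chunk));
            notEmpty_.notify_one();
        }
        bool pop(std::string& chunk) {  // returns false once drained and closed
            std::unique_lock<std::mutex> lock(m_);
            notEmpty_.wait(lock, [this] { return !q_.empty() || done_; });
            if (q_.empty()) return false;
            chunk = std::move(q_.front());
            q_.pop();
            notFull_.notify_one();
            return true;
        }
        void close() {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
            notEmpty_.notify_all();
        }
    };

    int main() {
        ChunkQueue queue;
        std::thread reader([&queue] {  // the single sequential reader
            std::ifstream in("data.txt");
            std::string chunk, line;
            while (std::getline(in, line)) {
                chunk += line;
                chunk += '\n';
                if (chunk.size() >= 8 * 1024 * 1024) {  // ~8 MB chunks
                    queue.push(std::move(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.empty()) queue.push(std::move(chunk));
            queue.close();
        });
        unsigned n = std::thread::hardware_concurrency();
        if (n < 2) n = 2;
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n - 1; ++i)  // NUMBER_OF_PROCESSORS - 1 parsers
            workers.emplace_back([&queue] {
                std::string chunk;
                while (queue.pop(chunk)) {
                    // parse the chunk's lines here
                }
            });
        reader.join();
        for (auto& w : workers) w.join();
    }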

With enough RAM you can do it without a single-thread bottleneck. For Linux:

1) mmap your whole file into RAM with MAP_LOCKED; this requires root or a system-wide permissions tweak. Or do it without MAP_LOCKED on an SSD, since SSDs handle random access well.

2) Give each thread a start position. Each thread processes data from the first newline after its own start position up to the first newline after the next thread's start position (see the sketch after this list).
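A minimal sketch of steps 1) and 2), assuming Linux and omitting MAP_LOCKED and most error handling (the file name, thread count, and helper names are illustrative). Because adjacent threads compute their boundary from the same nominal position, every line is processed exactly once:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <algorithm>
    #include <cstring>
    #include <thread>
    #include <vector>

    int main() {
        int fd = open("data.txt", O_RDONLY);
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);
        const std::size_t size = static_cast<std::size_t>(st.st_size);
        void* mem = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mem == MAP_FAILED) return 1;  // add MAP_LOCKED above if permitted
        const char* base = static_cast<const char*>(mem);

        unsigned numThreads = std::thread::hardware_concurrency();
        if (numThreads == 0) numThreads = 4;

        // Moves a nominal position forward past the first '\n' at or after it,
        // so every thread range begins and ends on a line boundary.
        auto align = [&](std::size_t pos) {
            if (pos == 0 || pos >= size) return std::min(pos, size);
            const char* nl = static_cast<const char*>(
                std::memchr(base + pos, '\n', size - pos));
            return nl ? static_cast<std::size_t>(nl - base) + 1 : size;
        };

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < numThreads; ++i) {
            std::size_t begin = align(size * i / numThreads);
            std::size_t end = align(size * (i + 1) / numThreads);
            pool.emplace_back([base, begin, end] {
                // parse the lines in [base + begin, base + end) here
            });
        }
        for (auto& t : pool) t.join();
        munmap(mem, size);
        close(fd);
    }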

PS: What is your program's CPU load? The HDD is probably the bottleneck.
