
What is the most efficient way to read formatted data from a large file?

Options: 1. Reading the whole file into one huge buffer and parsing it afterwards. 2. Mapping the file to virtual memory. 3. Reading the file in chunks and parsing them one by one.

The file can contain quite arbitrary data, but it's mostly numbers, values, strings and so on, formatted in certain ways (commas, brackets, quotation marks, etc.). Which option would give me the greatest overall performance?

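If the file is very large, you could consider using multiple threads with option 2 or 3. Each thread can process a single chunk of the file or of the mapped memory, and this way you can overlap I/O with computation (parsing).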
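A minimal sketch of that overlap, assuming the file holds comma-separated integers (the format, the function names and the chunk size are all illustrative assumptions, not something from the question). One thread reads chunks while the main thread parses them. In CPython the file read releases the GIL, so I/O can overlap with parsing, but the parsing itself would not speed up by adding more parser threads.

```python
import queue
import threading

def read_chunks(path, out_queue, chunk_size=1 << 20):
    """Producer: read the file in binary chunks and hand them to the parser."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out_queue.put(chunk)
    out_queue.put(None)  # sentinel: no more data

def parse_numbers(in_queue, results):
    """Consumer: parse comma-separated integers from the queued chunks."""
    tail = b""  # partial token left over from the previous chunk
    while True:
        chunk = in_queue.get()
        if chunk is None:
            break
        data = tail + chunk
        # Everything before the last comma is complete; the rest may be cut off.
        head, sep, tail = data.rpartition(b",")
        if not sep:
            continue  # no comma yet: rpartition left all of data in tail
        results.extend(int(tok) for tok in head.split(b",") if tok.strip())
    if tail.strip():
        results.append(int(tail))

def parse_file_overlapped(path, chunk_size=1 << 20):
    """Read on a background thread while parsing on the calling thread."""
    q = queue.Queue(maxsize=4)  # bounded, so the reader cannot run far ahead
    results = []
    reader = threading.Thread(target=read_chunks, args=(path, q, chunk_size))
    reader.start()
    parse_numbers(q, results)
    reader.join()
    return results
```

The bounded queue keeps memory use at a few chunks regardless of file size.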

It's hard to give a general answer to your question as choosing the "right" strategy heavily depends on the organization of the data you are reading.

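Especially if there's a really huge amount of data to be processed, option 1 won't work anyway, as the available amount of main memory poses an upper limit on the buffer size. Option 2 is limited by the process's virtual address space rather than by main memory, which mostly matters for 32-bit processes.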
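For comparison, option 2 might look like the sketch below. It assumes comma-delimited data, a non-empty file, and a 64-bit process where address space is not the bottleneck; `iter_fields_mapped` is a made-up name. The operating system pages the mapped data in on demand, so only the parts being scanned have to be resident.

```python
import mmap

def iter_fields_mapped(path, delimiter=b","):
    """Yield delimiter-separated fields from a read-only memory mapping."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        while True:
            end = mm.find(delimiter, start)
            if end == -1:
                yield mm[start:]  # last field, up to end of file
                return
            yield mm[start:end]   # slicing copies only this field
            start = end + 1
```

Each field comes back as a small `bytes` object, so the caller can convert it (for example with `int` or `float`) without ever holding the whole file in a Python buffer.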

Most probably the biggest gain in terms of efficiency can be accomplished by (re)structuring the data you are going to process.

Before addressing the problem mentioned in the question, the first thing I'd check is whether the data can be organized so that whole chunks never need to be processed at all.
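As one hypothetical illustration of such a reorganization: if the data can be stored as fixed-size binary records, any record can be reached with a single seek instead of parsing everything before it. The record layout below (a 32-bit id plus a 64-bit float) and the function names are assumptions made up for this sketch.

```python
import struct

# Hypothetical layout: little-endian int32 id followed by a float64 value.
RECORD = struct.Struct("<id")

def write_records(path, rows):
    """Write (id, value) pairs as fixed-size binary records."""
    with open(path, "wb") as f:
        for rec_id, value in rows:
            f.write(RECORD.pack(rec_id, value))

def read_record(path, index):
    """Jump straight to record number `index` without reading earlier ones."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))
```

Whether this is feasible depends entirely on how regular the real data is.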

In terms of efficiency, choosing any one of the mentioned methods over the others gains you only a constant factor, whereas organizing your data well might bring a much bigger improvement. The bigger the data, the more important this decision becomes.

Some facts about the data that seem interesting enough to take into consideration include:

  • Is there any regular pattern to the data you are going to process?
  • Is the data mostly static or highly dynamic?
  • Does it have to be parsed sequentially or is it possible to process data in parallel?

It makes no sense to read the entire file all at once and then convert from text to binary data; it's more convenient to write, but you run out of memory faster. I would read the text in chunks and convert as you go. The converted data, in binary format instead of text, will likely take up less space than the original source text anyway.
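A rough sketch of converting as you go, assuming the text is a comma-separated list of numbers (the format and the name `load_numbers_chunked` are assumptions). The values end up in a compact `array` of 8-byte doubles, and only one chunk of text is held at a time.

```python
from array import array

def load_numbers_chunked(path, chunk_size=1 << 16):
    """Read text in chunks and convert each number to binary immediately."""
    values = array("d")  # packed 8-byte doubles
    tail = ""            # partial number cut off at the end of a chunk
    with open(path, "r", encoding="ascii") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = (tail + chunk).split(",")
            tail = parts.pop()  # last piece may be incomplete; keep it for later
            values.extend(float(p) for p in parts if p.strip())
    if tail.strip():
        values.append(float(tail))
    return values
```

Whether the binary form is actually smaller depends on the data: a number written as `3` takes one character as text but 8 bytes as a double, while `3.14159265358979` shrinks by half.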


 