
Reading a file in C++

I am writing an application to monitor a file and then match some pattern in that file. I want to know the fastest way to read a file in C++: is reading it line by line faster, or is reading it in chunks faster?

Your question is more about the performance of hardware, operating systems and run-time libraries than it is about programming languages. When you start reading a file, the OS is probably loading it in chunks anyway: since the file is stored that way on disk, it makes sense for the OS to load each chunk entirely on first access and cache it, rather than read the chunk, extract the requested data and discard the rest.

Which is faster, line by line or a chunk at a time? As always with these things, the answer is not something you can predict; the only way to know for sure is to write a line-by-line version and a chunk-at-a-time version and profile them (measure how long each version takes).
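Below is a minimal profiling sketch of that comparison, assuming a placeholder file `log.txt` and placeholder pattern `ERROR`: the same search is done once with `std::getline` and once against a single buffer holding the whole file, and each pass is timed with `std::chrono`.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Count lines containing `pattern`, reading one line at a time.
static std::size_t count_line_by_line(const std::string& path, const std::string& pattern) {
    std::ifstream in(path);
    std::string line;
    std::size_t hits = 0;
    while (std::getline(in, line))
        if (line.find(pattern) != std::string::npos) ++hits;
    return hits;
}

// Count occurrences of `pattern`, reading the whole file into one buffer first.
static std::size_t count_whole_buffer(const std::string& path, const std::string& pattern) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();                          // pull the whole file into memory
    const std::string data = ss.str();
    std::size_t hits = 0;
    for (std::size_t pos = 0; (pos = data.find(pattern, pos)) != std::string::npos; pos += pattern.size())
        ++hits;
    return hits;
}

int main() {
    const std::string path = "log.txt";        // placeholder file name
    const std::string pattern = "ERROR";       // placeholder pattern
    for (auto fn : { &count_line_by_line, &count_whole_buffer }) {
        const auto t0 = std::chrono::steady_clock::now();
        const std::size_t hits = fn(path, pattern);
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            std::chrono::steady_clock::now() - t0).count();
        std::cout << hits << " matches, " << ms << " ms\n";
    }
}
```

Run it against a realistically sized file on the target machine; the numbers, not intuition, should decide which version to keep.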

You could try a memory-mapped file to map the file directly into memory, then use standard C++ logic to search for the pattern you need.
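Here is a minimal memory-mapping sketch, assuming a POSIX system (`open`/`fstat`/`mmap`); on Windows the equivalent calls would be `CreateFileMapping`/`MapViewOfFile`. The file name and pattern are placeholders. Once mapped, the file contents can be searched like any other character range.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <iostream>
#include <string_view>

int main() {
    const char* path = "log.txt";                    // placeholder file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; the OS pages it in on demand.
    void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Treat the mapping as an ordinary character range and search it.
    std::string_view data(static_cast<const char*>(addr), static_cast<std::size_t>(st.st_size));
    constexpr std::string_view pattern = "ERROR";    // placeholder pattern
    std::size_t hits = 0;
    for (std::size_t pos = 0; (pos = data.find(pattern, pos)) != std::string_view::npos; pos += pattern.size())
        ++hits;
    std::cout << hits << " matches\n";

    munmap(addr, st.st_size);
    close(fd);
}
```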

In general, reading large amounts of a file into a buffer and then parsing the buffer is a lot faster than reading individual lines. The actual proof is to profile code that reads line by line, then profile code that reads into large buffers, and compare the profiles.

The foundation for this justification is:

  • Reduction of I/O Transactions
  • Keeping the Hard Drive Spinning
  • Parsing Memory Is Faster

I improved the performance of one application from 65 minutes down to 2 minutes by applying these techniques.

Reduction of I/O Transactions
Reducing the number of I/O transactions results in fewer calls to the operating system, cutting the time spent there. It also reduces the number of branches in your code, which improves the performance of the processor's instruction pipeline, and it reduces traffic to the hard drive: the drive has fewer commands to process, so it has less overhead.

Keeping the Hard Drive Spinning
To access a file, the hard drive has to ramp its motor up to speed (which takes time), position the head to the desired track and sector, and read the data. Positioning the head and ramping up the motor is overhead required by every transaction, while the overhead of actually reading the data is very small. The objective is to read as much data as possible in one transaction, because that is where the hard drive is most efficient. Reducing the number of transactions reduces the time spent waiting for the motor to spin up and the heads to be positioned.

Although modern computers have caches for both data and commands, reducing the number of requests will speed things up. Larger "payloads" allow more efficient use of those caches and avoid the overhead of sorting the requests.

Parsing Memory Is Faster
Reading from memory is always faster than reading from an external source. Reading a second line of text from a buffer requires only incrementing a pointer; reading a second line from a file requires an I/O transaction to get the data into memory. If your program has memory to spare, haul the data into memory and then search the memory.

Too Much Data Negates The Performance Savings
There is a finite amount of RAM on the computer for applications to share. Accessing more memory than this may cause the computer to "page", forwarding the request to the hard drive (known as virtual memory). In that case there may be little savings, because the hard drive is accessed anyway (by the operating system, without your program's knowledge). Profiling will give you a good indication of the optimum size of the data buffer.

The application I optimized was reading one byte at a time from a 2 GB file. Performance improved greatly when I changed the program to read 1 MB chunks of data. This also allowed additional performance gains from loop unrolling.
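A sketch of that kind of fix is below, assuming for simplicity that the pattern never spans a chunk boundary (a real version would carry the last few bytes of one chunk over to the next). The file name and pattern are placeholders.

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string_view>
#include <vector>

int main() {
    std::ifstream in("big.dat", std::ios::binary);   // placeholder file name
    std::vector<char> buf(1 << 20);                  // read 1 MB at a time, not one byte
    const std::string_view pattern = "ERROR";        // placeholder pattern
    std::size_t hits = 0;
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())), in.gcount() > 0) {
        std::string_view chunk(buf.data(), static_cast<std::size_t>(in.gcount()));
        for (std::size_t pos = 0; (pos = chunk.find(pattern, pos)) != std::string_view::npos;
             pos += pattern.size())
            ++hits;
    }
    std::cout << hits << " matches\n";
}
```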

Hope this helps.

The OS (or even the C++ class you use) probably reads the file in chunks and caches it even if you read it line by line, in order to minimize disk access (from the operating system's point of view, it is faster to serve data from a memory buffer than from the hard disk).

Note that a good way to improve the performance of your program (if it is really time-critical) is to minimize the number of calls to operating-system functions (which manage its resources).
