
Reading a file of arbitrary length in C

What's the most idiomatic/efficient way to read a file of arbitrary length in C?

  1. Get the filesize of the file in bytes and issue a single fread()
  2. Keep fread()ing into a constant-size buffer until getting EOF
  3. Anything else?

Avoid using any technique which requires knowing the size of the file in advance. That leaves exactly one technique: read the file a bit at a time, in blocks of a convenient size.

Here's why you don't want to try to find the filesize in advance:

  1. If it is not a regular file, there may not be any way to tell. For example, you might be reading directly from a console, or taking piped input from a previous data generator. If your program requires the filesize to be knowable, these useful input mechanisms will not be available to your users, who will complain or choose a different tool.

  2. Even if you can figure out the filesize, you have no way of preventing it from changing while you are reading the file. If you are not careful about how you read the file, you might open a vulnerability which could be exploited by adversarial programs.

    For example, if you allocate a buffer of the "correct" size and then read until you get an end-of-file condition, you may end up overwriting random memory. (Multiple reads may be necessary if you use an interface like read() which might read less data than requested.) Or you might find that the file has been truncated; if you don't check the amount of data read, you might end up processing uninitialised memory, leading to information leakage.

In practice, you usually don't need to keep the entire file content in memory. You'll often parse the file (notably if it is textual), or at least read the file in smaller pieces, and for that you don't need it entirely in memory. For a textual file, reading it line by line (perhaps with some state inside your parser) is often enough (using fgets or getline).

Files exist (notably on disks or SSDs) because they can usually be much "bigger" than your computer's RAM. In fact, files were invented (more than 50 years ago) to deal with data larger than memory. Distributed file systems can also be very big (and accessed remotely, even from a laptop, e.g. by NFS, CIFS, etc.).

Some file systems are capable of storing petabytes of data (on supercomputers), with individual files of many terabytes (much larger than available RAM).

You're also likely to use some databases. These routinely have terabytes of data. See also this answer (about the realistic size of sqlite databases).

If you really want to read a file entirely into memory using stdio (but you should avoid doing that, because you generally want your program to be able to handle a lot of data in files, so reading the entire file into memory is generally a design error), you indeed could loop on fread (or fscanf, or even fgetc) till end-of-file. Notice that feof is useful only after some input operation.

On current laptop or desktop computers, you might prefer (for efficiency) to use buffers of a few megabytes, and you certainly can deal with big files of several hundred gigabytes (much larger than your RAM).

On POSIX file systems, you might do memory-mapped IO with e.g. mmap(2) - but that might not be faster than read(2) with large buffers (of a few megabytes). You could use readahead(2) (Linux-specific) and posix_fadvise(2) (or madvise(2) if using mmap) to tune performance by hinting your OS kernel.

If you have to code for Microsoft Windows, you could study its WinAPI and find some way to do memory-mapped IO.

In practice, file data (notably if it was accessed recently) often stays in the page cache, which is of paramount importance for performance. When that is not the case, your hardware (disk, controller, ...) becomes the bottleneck and your program becomes I/O bound (in that case, no software trick could significantly improve the performance).

