
Choosing when to mmap vs read a file in a compiler

Original post: 2022-03-14 05:59:00 · Tags: c, linux, macos, io

I know there are many similar questions to this one which have already been asked here, but I have a somewhat more nuanced question than what I could generally find on the web. I'm currently working on somewhat of a toy C compiler just for fun, but I did want to see what type of performance I could get out of it if I were able to focus only on one version of the core language standard, without the many optional features provided by, e.g., clang or gcc. As such, I want to be able to efficiently read and process files for my lexer from disk. Given how many header files the average source file includes (especially considering recursive includes) and the number of source files in the average program, efficient file reading will be very important. The two system types I want to target are Linux and macOS. Both of these systems provide two main ways of dealing with file I/O: (buffered or unbuffered) open and read calls to read a file as a stream, and mmap to directly allocate virtual memory into which the file is transparently mapped.
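To make the two approaches described above concrete, here is a minimal sketch of loading a whole file each way. The function names load_with_read and load_with_mmap are hypothetical helpers invented for illustration, not part of any existing compiler, and error handling is kept deliberately simple.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read the whole file into a heap buffer with
 * open/read. Returns the buffer (caller frees it) and stores the byte
 * count in *len, or returns NULL on error. */
char *load_with_read(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }
    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);          /* +1 for a NUL sentinel */
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    size_t got = 0;
    while (got < size) {                   /* read may return short counts */
        ssize_t n = read(fd, buf + got, size - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;                  /* interrupted, retry */
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)
            break;                         /* file shrank while reading */
        got += (size_t)n;
    }
    close(fd);
    buf[got] = '\0';
    *len = got;
    return buf;
}

/* Hypothetical helper: map the whole file read-only. Returns the mapping
 * (release it with munmap(ptr, *len)), or NULL on error or for an empty
 * file, since a zero-length mapping is not valid. */
const char *load_with_mmap(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    size_t size = (size_t)st.st_size;
    void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                             /* the mapping outlives the fd */
    if (p == MAP_FAILED)
        return NULL;
    *len = size;
    return p;
}
```

One difference that may matter for a lexer: the heap version can append a NUL sentinel, while the read-only mapping has no spare byte after the data when the file size is an exact multiple of the page size, so the lexer would have to bounds-check against the length instead.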

Most askers of similar questions (as above) seem to have had very different use cases: either they are dealing with a small number of very large files or they are dealing with applications where file I/O is not truly a major bottleneck. To be fair, I may be being naive, as I haven't completed the program yet, so I may end up falling into the second group, but I did want to at least see what others thought in this regard. For instance, I know LLVM will use mmap to read files if they are over the current page size or 16KB, and will use open/read calls if they are not.
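A size-based policy like the one described for LLVM above could be sketched as follows. This is only an illustration of the idea, not LLVM's actual code: the name choose_strategy, the enum, and the reading of "page size or 16KB" as "the larger of the two" are all assumptions.

```c
#include <stddef.h>
#include <unistd.h>

enum load_strategy { LOAD_READ, LOAD_MMAP };

/* Assumed cutoff taken from the figure quoted in the question. */
#define MMAP_MIN_BYTES ((size_t)16 * 1024)

/* Hypothetical policy: map files larger than max(page size, 16KB),
 * read everything else into a heap buffer. */
enum load_strategy choose_strategy(size_t file_size)
{
    size_t threshold = MMAP_MIN_BYTES;
    long page = sysconf(_SC_PAGESIZE);     /* -1 on failure */
    if (page > 0 && (size_t)page > threshold)
        threshold = (size_t)page;
    return file_size > threshold ? LOAD_MMAP : LOAD_READ;
}
```

Whether 16KB is the right cutoff for a given machine is something only measurement can tell; the constant is easy to tune once a benchmark exists.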

The question is then: which of these methods is best when dealing with a large number of files of varying sizes? The goal is to be able to read the files into memory and parse them character by character multiple times (preprocessor and main C language processing). Is there some good threshold I could find where files over a given length should be mapped vs buffered in the heap? Should I just use one of these approaches over the other in all cases? My goal is mainly on speed: I don't want to have to bottleneck on file I/O when I could be parsing code instead.

1 Answer

My goal is mainly on speed: I don't want to have to bottleneck on file I/O when I could be parsing code instead.

Reading a file with read and mapping it with mmap should perform the same amount of I/O; the kernel has to read the data from disk into memory either way.

If you have many files smaller than the page size, using mmap will waste a lot of memory, since every mapping occupies a whole number of pages. This may not matter on a 64-bit machine, but you could run out of VM space if your compiler is built in 32-bit mode.
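The waste mentioned above comes from mappings being rounded up to whole pages. A small sketch of the arithmetic, using hypothetical helper names and taking the page size as a parameter (a real program would query it with sysconf(_SC_PAGESIZE)):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of address space a mapping of file_size bytes occupies:
 * file_size rounded up to the next multiple of page_size. */
size_t mapped_bytes(size_t file_size, size_t page_size)
{
    assert(page_size > 0);
    return (file_size + page_size - 1) / page_size * page_size;
}

/* Bytes in the final page that hold no file data. */
size_t slack_bytes(size_t file_size, size_t page_size)
{
    return mapped_bytes(file_size, page_size) - file_size;
}
```

For example, with a 4096-byte page a 300-byte header still occupies 4096 bytes of address space, leaving 3796 bytes of slack, so a thousand such headers would use roughly 4MB of address space for about 300KB of text.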

If you are going to parse the same files repeatedly (which is an unusual thing to do in a compiler), you may be better off with mmap.

You could also get drastically different performance results depending on how much memory your machine has, whether it has an SSD or a spinning disk, etc.

TL;DR: you are unlikely to get a definitive answer; there are too many variables for one.