
Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB each) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.

I currently need to sustain a combined read rate of over 200 MB/s from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is beyond the disks' reach at the moment).

We have implemented two different read systems, both of which make heavy use of posix_advise: the first is an mmap-ed read in which we map the entirety of the data set and read on demand. The second is a read()/seek() based system.
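(For reference, a minimal sketch of the kind of hint both systems issue, assuming "posix_advise" refers to posix_fadvise(2); the function name and the ranges are illustrative, not from the question:)

#include <fcntl.h>

/* Tell the kernel we will need this byte range soon so it can start
 * readahead, and that access to the file is broadly sequential.
 * posix_fadvise returns 0 on success or an errno-style code. */
static int hint_read(int fd, off_t offset, off_t len)
{
    int rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    if (rc != 0)
        return rc;
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}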

Both work well, but only for moderate cases. The read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited; mmap is able to pre-cache data, making the sustained data rate of over 200 MB/s easy to maintain, but it cannot deal with large total data set sizes.

So my questions come down to these:

A: Can read()-type file I/O be further optimized beyond the posix_advise calls on Linux, or, having tuned the disk scheduler, VMM, and posix_advise calls, is that as good as we can expect?

B: Are there systematic ways for mmap to better deal with very large mapped data?

Mmap-vs-reading-blocks is a similar problem to the one I am working on and provided a good starting point, along with the discussions in mmap-vs-read.

Reads back to what? What is the final destination of this data?

Since it sounds like you are completely IO-bound, mmap and read should make no difference. The interesting part is how you get the data to your receiver.

Assuming you're putting this data into a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this with zero copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily, with the current file as stdin and the pipe as stdout.

#include <sys/wait.h>
#include <unistd.h>

pid_t pid = fork();
if (pid > 0) {
    waitpid(pid, NULL, 0);              /* parent: wait for cat to finish */
} else if (pid == 0) {
    dup2(dest, STDOUT_FILENO);          /* pipe becomes cat's stdout */
    dup2(source, STDIN_FILENO);         /* current file becomes cat's stdin */
    execlp("cat", "cat", (char *)NULL);
    _exit(127);                         /* only reached if exec fails */
}
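
For the zero-copy route, here is a minimal sketch of splicing a whole file into a pipe (assuming file_fd is an open file descriptor and pipe_fd is the write end of a pipe; both names are illustrative, not from the original answer):

#define _GNU_SOURCE
#include <fcntl.h>

/* Move the entire file into the pipe without copying through
 * userspace; one end of a splice must always be a pipe. */
static int splice_file_to_pipe(int file_fd, int pipe_fd)
{
    off64_t off = 0;
    ssize_t n;
    while ((n = splice(file_fd, &off, pipe_fd, NULL,
                       64 * 1024, SPLICE_F_MORE)) > 0)
        ;                          /* splice advances off for us */
    return n < 0 ? -1 : 0;         /* on -1, errno is set by splice */
}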

Update0

If your processing is file-agnostic and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin or a pipe.

To answer your more specific questions:

A: Can read()-type file I/O be further optimized beyond the posix_advise calls on Linux, or, having tuned the disk scheduler, VMM, and posix_advise calls, is that as good as we can expect?

That's as good as it gets as far as telling the kernel what to do from userspace. The rest is up to you: buffering, threading, etc., but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.

B: Are there systematic ways for mmap to better deal with very large mapped data?

Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing); a combined call is sketched after the list:

  • MAP_HUGETLB: Allocate the mapping using "huge pages."

    This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte-sized files.

  • MAP_NORESERVE: Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved, one might get SIGSEGV upon a write if no physical memory is available.

    This will prevent you from running out of memory while keeping your implementation simple, if you don't actually have enough physical memory plus swap for the entire mapping.

  • MAP_POPULATE: Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.

    This may give you speed-ups given sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
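
As a combined illustration (a sketch only; fd and file_len are assumed to be an open descriptor and the file's size, and MAP_HUGETLB is left out because for file mappings it only works on hugetlbfs-backed files):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Map a large file read-only, skipping swap reservation and
 * prefaulting the page tables so later reads don't page-fault. */
void *map_big_file(int fd, size_t file_len)
{
    void *p = mmap(NULL, file_len, PROT_READ,
                   MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE,
                   fd, 0);
    return p == MAP_FAILED ? NULL : p;
}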

Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
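
If that prediction is possible, the call itself is simple; a hedged sketch (fd, offset, and len are placeholders, not names from the answer):

#define _GNU_SOURCE
#include <fcntl.h>

/* Ask the kernel to pull len bytes at offset into the page cache.
 * The reads are scheduled in the background, though the call may
 * still block on filesystem metadata; treat it as a best-effort hint. */
static void prefetch_fragment(int fd, off64_t offset, size_t len)
{
    (void)readahead(fd, offset, len);
}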

And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
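
A sketch of that idea, assuming the small blocks tend to be near one another: do one big pread per half-megabyte chunk and serve the sub-2K blocks from memory (all names and sizes here are illustrative):

#include <string.h>
#include <unistd.h>

#define CHUNK (512 * 1024)       /* half a megabyte per disk read */

struct chunk_reader {
    int    fd;
    off_t  base;                 /* file offset of buf[0] */
    size_t filled;               /* valid bytes in buf */
    char   buf[CHUNK];
};

/* Serve a small block from the buffer, refilling it with one big
 * pread only when the requested range is not already in memory. */
static ssize_t get_block(struct chunk_reader *r, off_t off,
                         size_t len, char *out)
{
    if (off < r->base || off + (off_t)len > r->base + (off_t)r->filled) {
        ssize_t n = pread(r->fd, r->buf, CHUNK, off);
        if (n < (ssize_t)len)
            return -1;           /* short read or I/O error */
        r->base = off;
        r->filled = (size_t)n;
    }
    memcpy(out, r->buf + (off - r->base), len);
    return (ssize_t)len;
}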

The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).

mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call into the OS for every chunk, which becomes very slow.

I would also advise, as Basile did, reading more than 2KB consecutively so the disk doesn't have to seek that often.
