Can I use a memory-mapped file to remove or reduce the disk I/O time in a data generating-processing workflow?

Original post: 2022-09-09 21:45:07 · Tags: caching, parallel-processing, ipc, mmap

I have two programs. The first (let's call it A) creates a huge chunk of data and saves it to disk; the second (let's call it B) reads the data from disk and processes it. In the old workflow, I run program A, save the data to disk, then run program B, load the data from disk, and process it. However, this is very time-consuming, since large data goes through disk I/O twice.

One trivial way to solve this problem is to simply merge the two programs. However, I do NOT want to do this (imagine that, with a single dataset, we want multiple data processing programs running in parallel on the same node, which makes it necessary to keep the two programs separate). I was told that there is a technique called memory-mapped files, which allows multiple processes to communicate and share memory. I found a reference in the shm_unlink man page: https://man7.org/linux/man-pages/man3/shm_unlink.3.html

However, in the example shown there, the execution of the two programs (processes) overlaps, and the two processes communicate with each other in a "bouncing" fashion. In my case, I am not allowed to have such a communication pattern. For some reason I have to make sure that program B is executed only after program A has finished (serial workflow). I just wonder whether mmap can still be used in my case. I know it seems weird, since at some point there is memory allocated by program A while no program is running (between A and B), which might lead to a memory leak, but if this optimization is possible, it would be a huge improvement. Thanks!

1 Answer

Memory-mapped files and shared memory are two different concepts.

The former enables you to map a file into memory so that reads from the mapped memory read the file and writes to it write into the file. This kind of operation is very useful for abstracting I/O accesses as basic memory reads/writes. It is especially useful for big-data applications (or just for reusing code so that it computes on files directly).
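To make the "memory write lands in the file" idea concrete, here is a minimal sketch using Python's standard `mmap` module. The helper name `roundtrip_through_mapping` and the temporary file are my own choices for illustration, not something from the question.

```python
import mmap
import os
import tempfile


def roundtrip_through_mapping(payload: bytes) -> bytes:
    """Write payload into a file through a memory mapping, then read the file back."""
    fd, path = tempfile.mkstemp()  # hypothetical scratch file for the demo
    try:
        # A mapping cannot be longer than the file, so size the file first.
        os.ftruncate(fd, len(payload))
        with mmap.mmap(fd, len(payload)) as view:
            view[:] = payload  # an ordinary slice assignment; the bytes end up in the file
            view.flush()       # ask the OS to write the dirty pages back
        os.close(fd)
        # Read the file with regular file I/O to show the data really is there.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.unlink(path)
```

Calling `roundtrip_through_mapping(b"hello mmap")` should give back the same bytes. Note that the payload must be non-empty, since an empty file cannot be mapped.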

The latter is typically used by multiple running programs to communicate with each other while being in different processes. For example, programs like the Chrome/Chromium browser use it to communicate between tabs that are different processes (for the sake of security). It is also used in HPC for fast MPI communication between processes located on the same computing node.

Linux also enables you to use pipes so that one process can send data to another. The pipe is closed when the process emitting data ends. This is useful for dataflow-based processing (e.g. text filtering using grep).
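As a rough illustration of the pipe pattern, the sketch below starts a small producer as a child process and consumes its output through a pipe. The producer script and the name `consume_squares` are made up for the example. Keep in mind that a pipe requires the producer and the consumer to run at the same time, which is exactly what the question rules out for A and B.

```python
import subprocess
import sys

# Hypothetical producer: a one-line script that prints a few numbers and exits.
PRODUCER = "for i in range(5): print(i * i)"


def consume_squares() -> list[int]:
    """Run the producer and read its output through a pipe until end-of-file."""
    with subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        # Iteration stops when the producer exits and its end of the pipe closes.
        return [int(line) for line in proc.stdout]
```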

In your case, it seems that one process runs and the other starts only when the first has finished. This means the data must be stored in a file. Shared memory between two simultaneously running processes cannot be used here. That being said, this does not mean the file has to be stored on a storage device. On Linux, for example, you can store files in RAM using RAMFS, which is a filesystem stored in RAM. Note that files stored in such a filesystem are not saved anywhere when the machine is shut down (accidentally or deliberately), so it should not be used for critical data unless you can be sure the machine will not crash or be shut down. RAMFS has limited space and, AFAIK, configuring such a filesystem requires root privileges.
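Here is a sketch of the serial workflow with a RAM-backed file. It assumes that `/dev/shm` is a tmpfs mount writable by ordinary users, which is the case on most Linux distributions but is an assumption about your machine; otherwise it falls back to the regular temp directory, which may be on disk. The file name and the two functions standing in for A and B are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Assumption: /dev/shm is a RAM-backed tmpfs mount, as on most Linux systems.
_SHM = "/dev/shm"
RAM_DIR = _SHM if os.path.isdir(_SHM) and os.access(_SHM, os.W_OK) else tempfile.gettempdir()
DATA_PATH = os.path.join(RAM_DIR, f"a_to_b_demo_{os.getpid()}.bin")  # hypothetical name


def program_a(path: str, count: int) -> None:
    """Stand-in for A: write `count` little-endian 64-bit integers, then finish."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(struct.pack("<q", i))


def program_b(path: str) -> int:
    """Stand-in for B: map the file read-only and sum the integers in place."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        total = 0
        for offset in range(0, len(view), 8):
            (value,) = struct.unpack_from("<q", view, offset)
            total += value
        return total
```

A would call `program_a(DATA_PATH, n)` and exit; B would later call `program_b(DATA_PATH)` and remove the file when done. The file keeps occupying RAM until it is deleted, so cleaning it up is what avoids the "leak" mentioned in the question.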

An alternative solution is to create a mediator process (M) with one purpose: receiving data from one process and sending it to other processes. Shared memory can be used in this case, since A and B each communicate with M and each pair of processes (A and M, then M and B) is alive simultaneously. A can write directly into the memory shared by M once it is shared, and B can read it later. M needs to be created before A and B start and must finish after they finish.
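The lifetime rule for the mediator can be sketched with Python's `multiprocessing.shared_memory`. To keep the example self-contained, the three roles are played by plain functions inside a single process; in a real setup, M, A and B would be separate programs and A and B would attach using the segment name that M publishes. All function names are hypothetical.

```python
from multiprocessing import shared_memory


def mediator_create(size: int) -> shared_memory.SharedMemory:
    """M: allocate a named segment and keep the handle for the whole workflow."""
    return shared_memory.SharedMemory(create=True, size=size)


def program_a(name: str, payload: bytes) -> None:
    """A: attach by name, write the data, detach. A may exit afterwards."""
    seg = shared_memory.SharedMemory(name=name)
    try:
        seg.buf[:len(payload)] = payload
    finally:
        seg.close()  # detaches only; the segment itself stays alive


def program_b(name: str, length: int) -> bytes:
    """B: attach by name later and read what A left behind."""
    seg = shared_memory.SharedMemory(name=name)
    try:
        return bytes(seg.buf[:length])
    finally:
        seg.close()


def run_workflow(payload: bytes) -> bytes:
    """M starts first, A runs, then B runs, and M frees the segment last."""
    owner = mediator_create(len(payload))
    try:
        program_a(owner.name, payload)
        return program_b(owner.name, len(payload))
    finally:
        owner.close()
        owner.unlink()  # actually releases the memory
```

One caveat if you split this into real processes: as far as I know, on Python versions before 3.13 the resource tracker of an attaching process may unlink the segment when that process exits, so check the behavior on your version before relying on it.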