
Can I use a memory-mapped file to remove or reduce disk IO time in a data-generating/processing workflow?

I have two programs. The first program (let's call it A) creates a huge chunk of data and saves it to disk; the second program (let's call it B) reads the data from disk and performs some processing on it. The old workflow is: run program A, which saves the data to disk, then run program B, which loads the data from disk and processes it. This is very time-consuming, since large data has to go through disk IO twice.

One trivial way to solve this problem is to simply merge the two programs. However, I do NOT want to do this (imagine that, for a single dataset, we want multiple data processing programs running in parallel on the same node, which makes it necessary to keep the two programs separate). I was told there is a technique called memory-mapped files, which allows multiple processes to communicate and share memory. I found some reference material at https://man7.org/linux/man-pages/man3/shm_unlink.3.html .

However, in the example shown there, the execution of the two programs (processes) overlaps, and the two processes communicate with each other in a "bouncing" fashion. In my case, I am not allowed to have such a communication pattern. For some reason I have to make sure that program B is executed only after program A has finished (a serial workflow). I just wonder whether mmap can still be used in my case? I know it seems weird, since at some point there would be memory allocated by program A while no program is running (between A and B), which might lead to a memory leak, but if this optimization is possible, it would be a huge improvement. Thanks!

Memory-mapped files and shared memory are two different concepts.

The former enables you to map a file into memory, so that reads from the memory region read the file and writes to the memory region write into the file. This kind of operation is very useful for abstracting IO accesses as basic memory reads/writes. It is especially useful for big-data applications (or simply for reusing in-memory code to operate on files directly).
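A minimal sketch of this in C, using POSIX `mmap` (the path `/tmp/mmap_demo.bin` and the 4 KiB size are arbitrary examples):

```c
/* Sketch: map a file into memory so that ordinary loads/stores
   become file reads/writes. Error handling is kept minimal. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `msg` into the file at `path` through a shared mapping,
   then read it back through the same mapping into `out`. */
int roundtrip(const char *path, const char *msg, char *out, size_t outlen) {
    size_t size = 4096;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)size) < 0) { close(fd); return -1; }

    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    strncpy(p, msg, size - 1);      /* a plain memory write updates the file */
    strncpy(out, p, outlen - 1);    /* a plain memory read sees file content */
    out[outlen - 1] = '\0';

    munmap(p, size);
    close(fd);
    return 0;
}
```

With `MAP_SHARED`, the kernel writes the modified pages back to the file, so no explicit `read`/`write` calls are needed.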

The latter is typically used by multiple running programs to communicate with each other while being in different processes. For example, browsers like Chrome/Chromium use it to communicate between tabs, which are separate processes (for the sake of security). It is also used in HPC for fast MPI communication between processes running on the same computing node.
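A sketch of the POSIX shared-memory API mentioned in the question's man-page link: a named object is created with `shm_open` and persists until `shm_unlink`. Any process that opens the same name while the object exists sees the same bytes; here, for brevity, both the "creator" and "attacher" sides live in one function (the name `/demo_shm` is an example; older glibc may require linking with `-lrt`):

```c
/* Sketch: create a named POSIX shared-memory object, write through
   one mapping, and read the same bytes back through a second,
   name-based mapping, as a separate process would. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int shm_demo(char *out, size_t outlen) {
    const char *name = "/demo_shm";   /* example object name */
    size_t size = 4096;

    /* creator side: make the object, size it, write into it */
    int fd1 = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd1 < 0) return -1;
    if (ftruncate(fd1, (off_t)size) < 0) return -1;
    char *w = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd1, 0);
    if (w == MAP_FAILED) return -1;
    strcpy(w, "shared bytes");

    /* attach side: opening the same name yields the same memory */
    int fd2 = shm_open(name, O_RDONLY, 0);
    if (fd2 < 0) return -1;
    char *r = mmap(NULL, size, PROT_READ, MAP_SHARED, fd2, 0);
    if (r == MAP_FAILED) return -1;
    strncpy(out, r, outlen - 1);
    out[outlen - 1] = '\0';

    munmap(w, size); munmap(r, size);
    close(fd1); close(fd2);
    shm_unlink(name);   /* object is destroyed once unlinked and unmapped */
    return 0;
}
```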

Linux also enables you to use pipes, so that one process can send data to another. The pipe is closed when the process emitting data ends. This is useful for dataflow-based processing (e.g. text filtering using grep).
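A minimal sketch of the pipe mechanism with `pipe` and `fork`: the child streams bytes into the write end, and the parent reads until the child closes it (which is what signals end-of-data downstream):

```c
/* Sketch: a child process sends data to its parent through a pipe.
   Closing the write end is what lets the reader see EOF. */
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int pipe_demo(char *out, size_t outlen) {
    int fds[2];                 /* fds[0]: read end, fds[1]: write end */
    if (pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid == 0) {                        /* child: the writer */
        close(fds[0]);
        const char *msg = "data";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);                     /* signals EOF to the reader */
        _exit(0);
    }

    close(fds[1]);                         /* parent: the reader */
    ssize_t n = read(fds[0], out, outlen - 1);
    out[n < 0 ? 0 : n] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

At the shell level this is the same mechanism behind `A | B`, which only works when both processes run concurrently.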

In your case, it seems that one process runs and the other starts only when the first process has finished. This means the data must be stored in a file; shared memory cannot be used here. That being said, this does not mean the file has to be stored on a storage device. On Linux, for example, you can keep files in RAM using ramfs, a filesystem stored in RAM. Note that files stored in such a filesystem are not saved anywhere when the machine is shut down (accidentally or deliberately), so it should not be used for critical data unless you can be sure the machine will not crash or be shut down. A ramfs has limited space and, AFAIK, configuring such a filesystem requires root privileges.
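The appealing part of this approach is that neither program changes: A's save and B's load stay ordinary file IO, and only the path is pointed at the RAM-backed mount (a mount point such as `/mnt/ramdisk` is an assumption here; the helper names below are illustrative):

```c
/* Sketch: program A's save and program B's load as plain file IO.
   Pointing `path` at a RAM-backed filesystem (e.g. a ramfs mount,
   an assumption) avoids the disk without changing this code. */
#include <stdio.h>
#include <string.h>

/* what program A would do at the end of its run */
int save(const char *path, const char *data) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t len = strlen(data);
    size_t n = fwrite(data, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}

/* what program B would do at the start of its run */
int load(const char *path, char *out, size_t outlen) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(out, 1, outlen - 1, f);
    out[n] = '\0';
    fclose(f);
    return 0;
}
```

Because the file outlives both processes, this matches the serial A-then-B constraint, unlike shared memory or pipes.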

An alternative solution is to create a mediator process (M) with one purpose: receiving data from one process and serving it to other processes. Shared memory can be used in this case, since A and B each communicate with M, and each such pair of processes is alive simultaneously. A can write directly into the memory shared by M, and B can read it later. M needs to be created before A and B and must terminate after them.
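The mediator pattern can be sketched in one process tree: M creates and owns the shared region, forks A, waits for A to finish completely, and only then forks B, so the serial constraint is preserved while the region stays alive across the gap (the name `/mediator_shm` is an example):

```c
/* Sketch of the mediator idea: M owns the shared region's lifetime;
   A writes into it and exits; B starts only afterwards and reads it. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int mediator_demo(char *out, size_t outlen) {
    const char *name = "/mediator_shm";   /* example object name */
    size_t size = 4096;

    /* M: create and map the shared region before A and B exist */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)size) < 0) return -1;
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return -1;

    pid_t a = fork();
    if (a == 0) {                    /* A: produce data, then terminate */
        strcpy(p, "payload");
        _exit(0);
    }
    waitpid(a, NULL, 0);             /* A is fully finished before B starts */

    pid_t b = fork();
    if (b == 0)                      /* B: consume the data A left behind */
        _exit(strcmp(p, "payload") == 0 ? 0 : 1);
    int status;
    waitpid(b, &status, 0);

    strncpy(out, p, outlen - 1);
    out[outlen - 1] = '\0';

    munmap(p, size);
    close(fd);
    shm_unlink(name);                /* M tears the region down last */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

In a real deployment A and B would be separate executables that `shm_open` the same name while M keeps the object alive; the fork-based version above is just the smallest self-contained demonstration of the lifetime ordering.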
