
Can I use a memory-mapped file to remove or reduce disk IO time in a data-generating/processing workflow?

I have two programs. The first program (let's call it A) creates a huge chunk of data and saves it to disk; the second program (let's call it B) reads the data from disk and performs some processing on it. The old workflow is: run program A, which saves the data to disk, then run program B, which loads the data from disk and processes it. This is very time-consuming, since large data has to go through disk IO twice.

One trivial way to solve this problem is to simply merge the two programs. However, I do NOT want to do this (imagine that, for a single dataset, we want multiple data processing programs running in parallel on the same node, which makes it necessary to keep the two programs separate). I was told there is a technique called memory-mapped files, which allows multiple processes to communicate and share memory. I found some reference material at https://man7.org/linux/man-pages/man3/shm_unlink.3.html .

However, in the example shown there, the execution of the two programs (processes) overlaps, and the two processes communicate with each other in a "bouncing" fashion. In my case, I am not allowed to have such a communication pattern. For some reason I have to make sure that program B is executed only after program A has finished (a serial workflow). I just wonder whether mmap can still be used in my case? I know it seems weird, since at some point there would be memory allocated by program A while no program is running (between A and B), which might lead to a memory leak, but if this optimization is possible, it would be a huge improvement. Thanks!

Memory-mapped files and shared memory are two different concepts.

The former enables you to map a file into memory, so that reads from the memory region read the file and writes to the memory region write into the file. This kind of operation is very useful for abstracting IO accesses as basic memory reads/writes. It is especially useful for big-data applications (or simply for reusing in-memory code to operate on files directly).
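A minimal sketch of this in C, using POSIX `mmap` (the path `/tmp/mmap_demo.bin` and the 4 KiB size are arbitrary examples):

```c
/* Sketch: map a file into memory so that ordinary loads/stores
   become file reads/writes. Error handling is kept minimal. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `msg` into the file at `path` through a shared mapping,
   then read it back through the same mapping into `out`. */
int roundtrip(const char *path, const char *msg, char *out, size_t outlen) {
    size_t size = 4096;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)size) < 0) { close(fd); return -1; }

    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    strncpy(p, msg, size - 1);      /* a plain memory write updates the file */
    strncpy(out, p, outlen - 1);    /* a plain memory read sees file content */
    out[outlen - 1] = '\0';

    munmap(p, size);
    close(fd);
    return 0;
}
```

With `MAP_SHARED`, the kernel writes the modified pages back to the file, so no explicit `read`/`write` calls are needed.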

The latter is typically used by multiple running programs to communicate with each other while being in different processes. For example, browsers like Chrome/Chromium use it to communicate between tabs, which are separate processes (for the sake of security). It is also used in HPC for fast MPI communication between processes running on the same computing node.
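A sketch of the POSIX shared-memory API mentioned in the question's man-page link: a named object is created with `shm_open` and persists until `shm_unlink`. Any process that opens the same name while the object exists sees the same bytes; here, for brevity, both the "creator" and "attacher" sides live in one function (the name `/demo_shm` is an example; older glibc may require linking with `-lrt`):

```c
/* Sketch: create a named POSIX shared-memory object, write through
   one mapping, and read the same bytes back through a second,
   name-based mapping, as a separate process would. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int shm_demo(char *out, size_t outlen) {
    const char *name = "/demo_shm";   /* example object name */
    size_t size = 4096;

    /* creator side: make the object, size it, write into it */
    int fd1 = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd1 < 0) return -1;
    if (ftruncate(fd1, (off_t)size) < 0) return -1;
    char *w = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd1, 0);
    if (w == MAP_FAILED) return -1;
    strcpy(w, "shared bytes");

    /* attach side: opening the same name yields the same memory */
    int fd2 = shm_open(name, O_RDONLY, 0);
    if (fd2 < 0) return -1;
    char *r = mmap(NULL, size, PROT_READ, MAP_SHARED, fd2, 0);
    if (r == MAP_FAILED) return -1;
    strncpy(out, r, outlen - 1);
    out[outlen - 1] = '\0';

    munmap(w, size); munmap(r, size);
    close(fd1); close(fd2);
    shm_unlink(name);   /* object is destroyed once unlinked and unmapped */
    return 0;
}
```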

Linux also enables you to use pipes, so that one process can send data to another. The pipe is closed when the process emitting data ends. This is useful for dataflow-based processing (e.g. text filtering using grep).
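A minimal sketch of the pipe mechanism with `pipe` and `fork`: the child streams bytes into the write end, and the parent reads until the child closes it (which is what signals end-of-data downstream):

```c
/* Sketch: a child process sends data to its parent through a pipe.
   Closing the write end is what lets the reader see EOF. */
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int pipe_demo(char *out, size_t outlen) {
    int fds[2];                 /* fds[0]: read end, fds[1]: write end */
    if (pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid == 0) {                        /* child: the writer */
        close(fds[0]);
        const char *msg = "data";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);                     /* signals EOF to the reader */
        _exit(0);
    }

    close(fds[1]);                         /* parent: the reader */
    ssize_t n = read(fds[0], out, outlen - 1);
    out[n < 0 ? 0 : n] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

At the shell level this is the same mechanism behind `A | B`, which only works when both processes run concurrently.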

In your case, it seems that one process runs and the other starts only when the first process has finished. This means the data must be stored in a file; shared memory cannot be used here. That being said, this does not mean the file has to be stored on a storage device. On Linux, for example, you can keep files in RAM using ramfs, a filesystem stored in RAM. Note that files stored in such a filesystem are not saved anywhere when the machine is shut down (accidentally or deliberately), so it should not be used for critical data unless you can be sure the machine will not crash or be shut down. A ramfs has limited space and, AFAIK, configuring such a filesystem requires root privileges.
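The appealing part of this approach is that neither program changes: A's save and B's load stay ordinary file IO, and only the path is pointed at the RAM-backed mount (a mount point such as `/mnt/ramdisk` is an assumption here; the helper names below are illustrative):

```c
/* Sketch: program A's save and program B's load as plain file IO.
   Pointing `path` at a RAM-backed filesystem (e.g. a ramfs mount,
   an assumption) avoids the disk without changing this code. */
#include <stdio.h>
#include <string.h>

/* what program A would do at the end of its run */
int save(const char *path, const char *data) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t len = strlen(data);
    size_t n = fwrite(data, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}

/* what program B would do at the start of its run */
int load(const char *path, char *out, size_t outlen) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(out, 1, outlen - 1, f);
    out[n] = '\0';
    fclose(f);
    return 0;
}
```

Because the file outlives both processes, this matches the serial A-then-B constraint, unlike shared memory or pipes.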

An alternative solution is to create a mediator process (M) with one purpose: receiving data from one process and serving it to other processes. Shared memory can be used in this case, since A and B each communicate with M, and each such pair of processes is alive simultaneously. A can write directly into the memory shared by M, and B can read it later. M needs to be created before A and B and must terminate after them.
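The mediator pattern can be sketched in one process tree: M creates and owns the shared region, forks A, waits for A to finish completely, and only then forks B, so the serial constraint is preserved while the region stays alive across the gap (the name `/mediator_shm` is an example):

```c
/* Sketch of the mediator idea: M owns the shared region's lifetime;
   A writes into it and exits; B starts only afterwards and reads it. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int mediator_demo(char *out, size_t outlen) {
    const char *name = "/mediator_shm";   /* example object name */
    size_t size = 4096;

    /* M: create and map the shared region before A and B exist */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)size) < 0) return -1;
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return -1;

    pid_t a = fork();
    if (a == 0) {                    /* A: produce data, then terminate */
        strcpy(p, "payload");
        _exit(0);
    }
    waitpid(a, NULL, 0);             /* A is fully finished before B starts */

    pid_t b = fork();
    if (b == 0)                      /* B: consume the data A left behind */
        _exit(strcmp(p, "payload") == 0 ? 0 : 1);
    int status;
    waitpid(b, &status, 0);

    strncpy(out, p, outlen - 1);
    out[outlen - 1] = '\0';

    munmap(p, size);
    close(fd);
    shm_unlink(name);                /* M tears the region down last */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

In a real deployment A and B would be separate executables that `shm_open` the same name while M keeps the object alive; the fork-based version above is just the smallest self-contained demonstration of the lifetime ordering.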
