Linux Kernel - msync Locking Behavior

I am investigating an application that writes random data in fixed-size chunks (e.g. 4 KiB) to random locations in a large buffer file. I have several processes (not threads) doing that, and each process has its own buffer file assigned to it.

If I use mmap+msync to write and persist data to disk, I see a performance spike at 16 processes, and a performance drop with more processes (32).

If I use open+write+fsync, I do not see such a spike; instead, performance plateaus (and mmap is slower than open/write overall).
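
For context, a minimal sketch of the two write paths being compared is below; the file name, sizes, and omitted error handling are illustrative, not the actual application:

```c
/* Sketch of the two write paths: (1) mmap + memcpy + msync(MS_SYNC),
 * (2) pwrite + fsync.  Error handling omitted for brevity. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE (1UL << 30)   /* 1 GiB buffer file, preallocated */
#define CHUNK     4096UL        /* fixed-size 4 KiB chunks */

/* Path 1: copy into the shared mapping, then synchronously flush the range. */
static void write_mmap(uint8_t *map, const uint8_t *data, off_t off)
{
    memcpy(map + off, data, CHUNK);
    msync(map + off, CHUNK, MS_SYNC);
}

/* Path 2: positioned write into the file, then fsync. */
static void write_syscall(int fd, const uint8_t *data, off_t off)
{
    pwrite(fd, data, CHUNK, off);
    fsync(fd);   /* or fdatasync(fd) */
}

int main(void)
{
    int fd = open("buffer.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, FILE_SIZE);
    uint8_t *map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    uint8_t data[CHUNK];
    memset(data, 0xab, sizeof data);

    /* One random, chunk-aligned offset; the real workload loops over many. */
    off_t off = (off_t)(rand() % (FILE_SIZE / CHUNK)) * CHUNK;
    write_mmap(map, data, off);
    write_syscall(fd, data, off);

    munmap(map, FILE_SIZE);
    close(fd);
    return 0;
}
```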

I've read multiple times [1,2] that both mmap and msync can take locks. With VTune, I confirmed that we are indeed spinlocking, and that most of the time is spent in the clear_page_erms and xas_load functions.

However, when reading the source code for msync [3], I cannot tell whether these locks are global or per-file. The paper [2] states that the locks are on radix trees within the kernel and are per-file; however, since I have one file per process and still observe some spinlocks in the kernel, I believe some of the locks may be global.

Do you have an explanation for why we see such a spike at 16 processes with mmap, and any input on the locking behavior of msync?

Thank you!

Best, Maximilian

[1] https://kb.pmem.io/development/100000025-Why-msync-is-less-optimal-for-persistent-memory/

[2] Optimizing Memory-mapped I/O for Fast Storage Devices, Papagiannis et al., USENIX ATC '20

[3] https://elixir.bootlin.com/linux/latest/source/mm/msync.c

In the 4.19 kernel source, the current task keeps its own mm_struct, which contains a single semaphore (mmap_sem) used to arbitrate access to all of the memory regions being synced. All of the threads in a process acting on one of your buffer files will therefore take this semaphore, operate on some region(s) of the file, and release it.
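
For reference, the entry of the syscall in the 4.19 series looks roughly like this (abridged from mm/msync.c, with validation and error paths trimmed); the important point is that mmap_sem is per-mm, i.e. per-process, not global:

```c
/* Abridged from mm/msync.c (4.19 series); validation and error handling
 * trimmed.  The whole VMA walk runs under the calling process's mmap_sem. */
SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;

	/* ... validate flags and alignment, compute end ... */

	down_read(&mm->mmap_sem);	/* per-mm read-write semaphore */
	vma = find_vma(mm, start);
	for (;;) {
		/* ... visit each VMA covering [start, end),
		 *     breaking out once start >= end ... */
	}
	up_read(&mm->mmap_sem);
	/* ... */
}
```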

While I can't rationalise the exact number of 16 processes at which you hit your performance cliff, when you use mmap() you are clearly forcing entry into the msync(MS_SYNC) code path for VM_SHARED mappings. This invokes vfs_fsync_range() and guarantees that actual synchronous disk I/O will happen, which generally slows things down: it does not allow advantageous grouping of I/Os for economy and tends to maximise the actual time spent waiting for disk I/O to complete.
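
The branch inside that VMA-walk loop which MS_SYNC takes for a shared file mapping looks roughly like this (again abridged); note that it temporarily drops mmap_sem around a synchronous, ranged flush of the backing file:

```c
/* Abridged from the VMA-walk loop in mm/msync.c (4.19 series). */
if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
	get_file(file);
	up_read(&mm->mmap_sem);
	/* Synchronous write-out of the dirtied range of the backing file. */
	error = vfs_fsync_range(file, fstart, fend, 1);
	fput(file);
	if (error || start >= end)
		goto out;
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, start);
}
```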

To avoid this, ensure that each thread in your process manages a dedicated subset of the 4 KiB chunks in the buffer file, avoid mmap() on the buffer file, and schedule asynchronous I/O. As long as you avoid mmap() on the buffer file itself, each thread will be alone in writing (safely, if you design it well) to its own section of the file. You will therefore be able to issue all of your I/O asynchronously, which should allow better aggregation and significantly improve your application's performance, i.e. avoid that cliff at 16 processes or whatever the number ends up being. Obviously, you will still have to ensure that any thread writing to one of its chunks either completes that write or has not yet begun it (and drops any pending request(s) to do so) if a request for another write to the same chunk comes along.
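
A minimal sketch of that layout, assuming POSIX AIO (aio_write) as the asynchronous mechanism and an interleaved ownership rule so that no two threads ever touch the same 4 KiB chunk; the file name, sizes, polling loop, and omitted error handling are illustrative only:

```c
/* Sketch: per-thread chunk ownership plus POSIX AIO instead of mmap+msync.
 * Build with: cc -O2 -pthread sketch.c  (older glibc also needs -lrt). */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK    4096UL
#define NCHUNKS  (1UL << 18)            /* 1 GiB buffer file */
#define NTHREADS 4

static int fd;                          /* the process's buffer file */

static void *worker_main(void *arg)
{
    int tid = (int)(intptr_t)arg;
    unsigned int seed = (unsigned int)tid;
    uint8_t data[CHUNK];
    memset(data, 0xab, sizeof data);

    for (int i = 0; i < 1000; i++) {
        /* Pick a random chunk among those owned by this thread
         * (chunk_index % NTHREADS == tid), so writes never overlap. */
        unsigned long idx =
            ((unsigned long)rand_r(&seed) % (NCHUNKS / NTHREADS)) * NTHREADS + tid;

        struct aiocb cb = {
            .aio_fildes = fd,
            .aio_buf    = data,
            .aio_nbytes = CHUNK,
            .aio_offset = (off_t)(idx * CHUNK),
        };
        aio_write(&cb);                         /* queue the write */

        /* Real code would keep several requests in flight and reap them
         * in batches; polling one at a time is just for illustration. */
        while (aio_error(&cb) == EINPROGRESS)
            usleep(100);
        aio_return(&cb);
    }
    return NULL;
}

int main(void)
{
    fd = open("buffer.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, NCHUNKS * CHUNK);

    pthread_t th[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker_main, (void *)(intptr_t)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    /* Flush once at the end of the batch; pace fdatasync()/aio_fsync()
     * calls according to your actual durability requirements. */
    fdatasync(fd);
    close(fd);
    return 0;
}
```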
