Can the performance of mmap be improved for shared memory?

Question

I'm writing a small program for Wayland that uses software rendering and wl_shm for display. This requires that I pass a file descriptor for my screen buffer to the Wayland server, which then calls mmap() on it, i.e. the screen buffer must be shareable between processes.

In this program, startup latency is key. Currently, there is only one remaining bottleneck: the initial draw to the screen buffer, where the entire buffer is painted over. The code below shows a simplified version of this:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    /* Fullscreen buffers are around 10-30 MiB for common resolutions. */
    const size_t size = 2880 * 1800 * 4;
    int fd = memfd_create("shm", 0);
    if (fd < 0 || ftruncate(fd, size) < 0) {
        perror("memfd_create/ftruncate");
        return EXIT_FAILURE;
    }
    void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Ideally, we could just malloc, but this memory needs to be shared. */
    //void *pool = malloc(size);

    /* In reality this is a cairo_paint() call. */
    memset(pool, 0xCF, size);

    /* Subsequent paints (or memsets) after the first take negligible time. */
    return 0;
}

On my laptop, the memset() above takes around 21-28 ms. Switching to malloc()'ed memory drops this to 12 ms, but the problem is that the memory needs to be shared between processes. The behaviour is similar on my desktop: 7 ms for mmap(), 3 ms for malloc().

My question is: Is there something I'm missing that can improve the performance of shared memory on Linux? I've tried madvise() with MADV_WILLNEED and MADV_SEQUENTIAL, and using mlock(), but none of those made a difference. I've also thought about whether 2 MiB huge pages would help given the buffer sizes of around 10-30 MB, but those aren't usually available.

Edit: I've tried mmap() with MAP_ANONYMOUS | MAP_SHARED, which is just as slow as before. MAP_ANONYMOUS | MAP_PRIVATE results in the same speed as malloc(), however that defeats the purpose.
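The shared versus private comparison described in the edit can be reproduced with a small timing helper. This is only a sketch: the function name first_touch_ms is made up for illustration, and it times a memset() over a fresh anonymous mapping rather than the real cairo_paint() call.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/* Hypothetical helper: returns the time in milliseconds taken by the first
 * memset() over a fresh anonymous mapping of `size` bytes, or -1.0 if mmap()
 * fails. `shared` selects MAP_SHARED (non-zero) or MAP_PRIVATE (zero). */
static double first_touch_ms(size_t size, int shared)
{
    int flags = MAP_ANONYMOUS | (shared ? MAP_SHARED : MAP_PRIVATE);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return -1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memset(p, 0xCF, size); /* first touch: the pages are faulted in here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(p, size);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling first_touch_ms(2880 * 1800 * 4, 1) and first_touch_ms(2880 * 1800 * 4, 0) from main() and printing both values should show the gap; the absolute numbers will vary by machine and kernel configuration.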

Answer 1

The difference in performance between malloc() and mmap() seems to be due to the differing application of Transparent Hugepages.

By default on x86_64, the page size is 4KiB and the huge page size is 2MiB. Transparent Hugepages allows programs that don't know about hugepages to still use them, reducing page faults. However, this is only enabled by default for private, anonymous memory, which covers both malloc() and mmap() with MAP_ANONYMOUS | MAP_PRIVATE set, explaining why the performance of these is identical. For shared memory mappings, this is disabled, resulting in more page handling overhead (for the 10-30MiB buffers I need), and causing slowdowns.
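To get a feel for the scale of the difference, the page counts for the buffer from the question can be worked out directly. The helper name pages_needed is illustrative, and treating the page count as the number of faults is a simplifying assumption (the kernel may batch or split work differently).

```c
#include <stddef.h>

/* Number of pages of `page_size` bytes needed to cover `size` bytes,
 * rounding up for a partially used final page. */
static size_t pages_needed(size_t size, size_t page_size)
{
    return (size + page_size - 1) / page_size;
}
```

For the 2880 * 1800 * 4 = 20,736,000 byte buffer, this gives 5063 pages at 4KiB but only 10 pages at 2MiB, so the first paint has far fewer pages to fault in when hugepages are used.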

Hugepages can be enabled for shared memory mappings, as explained in the kernel docs page, via the /sys/kernel/mm/transparent_hugepage/shmem_enabled knob. This defaults to never, but setting it to always (or to advise, and adding the corresponding madvise(..., MADV_HUGEPAGE) call) allows memory mapped with MAP_SHARED to use hugepages, and the performance then matches malloc()'ed memory.
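A minimal sketch of the advise route, applied to the memfd setup from the question, might look like the following. The function name create_shared_pool is hypothetical, and the madvise() result is deliberately ignored because the call is only a hint that does nothing useful when shmem_enabled is never.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: creates a shareable buffer of `size` bytes backed by
 * a memfd and requests transparent hugepages for it. Returns the mapping, or
 * MAP_FAILED on error; *fd_out receives the descriptor to hand to the server. */
static void *create_shared_pool(size_t size, int *fd_out)
{
    int fd = memfd_create("shm", 0);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return MAP_FAILED;
    }
    void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    /* Only a hint: it matters when shmem_enabled is "advise". Any failure
     * (for example on a kernel without THP) is ignored on purpose. */
    (void)madvise(pool, size, MADV_HUGEPAGE);
    *fd_out = fd;
    return pool;
}
```

The madvise() call has to come before the first write to the buffer, since the pages are allocated on first touch.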

I'm unsure why the default is never for shared memory. While not very satisfactory, for now it seems the only solution is to use madvise(MADV_HUGEPAGE) to improve performance on any systems which happen to have shmem_enabled set to at least advise (or if it's enabled by default in future).
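To see which setting a given system is using, a program could read the knob at startup. The sketch below assumes shmem_enabled follows the same convention as the other transparent_hugepage files, where the active value is shown in square brackets (for example "always within_size advise [never] deny force"). Both function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Copies the bracketed word from `line` (e.g. "never" from
 * "always advise [never]") into `out`. Returns 0 on success, or -1 if no
 * bracketed word is found or it does not fit in `outlen` bytes. */
static int active_setting(const char *line, char *out, size_t outlen)
{
    const char *start = strchr(line, '[');
    const char *end = start ? strchr(start, ']') : NULL;
    if (!end)
        return -1;
    size_t n = (size_t)(end - start - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, start + 1, n);
    out[n] = '\0';
    return 0;
}

/* Reads the current shmem THP setting. Returns -1 if the knob cannot be
 * read, for example on kernels built without THP support. */
static int shmem_thp_setting(char *out, size_t outlen)
{
    char line[256];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? active_setting(line, out, outlen) : -1;
}
```

Since madvise(MADV_HUGEPAGE) is harmless when the setting is never, the check is mostly useful for diagnostics, such as logging why the first paint is slow on a particular machine.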

Solution 1
0 2022-08-08 21:01:38