
Copy from an uninitialized buffer is much faster than from an initialized buffer

I was tasked with developing test software to generate 100 Gbps of traffic over a single TCP socket on Linux (x86-64, kernel 4.15), on a machine with 32 GB of RAM.

I developed something like the following code (some sanity checks removed for simplicity) to run over a pair of veth interfaces (one of them in a different netns).

According to bmon, an open-source bandwidth monitor, it generates about 60 Gbps on my PC. To my surprise, if I remove the statement memset(buff, 0, size); I get about 94 Gbps. That is very puzzling.

#define CHUNK_SIZE 0x200000  // 2 MB per send(); assumed value, matching the original chunkSize initializer

void test(int sock) {
    int size = 500 * 0x100000;  // 500 MB buffer
    char *buff = malloc(size);
    //optional
    memset(buff, 0, size);
    int offset = 0;
    int chunkSize;
    while (1) {
        offset = 0;
        while (offset < size) {
            chunkSize = size - offset;
            if (chunkSize > CHUNK_SIZE) chunkSize = CHUNK_SIZE;
            send(sock, &buff[offset], chunkSize, 0);
            offset += chunkSize;
        }
    }
}

I did some experiments replacing memset(buff, 0, size); with the following (initializing only a portion of buff):

memset(buff, 0, size * ratio);

If ratio is 0, the throughput is highest at around 94 Gbps; as ratio goes up to 100% (1.0), the throughput drops to around 60 Gbps. If ratio is 0.5 (50%), the throughput is about 72 Gbps.

Appreciate any light you can shed on this.

Edit 1. Here is a relatively complete program that shows the effect: copying from an initialized buffer appears to be slower.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

int size = 500 * 0x100000;  // 500 MB source buffer
char buf[0x200000];         // 2 MB copy destination

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

void test1(int init) {
    char *p = malloc(size);
    int offset = 0;
    if (init) memset(p, 0, size);
    double startTs = getTS();
    for (int i = 0; i < 100; i++) {
        offset = 0;
        while (offset < size) {
            memcpy(&buf[0], p+offset, 0x200000);
            offset += 0x200000;
        }
    }
    printf("took %f secs\n", getTS() - startTs);
}

int main(int argc, char *argv[]) {
    test1(argc > 1);
    return 0;
}

On my PC (Ubuntu 18.04, Linux 4.15, 32 GB RAM), without initialization it took 1.35 seconds (tried twice); with initialization it took 3.02 seconds.

Edit 2. I would love to get sendfile (thanks @marco-bonelli) to be as fast as sending from an all-zero buffer allocated with calloc. I think that is going to become a requirement for my task pretty soon.
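
For reference, here is a minimal sketch of what a sendfile-based sender might look like (illustrative only; it assumes the payload lives in a regular file and omits most error handling):

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sendfile.h>

// Sketch: stream a regular file to the socket without copying the data
// through a user-space buffer; the kernel sends straight from the page cache.
void send_whole_file(int sock, const char *path) {
    int in_fd = open(path, O_RDONLY);
    struct stat st;
    fstat(in_fd, &st);
    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sock, in_fd, &offset, st.st_size - offset);
        if (n <= 0) break;  // error or nothing left; real code should check errno
    }
    close(in_fd);
}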

I have been running various tests to investigate this surprising result.

I wrote the test program below, which combines various operations in the init phase and in the loop:

#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

int alloc_size = 500 * 0x100000;  // 500 MB
char copy_buf[0x200000];    // 2MB

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

// set a word on each page of a memory area
void memtouch(void *buf, size_t size) {
    uint64_t *p = buf;
    size /= sizeof(*p);
    for (size_t i = 0; i < size; i += 4096 / sizeof(*p))
        p[i] = 0;
}

// compute the sum of words on a memory area
uint64_t sum64(const void *buf, size_t size) {
    uint64_t sum = 0;
    const uint64_t *p = buf;
    size /= sizeof(*p);
    for (size_t i = 0; i < size; i++)
        sum += p[i];
    return sum;
}

void test1(int n, int init, int run) {
    int size = alloc_size;
    char msg[80];
    int pos = 0;
    double times[n+1];
    uint64_t sum = 0;
    double initTS = getTS();
    char *p = malloc(size);

    pos = snprintf(msg + pos, sizeof msg - pos, "malloc");
    if (init > 0) {
        memset(p, init - 1, size);
        pos += snprintf(msg + pos, sizeof msg - pos, "+memset%.0d", init - 1);
    } else
    if (init == -1) {
        memtouch(p, size);
        pos += snprintf(msg + pos, sizeof msg - pos, "+memtouch");
    } else
    if (init == -2) {
        sum = sum64(p, size);
        pos += snprintf(msg + pos, sizeof msg - pos, "+sum64");
    } else {
        /* leave p uninitialized */
    }
    pos += snprintf(msg + pos, sizeof msg - pos, "+rep(%d, ", n);
    if (run > 0) {
        pos += snprintf(msg + pos, sizeof msg - pos, "memset%.0d)", run - 1);
    } else
    if (run < 0) {
        pos += snprintf(msg + pos, sizeof msg - pos, "sum64)");
    } else {
        pos += snprintf(msg + pos, sizeof msg - pos, "memcpy)");
    }
    double startTS = getTS();
    for (int i = 0; i < n; i++) {
        if (run > 0) {
            memset(p, run - 1, size);
        } else
        if (run < 0) {
            sum = sum64(p, size);
        } else {
            int offset = 0;
            while (offset < size) {
                memcpy(copy_buf, p + offset, 0x200000);
                offset += 0x200000;
            }
        }
        times[i] = getTS();
    }
    double firstTS = times[0] - startTS;
    printf("%f + %f", startTS - initTS, firstTS);
    if (n > 2) {
        double avgTS = (times[n - 2] - times[0]) / (n - 2);
        printf(" / %f", avgTS);
    }
    if (n > 1) {
        double lastTS = times[n - 1] - times[n - 2];
        printf(" / %f", lastTS);
    }
    printf(" secs  %s", msg);
    if (sum != 0) {
        printf("  sum=%016llx", (unsigned long long)sum);
    }
    printf("\n");
    free(p);
}

int main(int argc, char *argv[]) {
    int n = 4;
    if (argc < 2) {
        test1(n, 0, 0);
        test1(n, 0, 1);
        test1(n, 0, -1);
        test1(n, 1, 0);
        test1(n, 1, 1);
        test1(n, 1, -1);
        test1(n, 2, 0);
        test1(n, 2, 1);
        test1(n, 2, -1);
        test1(n, -1, 0);
        test1(n, -1, 1);
        test1(n, -1, -1);
        test1(n, -2, 0);
        test1(n, -2, 1);
        test1(n, -2, -1);
    } else {
        test1(argc > 1 ? strtol(argv[1], NULL, 0) : n,
              argc > 2 ? strtol(argv[2], NULL, 0) : 0,
              argc > 3 ? strtol(argv[3], NULL, 0) : 0);
    }
    return 0;
}

Running it on an old Linux box (Debian, kernel 3.16.0-11-amd64), I got these timings:

The columns are:

  • init phase
  • first iteration of the loop
  • average of the second to penultimate iterations
  • last iteration of the loop
  • sequence of operations
0.000071 + 0.242601 / 0.113761 / 0.113711 secs  malloc+rep(4, memcpy)
0.000032 + 0.349896 / 0.125809 / 0.125681 secs  malloc+rep(4, memset)
0.000032 + 0.190461 / 0.049150 / 0.049210 secs  malloc+rep(4, sum64)
0.350089 + 0.186691 / 0.186705 / 0.186548 secs  malloc+memset+rep(4, memcpy)
0.350078 + 0.125603 / 0.125632 / 0.125531 secs  malloc+memset+rep(4, memset)
0.349931 + 0.105991 / 0.105859 / 0.105788 secs  malloc+memset+rep(4, sum64)
0.349860 + 0.186950 / 0.187031 / 0.186494 secs  malloc+memset1+rep(4, memcpy)
0.349584 + 0.125537 / 0.125525 / 0.125535 secs  malloc+memset1+rep(4, memset)
0.349620 + 0.106026 / 0.106114 / 0.105756 secs  malloc+memset1+rep(4, sum64)  sum=ebebebebebe80000
0.339846 + 0.186593 / 0.186686 / 0.186498 secs  malloc+memtouch+rep(4, memcpy)
0.340156 + 0.125663 / 0.125716 / 0.125751 secs  malloc+memtouch+rep(4, memset)
0.340141 + 0.105861 / 0.105806 / 0.105869 secs  malloc+memtouch+rep(4, sum64)
0.190330 + 0.113774 / 0.113730 / 0.113754 secs  malloc+sum64+rep(4, memcpy)
0.190149 + 0.400483 / 0.125638 / 0.125624 secs  malloc+sum64+rep(4, memset)
0.190214 + 0.049136 / 0.049170 / 0.049149 secs  malloc+sum64+rep(4, sum64)

The timings are consistent with the observations of the OP. I found an explanation that is consistent with the observed timings:

if the first access to a page is a read, the timings are substantially better than if the first access is a write.

Here are some observations consistent with this explanation:

  • malloc() for a large 500 MB block just makes a system call to map the memory; it does not access this memory, and calloc would probably do exactly the same thing.
  • if you do not initialize this memory, it still gets mapped in RAM as zero-initialized pages for security reasons.
  • when you initialize the memory with memset, the first access to the whole block is a write access, and the timings for the loop are slower.
  • initializing the memory to all bytes 1 produces exactly the same timings.
  • if instead I use memtouch, writing just one word per page to zero, I get the same timings in the loop.
  • conversely, if instead of initializing the memory I compute a checksum, it comes out as zero (which is not guaranteed, but expected) and the timings in the loop are faster.
  • if no access is performed on the block, then the timings in the loop depend on the operation performed: memcpy and sum64 are faster because the first access is a read access, while memset is slower because the first access is a write access.

This seems specific to the Linux kernel; I do not observe the same difference on macOS, but that is on a different processor. This behavior might be specific to older Linux kernels and/or to the CPU and bridge architecture.

FINAL UPDATE: As commented by Peter Cordes, a read from a never-written anonymous page will get every page copy-on-write mapped to the same physical page of zeros, so you can get TLB misses but L1d cache hits when reading it. (This applies to the .bss and to memory from mmap(MAP_ANONYMOUS), which glibc calloc and malloc use for large allocations.) He wrote up some details, with perf results, in an experiment on "Why is iterating though `std::vector` faster than iterating though `std::array`?"

This explains why memcpy, or simply reading, from memory that has only been implicitly initialized to zero is faster than from memory that has been explicitly written: a good reason to use calloc() instead of malloc() + memset() for sparse data.
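
To illustrate that last point, here is a minimal sketch (illustrative only, not taken from the programs above) contrasting the two allocation strategies: calloc leaves the pages copy-on-write mapped to the shared zero page until they are written, while malloc + memset dirties every page up front.

#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t size = 500 * 0x100000;  // 500 MB, as in the test programs

    // calloc: pages stay mapped copy-on-write to the kernel's shared zero
    // page until first written, so a pure read pass over them is fast.
    char *fast = calloc(size, 1);

    // malloc + memset: every page is written, so each one gets its own
    // physical frame; later reads miss in cache far more often.
    char *slow = malloc(size);
    memset(slow, 0, size);

    free(fast);
    free(slow);
    return 0;
}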

Linux uses virtual memory and backs that virtual memory with physical memory (allocated as pages) only on demand. When your two example programs request "memory" for a buffer using malloc(), only virtual memory is actually allocated. Only when your program uses that "memory" (e.g. writes to the buffer) will Linux assign a physical page to map the virtual page. This permits Linux to over-commit memory in a manner very similar to filesystem allocation using sparse files.

When either of your programs initializes the allocated buffer using memset(), that forces physical pages to be assigned to the corresponding virtual memory. Perhaps this results in some page swapping during the socket transfer or the buffer copying?
But when the memory buffer has not been initialized (and is not yet mapped to a physical page), could there be a page-fault optimization (for a read due to the I/O or copy operation) that simply accesses a special page (somewhat like reading from an unwritten part of a sparse file), without performing the page assignment?

Based on your question, there does seem to be some kind of optimization of the page fault. So let's hypothesize that reading virtual memory that has not been written does not trigger the allocation and mapping of physical memory.


To test this hypothesis, we can use the top utility to obtain the virtual versus physical memory usage of your second program. The man page for top describes virtual memory usage as the total of "everything in-use and/or reserved (all quadrants)." Resident memory usage is "anything occupying physical memory which, beginning with Linux-4.5, is the sum of the following three fields:
RSan - quadrant 1 pages, which include any former quadrant 3 pages if modified
RSfd - quadrant 3 and quadrant 4 pages
RSsh - quadrant 2 pages"

When your second program initializes the allocated buffer, this slow version uses 518.564 MB of virtual memory and 515.172 MB of resident memory. The similar numbers indicate that the 500 MB malloc()'d buffer is backed by physical memory (as it should be).

When your second program does not initialize the allocated buffer, this fast version uses the same 518.564 MB of virtual memory but only 3.192 MB of resident memory. The disparity is a good indication that most, if not all, of the 500 MB malloc()'d buffer is not backed by physical memory.


So the hypothesis seems valid.
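
The same residency check can also be made from inside the program; here is a minimal sketch (illustrative only) that prints VmRSS from /proc/self/status, the same figure top reports as resident memory, around a read pass and a write pass over a freshly malloc'd 500 MB buffer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Print the VmRSS line from /proc/self/status.
static void print_rss(const char *label) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-14s %s", label, line);
    if (f) fclose(f);
}

int main(void) {
    size_t size = 500 * 0x100000;  // 500 MB, as in the test programs
    char *p = malloc(size);

    print_rss("after malloc:");

    // Read-only pass over every page of the untouched buffer.
    volatile unsigned long sum = 0;
    for (size_t i = 0; i < size; i += 4096)
        sum += (unsigned char)p[i];
    print_rss("after reads:");

    // Write pass: every page now gets its own physical frame.
    memset(p, 0, size);
    print_rss("after memset:");

    free(p);
    return 0;
}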

Peter Cordes' comment confirms that there is such a page-fault optimization: a "read from a never-written anonymous page will get every page copy-on-write mapped to the same physical page of zeros, so you can get TLB misses but L1d cache hits when reading it."

So your improved transfer rates and copy times appear to come from reduced page-allocation overhead in the virtual memory subsystem and from improved processor cache hits.
