
Why is memcpy with 16 bytes twice as fast as memcpy with 1 byte?

I'm doing some performance measurements. When looking at memcpy, I found a curious effect with small byte sizes. In particular, the fastest byte count to use for my system is 16 bytes. Both smaller and larger sizes get slower. Here's a screenshot of my results from a larger test program.

[Screenshot: time per memcpy vs. byte count, with a minimum at 16 bytes]

I've minimized a complete program to reproduce the effect just for 1 and 16 bytes (note this is MSVC code to suppress inlining and prevent the optimizer from nuking everything):

#include <chrono>
#include <cstdint>   // uint8_t
#include <cstring>   // std::memcpy
#include <iostream>

using dbl_ns = std::chrono::duration<double, std::nano>;

template<size_t n>
struct memcopy_perf {
   uint8_t m_source[n]{};
   uint8_t m_target[n]{};
   __declspec(noinline) auto f() -> void
   {
      constexpr int repeats = 50'000;
      const auto t0 = std::chrono::high_resolution_clock::now();
      for (int i = 0; i < repeats; ++i)
         std::memcpy(m_target, m_source, n);
      const auto t1 = std::chrono::high_resolution_clock::now();
      std::cout << "time per memcpy: " << dbl_ns(t1 - t0).count()/repeats << " ns\n";
   }
};

int main()
{
   memcopy_perf<1>{}.f();
   memcopy_perf<16>{}.f();

   return 0;
}

I would have hand-waved away a minimum at 8 bytes (perhaps because 8 bytes matches the 64-bit register size) or at 64 bytes (the cache line size). But I'm somewhat puzzled by the 16 bytes. The effect is reproducible on my system and not a fluke.

Notes: I'm aware of the intricacies of performance measurements. Yes, this is in release mode. Yes, I made sure things are not optimized away. Yes, I'm aware there are libraries for this. Yes, n is too low, etc. This is a minimal example. I checked the asm; it calls memcpy.

I assume that MSVC 19 is used since there is no information about the MSVC version yet.

The benchmark is biased, so it cannot actually tell which one is faster, especially at such small timings.

Indeed, here is the assembly code generated by MSVC 19 with /O2 between the two calls to std::chrono::high_resolution_clock::now:

        mov     cl, BYTE PTR [esi]
        add     esp, 4
        mov     eax, 50000                                ; 0000c350H
        npad    6

$LL4@f:
        sub     eax, 1            <--- This is what you measure !
        jne     SHORT $LL4@f

        lea     eax, DWORD PTR _t1$[esp+20]
        mov     BYTE PTR [esi+1], cl
        push    eax

One can see that you measure a nearly empty loop that is completely useless: more than half of the instructions are not the ones meant to be measured. The only useful instructions are certainly:

        mov     cl, BYTE PTR [esi]
        mov     BYTE PTR [esi+1], cl

Even assuming that the compiler could generate such code, the call to std::chrono::high_resolution_clock::now certainly takes several dozens of nanoseconds (since the usual solution to measure very small timings precisely is to use RDTSC and RDTSCP). This is far more than the time to execute such instructions. In fact, at such a granularity the notion of wall clock time vanishes. Instructions are typically executed in parallel in an out-of-order way and are also pipelined. At this scale one needs to consider the latency of each instruction, its reciprocal throughput, its dependencies, etc.
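
For illustration, here is a minimal sketch of what an RDTSC/RDTSCP-based measurement could look like with the MSVC intrinsics from <intrin.h> (this is an assumption about how one might do it, not code from the question; it only removes the timer overhead, the loop overhead still dominates a 1-byte copy):

#include <cstdint>
#include <cstring>
#include <intrin.h>   // __rdtsc, __rdtscp (MSVC intrinsics)
#include <iostream>

int main()
{
   alignas(16) uint8_t source[16]{};
   alignas(16) uint8_t target[16]{};

   constexpr int repeats = 50'000;
   unsigned int aux = 0;

   // Count CPU reference cycles instead of wall clock time.
   const uint64_t c0 = __rdtsc();
   for (int i = 0; i < repeats; ++i)
      std::memcpy(target, source, 16);
   const uint64_t c1 = __rdtscp(&aux);   // RDTSCP waits for earlier instructions to finish

   std::cout << "cycles per memcpy: "
             << static_cast<double>(c1 - c0) / repeats << "\n";
   return 0;
}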

An alternative solution is to reimplement the benchmark so the compiler cannot optimize this. But this is pretty hard to do since copying 1 byte is nearly free on modern x86 architectures (compared to the overhead of the related instructions needed to compute the addresses, loop, etc.).
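
As a rough sketch of that idea (an assumption, not code from this answer), one can launder the destination pointer and the size through noinline functions so the optimizer must keep a generic memcpy call:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: the compiler must treat the returned values as unknown,
// so it cannot specialize or remove the memcpy below (assuming no whole-program optimization).
__declspec(noinline) void* launder_ptr(void* p) { return p; }
__declspec(noinline) size_t launder_size(size_t n) { return n; }

void copy_loop(uint8_t* dst, const uint8_t* src, size_t n, int repeats)
{
   for (int i = 0; i < repeats; ++i)
      std::memcpy(launder_ptr(dst), src, launder_size(n));
}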

AFAIK, copying 1 byte on an AMD Zen2 processor has a reciprocal throughput of ~1 cycle (1 load + 1 store, scheduled across 2 load ports and 2 store ports) assuming the data is in the L1 cache. The latency to read/write a value in the L1 cache is 4-5 cycles, so the latency of the copy may be 8-10 cycles. At a clock of a few GHz that is only a handful of nanoseconds, far below the overhead of the clock calls. For more information about this architecture please check this.

The best way to know is to check the resulting assembly. For example, if the 16-byte version is aligned to 16 bytes, then it uses aligned load/store operations and becomes faster than the 4-byte / 1-byte versions.

Maybe, if it cannot vectorize (use a wide register) due to bad alignment, it falls back to copying with mov operations on memory addresses instead of through registers.
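
As a rough illustration of what that means (not the actual library code), a 16-byte memcpy typically boils down to a single 16-byte SSE2 load/store pair, e.g.:

#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// Sketch of what a compiler may emit for memcpy(dst, src, 16):
// one 16-byte load plus one 16-byte store (movdqu/movdqa or movups/movaps).
void copy16(uint8_t* dst, const uint8_t* src)
{
   const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
   _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}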

The memcpy implementation depends on many factors, such as:

  • OS: Windows 10, Linux, macOS, ...
  • Compiler: GCC 4.4, GCC 8.5, CL, cygwin-gcc, clang, ...
  • CPU architecture: ARM, ARM64, x86, x86_64, ...
  • ...

So, on different systems, memcpy may have different performance/behavior. Example: an SSE2, SSE3, AVX, or AVX-512 memcpy version may be implemented on one system but not on another.

In your case, I think it's mainly caused by memory alignment. And with a 1-byte copy, unaligned memory accesses always happen (more read/write cycles).
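
If you want to test that hypothesis with the benchmark above, one option (an assumption on my side, not something the question did) is to force the alignment of both buffers with alignas and compare the timings:

#include <cstddef>
#include <cstdint>

template<size_t n>
struct memcopy_perf_aligned {
   alignas(16) uint8_t m_source[n]{};   // force 16-byte alignment of both buffers
   alignas(16) uint8_t m_target[n]{};
   // f() stays the same as in the question
};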
