Poor memcpy performance

Question

I am trying to optimize some code for speed, and it's spending a lot of time doing memcpys. I decided to write a simple test program to measure memcpy on its own to see how fast my memory transfers are, and they seem very slow to me. I am wondering what might cause this. Here is my test code:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>

#define MEMBYTES 1000000000

int main() {
  clock_t begin, end;
  double time_spent[2];
  int i;

  // Allocate memory

  float *src = malloc(MEMBYTES);
  float *dst = malloc(MEMBYTES);


  // Fill the src array with some numbers
  begin = clock();
  for(i=0;i<250000000;i++)
    src[i]=(float) i;
  end = clock();
  time_spent[0] = (double)(end - begin) / CLOCKS_PER_SEC;


  // Do the memcpy
  begin = clock();
  memcpy(dst, src, MEMBYTES);
  end = clock();
  time_spent[1] = (double)(end - begin) / CLOCKS_PER_SEC;

  //Print results
  printf("Time spent in fill: %1.10f\n", time_spent[0]);
  printf("Time spent in memcpy: %1.10f\n", time_spent[1]);
  printf("dst[200]: %f\n", dst[400]);
  printf("dst[200000000]: %f\n", dst[200000000]);

  //Free memory
  free(src);
  free(dst);
}

/*

  gcc -O3 -o mct memcpy_test.c

*/

When I run this, I get the following output:

Time spent in fill: 0.4263950000
Time spent in memcpy: 0.6350150000
dst[200]: 400.000000
dst[200000000]: 200000000.000000

I think the theoretical memory bandwidth for modern machines is tens of GB/s, or perhaps over 100 GB/s. I know that in practice one cannot expect to hit the theoretical limits, and that large memory transfers can be slow, but I have seen people reporting measured speeds of ~20 GB/s for large transfers (e.g. here). My results suggest I am getting 3.14 GB/s (edit: I originally had 1.57, but stark pointed out in a comment that I need to count both read and write). I am wondering if anyone has ideas that might help, or ideas about why the performance I am seeing is so low.

My machine has two CPUs with 12 physical cores each (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz). There is 192 GB of RAM (I believe it's 12x16GB DDR4-2666). The OS is Ubuntu 16.04.6 LTS.

My compiler is: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

Update

Thanks to all the valuable feedback, I am now using a threaded implementation and getting much better performance. Thank you!

I had tried threading before posting, with poor results (I thought), but as pointed out below I should have ensured I was measuring wall time. Now my results with 24 threads are as follows:

Time spent in fill: 0.4229530000
Time spent in memcpy (clock): 1.2897100000
Time spent in memcpy (gettimeofday): 0.0589750000

I am also using asmlib's A_memcpy with a large SetMemcpyCacheLimit value.

Answer 1

Saturating RAM is not as simple as it seems.

First of all, here is the apparent throughput we can compute from the provided numbers:

Fill: 1 / 0.4263950000 = 2.34 GB/s (1 GB is written);
Memcpy: 2 / 0.6350150000 = 3.15 GB/s (1 GB is read and 1 GB is written).

The thing is that the pages allocated by malloc are not mapped in physical memory on Linux systems. Indeed, malloc reserves some space in virtual memory, but the pages are only mapped in physical memory when a first touch is performed, causing expensive page faults. In your benchmark, src is first touched by the fill loop and dst is first touched inside the timed memcpy, so both measurements include the cost of these page faults. AFAIK, the only ways to speed up this process are to use multiple cores or to prefill the buffers and reuse them later.

Additionally, due to architectural limitations (i.e. latency), one core of a Xeon processor cannot saturate the RAM. Again, the only way to fix that is to use multiple cores.

If you try to use multiple cores, the result provided by the benchmark will be surprising, since clock does not measure the wall-clock time but the CPU time (which is the sum of the time spent in all threads). You need to use another function. In C, you can use gettimeofday (which is not perfect, as it is not monotonic, but certainly good enough for your benchmark; related post: How can I measure CPU time and wall clock time on both Linux/Windows?). In C++, you should use std::chrono::steady_clock (which is monotonic, as opposed to std::chrono::system_clock).

In addition, the write-allocate cache policy on x86-64 platforms forces cache lines to be read when they are written. This means that to write 1 GB, you actually need to read 1 GB as well. That being said, x86-64 processors provide non-temporal store instructions that do not cause this issue (assuming your array is aligned properly and big enough). Compilers can use them, but GCC and Clang generally do not. memcpy is already optimized to use non-temporal stores on most machines. For more information, please read How do non temporal instructions work?.

Finally, you can parallelize the benchmark easily using OpenMP, with simple #pragma omp parallel for directives on loops. Note that it also provides a user-friendly function for computing the wall-clock time correctly: omp_get_wtime. For the memcpy, the best approach is certainly to write a loop doing memcpy by (relatively big) chunks in parallel.

For more information about this subject, I advise you to read the famous document What Every Programmer Should Know About Memory. Since the document is a bit old, you can check the updated information about it here. The document also describes additional important things that explain why you may still not succeed in saturating the RAM with the above information. One critical topic is NUMA.
