C/C++ memcpy benchmark: measuring CPU and wall time

How can one benchmark memcpy? I wrote test code, but it finishes immediately (probably due to compiler optimization) and does not actually allocate memory:

void test(void)
{
 const uint32_t size = 4000'000'000;
 char a[size], b[size];
 printf("start\n");
 for(int i=0; i<10'000'000; i++)
     memcpy(b, a, size*sizeof(char));
 printf("end\n");
}// end of function

I want to know the cost of memcpy in terms of CPU time and in terms of wall time.

Here is the situation: I need to process incoming data (arriving over the network) at a high rate. If I do not process it fast enough, the network buffers overflow and I am disconnected from the data source (which happens quite frequently in my test code). I can see that the CPU usage of my process is quite low (10-15%), so there should be some operation that costs wall time without costing CPU time. So I want to estimate the contribution of memcpy operations to the wall time it takes to process one unit of data. The code is basically some computing and memory copy operations: there is no resource I need to wait for that could slow me down.

Thank you for your help!

[EDIT:]

Thank you very much for your comments, and sorry for having an example which is not C (C++ only); my priority was readability. Here is a new example of the code, which shows that memcpy is not free and consumes 100% of CPU time:

const uint32_t N = 1000'000'000;
char *a = new char[N], 
     *b = new char[N];
void test(void)
{
 for(uint32_t i=0; i<N; i++)
     a[i] = '7';

 printf("start\n");
 for(int i=0; i<100; i++)
     memcpy(b, a, N*sizeof(char));
 printf("end\n");
}// end of function

which makes me confused about why I have low CPU usage but do not process incoming data quickly enough.

The idea was to test if memory copy is done by directly copying data in RAM with small participation of the CPU (which is more likely to show up if the RAM chunks are large, so that the process is not dominated by CPU time).

No, memcpy on normal computers doesn't offload to a DMA engine / blitter chip and let the CPU do other things until that completes. The CPU itself does the copying, so as far as the OS is concerned memcpy is no different from any other instructions user-space could be running.

A C++ implementation on an embedded system or an Atari Mega ST could plausibly do that, letting the OS schedule another task or at least do some housekeeping, although only with very lightweight context switching, because it doesn't take very long at all to copy even a huge block of memory.


An easier way to find that out would be to single-step into the memcpy library function. (And yes, with your update gcc doesn't optimize away the memcpy.)

Other than that, testing a 4GiB memcpy isn't very representative for network packets. glibc memcpy on x86 uses a different strategy (NT stores) for very huge copies. And for example the Linux kernel's read / recv paths end up using copy_to_user, I assume, which uses a different memory-copy function: hopefully rep movsb on x86 CPUs with the ERMSB feature.

See Enhanced REP MOVSB for memcpy for a bunch of x86 memory / cache performance details.
