
C/C++ memcpy benchmark: measuring CPU and wall time

How can one benchmark memcpy? I wrote test code, but it finishes immediately (probably due to compiler optimization) and does not actually allocate memory:

void test(void)
{
    const uint32_t size = 4'000'000'000;
    char a[size], b[size];  // 8 GB of automatic storage -- far beyond any stack limit
    printf("start\n");
    for (int i = 0; i < 10'000'000; i++)
        memcpy(b, a, size * sizeof(char));
    printf("end\n");
}// end of function

I want to know the cost of memcpy in terms of CPU time and in terms of wall time.

Here is the situation: I need to process incoming (through network) data at high rate. If I do not process it fast enough, the network buffers get overfilled and I am disconnected from the data source (which happens in my test code quite frequently). I can see that the CPU usage of my process is quite low (10-15%) and so there should be some operation that costs time without costing CPU time. And so, I want to estimate the contribution of memcpy operations to the wall time it takes to process one unit of data. The code is basically some computing and memory copy operations: there is no resource, which I need to wait for, that could slow me down.

Thank you for your help!

[EDIT:]

Thank you very much for your comments. And sorry for having an example that is not valid C (it is C++ only); my priority was readability. Here is a new example, which shows that memcpy is not free and consumes 100% of CPU time:

const uint32_t N = 1'000'000'000;
char *a = new char[N],
     *b = new char[N];
void test(void)
{
    for (uint32_t i = 0; i < N; i++)
        a[i] = '7';

    printf("start\n");
    for (int i = 0; i < 100; i++)
        memcpy(b, a, N * sizeof(char));
    printf("end\n");
}// end of function

This leaves me confused about why I see low CPU usage yet still fail to process the incoming data quickly enough.

The idea was to test whether the memory copy is performed directly in RAM with little CPU participation (which would be easier to observe with large chunks of RAM, since then the process would not be dominated by CPU time).

No, memcpy on normal computers doesn't offload to a DMA engine / blitter chip and let the CPU do other things until that completes. The CPU itself does the copying, so as far as the OS is concerned memcpy is no different from any other instructions user-space could be running.

A C++ implementation on an embedded system or an Atari Mega ST could plausibly do that, letting the OS schedule another task or at least do some housekeeping, although only with very lightweight context switching, because even a huge block of memory doesn't take very long at all to copy.


An easier way to find that out would be to single-step into the memcpy library function. (And yes, with your update, gcc doesn't optimize away the memcpy.)

Other than that, testing a 4 GiB memcpy isn't very representative of network packets. glibc's memcpy on x86 uses a different strategy (NT stores) for very large copies. And for example the Linux kernel's read/recv paths end up using copy_to_user, I assume, which uses a different memory-copy function: hopefully rep movsb on x86 CPUs with the ERMSB feature.

See Enhanced REP MOVSB for memcpy for a bunch of x86 memory / cache performance details.
