
Measure the CPU cycles of C++ code

My goal is to measure the effect of the different cache levels using a simple program. I'm following this article, specifically pages 20 and 21: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

I'm working on 64-bit Linux. The L1d cache is 32K, L2 is 256K, and L3 is 25M.

This is my code (I compile this code with g++ with no flags):

#include <iostream>

// ***********************************
// This is for measuring CPU clocks
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif
// ***********************************


static const int ARRAY_SIZE = 100;

struct MyStruct {
    struct MyStruct *n;
};

int main() {
    MyStruct myS[ARRAY_SIZE];
    unsigned long long cpu_start, cpu_finish;

    // Initialize the array of structs, each element pointing to the next.
    // MyStruct has no padding member, so there is nothing else to fill in.
    for (int i = 0; i < ARRAY_SIZE - 1; i++)
        myS[i].n = &myS[i + 1];
    myS[ARRAY_SIZE - 1].n = NULL;   // the last one

    // Filling the cache
    MyStruct *current = &myS[0];
    while ((current = current->n) != NULL)
        ;

    // Sequential access
    current = &myS[0];

    // For CPU usage in terms of clocks (ticks)
    cpu_start = rdtsc();

    while ((current = current->n) != NULL)
        ;

    cpu_finish = rdtsc();

    unsigned long long avg_cpu_clocks = (cpu_finish - cpu_start) / ARRAY_SIZE;

    std::cout << "Avg CPU Clocks:   " << avg_cpu_clocks << std::endl;
    return 0;
}

I have two problems:

1- I varied ARRAY_SIZE from 1 to 1,000,000 (so, with 8-byte elements, the array ranges from 8 B to about 8 MB), but the average CPU clock count is always 10.

According to that PDF (figure 3-10 on page 21), I would have expected to get 3-5 clocks when the array can fit entirely into L1, and get higher numbers (9 cycles) when it exceeds L1's size.
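To make that expectation concrete, here is a rough sketch of how many list elements fit in each cache level. It assumes the cache sizes above are KiB/MiB and that each element is a single 8-byte pointer; the names Node, elements_that_fit, L1D_BYTES, L2_BYTES and L3_BYTES are made up for this illustration.

```cpp
#include <cstddef>

// One list element as in the question: a single pointer, no padding.
struct Node {
    Node *n;
};

// Number of whole elements of element_bytes that fit in cache_bytes.
constexpr std::size_t elements_that_fit(std::size_t cache_bytes,
                                        std::size_t element_bytes = sizeof(Node)) {
    return cache_bytes / element_bytes;
}

// Cache sizes from the question, read as KiB / MiB (an assumption).
constexpr std::size_t L1D_BYTES = 32u * 1024;
constexpr std::size_t L2_BYTES  = 256u * 1024;
constexpr std::size_t L3_BYTES  = 25u * 1024 * 1024;
```

Under those assumptions L1d holds 4096 elements, L2 holds 32768, and L3 holds 3276800, so ARRAY_SIZE = 100 sits well inside L1d, while 1,000,000 elements no longer fit in L2 but still fit in L3.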

2- If I increase ARRAY_SIZE beyond 1,000,000, I get a segmentation fault (core dumped), which is due to stack overflow. My question is whether using dynamic allocation ( MyStruct *myS = new MyStruct[ARRAY_SIZE] ) incurs any performance penalty.
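On the second point, a minimal sketch of the heap-allocated variant, using std::vector rather than a raw new[] (a substitution made here for safety, not something from the original code). The allocation happens once, before any timing starts, so it should not affect the measured loop itself; the helper names Node, make_chain and walk are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    Node *n;
};

// Link each element to its successor; the last element ends the chain.
// The vector's storage lives on the heap, so its size is not limited
// by the stack size.
std::vector<Node> make_chain(std::size_t count) {
    std::vector<Node> nodes(count);          // value-initialized: every n is null
    for (std::size_t i = 0; i + 1 < count; ++i)
        nodes[i].n = &nodes[i + 1];
    return nodes;                            // moving keeps element addresses valid
}

// Follow the chain from head and return the number of nodes visited,
// which is also the number of pointer loads performed.
std::size_t walk(const Node *head) {
    std::size_t hops = 0;
    for (const Node *p = head; p != nullptr; p = p->n)
        ++hops;
    return hops;
}
```

For example, make_chain(2000000) followed by walk(chain.data()) visits all two million nodes without touching the stack limit.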

Regarding "This is my code (I compile this code with g++ with no flags)":

If you don't pass -O3, then while ((current = current->n) != NULL) is compiled into multiple memory accesses per iteration, not a single load instruction. With -O3, the loop compiles to:

.L3:
mov     rax, QWORD PTR [rax]
test    rax, rax
jne     .L3

This runs at about 4 cycles per iteration when the array fits in L1, which is what you are expecting.

Note that you can use the __rdtsc compiler intrinsic instead of inline assembly. See: Get CPU cycle count?
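As a rough sketch of what that could look like with GCC or Clang on x86, the version below reads the counter through __rdtsc from <x86intrin.h>. The Measurement struct and the walk and time_walk helpers are hypothetical names for this illustration, and the chain is built on the heap so large sizes do not overflow the stack.

```cpp
#include <cstddef>
#include <vector>
#include <x86intrin.h>   // __rdtsc on GCC and Clang (x86 only)

struct Node {
    Node *n;
};

struct Measurement {
    std::size_t warmup_loads;   // pointer loads in the untimed warm-up walk
    std::size_t loads;          // pointer loads in the timed walk
    unsigned long long ticks;   // TSC ticks elapsed during the timed walk
};

// Follow the chain and count the nodes visited (one pointer load each).
static std::size_t walk(const Node *p) {
    std::size_t hops = 0;
    for (; p != nullptr; p = p->n)
        ++hops;
    return hops;
}

// Build a sequential chain of `count` nodes, walk it once to warm the
// caches, then time a second walk with __rdtsc.
Measurement time_walk(std::size_t count) {
    if (count == 0)
        return Measurement{0, 0, 0};

    std::vector<Node> nodes(count);          // every n starts out null
    for (std::size_t i = 0; i + 1 < count; ++i)
        nodes[i].n = &nodes[i + 1];

    // Reading the head through a volatile pointer keeps the compiler from
    // merging the two walks or folding either of them away.
    Node *volatile head = nodes.data();

    std::size_t warm = walk(head);

    unsigned long long start = __rdtsc();
    std::size_t loads = walk(head);
    unsigned long long finish = __rdtsc();

    return Measurement{warm, loads, finish - start};
}
```

Dividing ticks by loads gives the average per load. Keep in mind that rdtsc is not a serializing instruction and that on modern x86 CPUs the TSC usually counts at a fixed reference rate rather than the current core clock, so short runs are noisy and the result is only approximately "cycles".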
