Measuring code execution times in C using the RDTSC instruction

I wrote a simple program to measure code execution time using the RDTSC instruction, but I don't know whether my result is correct or whether anything is wrong with my code. I have no idea how to verify it.

#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1024*4)

unsigned cycles_low, cycles_high, cycles_low1, cycles_high1;

static __inline__ unsigned long long rdtsc(void)
{
    __asm__ __volatile__ ("RDTSC\n\t"
            "mov %%edx, %0\n\t"
            "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
            "%rax", "rbx", "rcx", "rdx");
}

static __inline__ unsigned long long rdtsc1(void)
{
    __asm__ __volatile__ ("RDTSC\n\t"
            "mov %%edx, %0\n\t"
            "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
            "%rax", "rbx", "rcx", "rdx");
}

int main(int argc, char* argv[])
{
    uint64_t start, end;

    rdtsc();
    malloc(N);
    rdtsc1();

    start = ( ((uint64_t)cycles_high << 32) | cycles_low );
    end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );

    printf("cycles spent in allocating %d bytes of memory: %llu\n",N, end - start);

    return 0;
}

There are some (non-obvious) issues that you should keep in mind when using RDTSC to time things:

  1. The frequency of the clock that it counts may be unpredictable. On older hardware, the frequency may actually change in between two RDTSC instructions, and even on newer hardware where it is fixed, it can be difficult to tell what frequency it runs at.

  2. Since RDTSC has no inputs, the CPU itself may reorder the RDTSC instruction to come before the code you are trying to measure. Note that this is a different problem from the compiler reordering the code, which you've avoided with __volatile__. To avoid this, you have to execute a serializing instruction, which is an instruction that prevents the CPU from moving other instructions across it. You can use either CPUID before RDTSC, or RDTSCP, which is not fully serializing but does wait for all earlier instructions to execute before it reads the counter.
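As a rough sketch of point 2 (assuming x86-64 and GCC or Clang inline asm), the start timestamp can be preceded by CPUID and the end timestamp taken with RDTSCP. The helper names tsc_begin, tsc_end and tsc_elapsed are made up for illustration, and the LFENCE after RDTSCP is an addition that is not mentioned above; it is there to keep later instructions from starting before the counter has been read.

```c
#include <assert.h>
#include <stdint.h>

/* Start timestamp: CPUID is serializing, so earlier instructions have
   finished before RDTSC reads the counter. Assumes x86-64. */
static inline uint64_t tsc_begin(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)
                         : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End timestamp: RDTSCP waits for earlier instructions to execute; the
   LFENCE (an assumption of this sketch) holds back later ones. */
static inline uint64_t tsc_end(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtscp\n\t"
                         "lfence"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed ticks. The assert flags the case where the counter appears to run
   backwards, e.g. after migrating to a core with an unsynchronised TSC. */
static inline uint64_t tsc_elapsed(uint64_t begin, uint64_t end)
{
    assert(end >= begin);
    return end - begin;
}

/* Usage: uint64_t t0 = tsc_begin(); work(); uint64_t t1 = tsc_end();
          uint64_t ticks = tsc_elapsed(t0, t1); */
```

Both helpers carry a "memory" clobber so the compiler does not move memory accesses across them either.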

My suggestion: just use whatever high-resolution timer API your OS has. On Windows this is QueryPerformanceCounter, and on Unix you have gettimeofday or clock_gettime.
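For the Unix case, a minimal sketch with clock_gettime might look like the following. The choice of CLOCK_MONOTONIC and the helper names now_ns and time_malloc_ns are assumptions made for illustration, not something the question prescribes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Current time on the monotonic clock, in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;   /* avoids an unused-variable warning when NDEBUG is set */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time a single malloc(n), mirroring the program in the question. */
static uint64_t time_malloc_ns(size_t n)
{
    uint64_t start = now_ns();
    void *volatile p = malloc(n);   /* volatile keeps the call from being optimised away */
    uint64_t end = now_ns();
    free(p);
    return end - start;
}

/* Usage: printf("%llu ns\n", (unsigned long long)time_malloc_ns(4096)); */
```

A single malloc is short compared with the timer's own overhead, so in practice you would time many calls and average or take the minimum.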

Aside from that, your RDTSC code has a few structural issues. The return type is "unsigned long long", but nothing is actually returned. If you fix that, you can avoid storing the result in global variables and you can avoid having to write multiple versions.

Problems that may affect the results you get are:

  • on most modern 80x86 CPUs TSC measures a fixed frequency clock and not cycles, and therefore the same piece of code can have wildly different "cycles" depending on power management, the load on other logical CPUs in the same core (hyper-threading), the load on other cores (turbo-boost), CPU temperature (thermal throttling), etc.

  • nothing prevents the OS's scheduler from pre-empting your thread immediately after the first rdtsc(); causing the resulting "cycles spent allocating" to include the time the CPU spent executing any number of completely different processes.

  • on some computers the TSC on different CPUs isn't synchronised; and nothing prevents the OS from pre-empting your thread immediately after the first rdtsc(); and then running your thread on a completely different CPU (with a completely different TSC). In this case it's possible for end - start to be negative (like time is going backwards).

  • nothing prevents an IRQ (from hardware) from interrupting your code immediately after the first rdtsc(); causing the resulting "cycles spent allocating" to include the time the OS spent handling any number of IRQs.

  • it's impossible to prevent an SMI ("System Management Interrupt") from causing the CPU to enter SMM ("System Management Mode") and execute hidden firmware code after the first rdtsc(), causing the resulting "cycles spent allocating" to include the time the CPU spent executing firmware code.

  • some (old) CPUs have a bug where rdtsc gives dodgy results when the lower 32 bits overflow (eg when the TSC goes from 0x00000000FFFFFFFF to 0x0000000100000000 you can use rdtsc at the exact wrong time and get 0x0000000000000000).

  • nothing prevents an "out-of-order" modern CPU from rearranging the order that most instructions are executed in, including your rdtsc instructions.

  • your measurement includes the overhead of measuring (eg if rdtsc takes 5 cycles and your malloc() costs 20 cycles, then you report 25 cycles and not 20 cycles).

  • with or without a virtual machine; it's possible that the rdtsc instruction is virtualised (eg nothing other than common sense prevents a kernel from making rdtsc report how much free disk space there is or anything else it likes). Ideally rdtsc should be virtualised to prevent most of the problems mentioned above and/or to prevent timing side-channels (but it almost never is).

  • on extremely old CPUs (80486 and older) the TSC and the rdtsc instruction don't exist.
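Two of the problems above (interference from pre-emption, IRQs and SMIs, and the overhead of the measurement itself) can be reduced by repeating the measurement, keeping the smallest result, and subtracting the cost of an empty measurement. The following is only a sketch under those assumptions; min_ticks, net_ticks, do_nothing and spin are hypothetical names, and it uses the __rdtsc() intrinsic from x86intrin.h (GCC or Clang on x86) with no serialization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), x86 with GCC or Clang */

typedef void (*work_fn)(void *arg);

static void do_nothing(void *arg) { (void)arg; }

/* Example workload: a loop the compiler cannot remove. */
static void spin(void *arg)
{
    volatile unsigned sink = 0;
    unsigned n = *(unsigned *)arg;
    for (unsigned i = 0; i < n; ++i)
        sink += i;
}

/* Smallest TSC delta seen over `rounds` runs of fn(arg). Runs disturbed by
   the scheduler or by interrupts give larger deltas and are discarded. */
static uint64_t min_ticks(work_fn fn, void *arg, int rounds)
{
    assert(rounds > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < rounds; ++i) {
        uint64_t t0 = __rdtsc();
        fn(arg);
        uint64_t t1 = __rdtsc();
        if (t1 >= t0 && t1 - t0 < best)   /* skip apparent backwards jumps */
            best = t1 - t0;
    }
    return best;
}

/* Ticks attributable to fn: raw minimum minus the empty-measurement cost. */
static uint64_t net_ticks(work_fn fn, void *arg, int rounds)
{
    uint64_t overhead = min_ticks(do_nothing, NULL, rounds);
    uint64_t raw = min_ticks(fn, arg, rounds);
    return raw > overhead ? raw - overhead : 0;
}

/* Usage: unsigned n = 100000; uint64_t t = net_ticks(spin, &n, 1000); */
```

The result is in TSC ticks, which on most modern CPUs means a fixed-frequency clock rather than core cycles, as the first bullet points out.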


Note: I'm not an expert in GCC's inline assembly, but I strongly suspect your functions are buggy and that the compiler could choose to generate something like this:

    rdtsc
    mov %edx, %eax        ;Oops, trashed the low 32 bits
    mov %eax, %ebx

It should be possible to tell GCC that the value/s are returned in EDX:EAX and get rid of both mov instructions completely.
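A sketch of what that could look like with GCC or Clang on x86: the "=a" and "=d" constraints bind the two outputs to EAX and EDX, so no mov instructions are needed and a single function can return the value. The "memory" clobber and the ticks_for_malloc helper are additions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* RDTSC leaves the low half in EAX and the high half in EDX; "=a" and "=d"
   tell the compiler to read the outputs from exactly those registers. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* The measurement from the question, with one function and no globals. */
static uint64_t ticks_for_malloc(size_t n)
{
    uint64_t start = rdtsc();
    void *volatile p = malloc(n);   /* volatile keeps the call from being optimised away */
    uint64_t end = rdtsc();
    assert(p != NULL);
    free(p);
    return end - start;
}

/* Usage: printf("%llu\n", (unsigned long long)ticks_for_malloc(1024 * 4)); */
```

This version does nothing about out-of-order execution or the other problems listed above; it only addresses the register handling.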

Note: As I was writing this, I came up with a simpler/cleaner way to calibrate the TSC conversion factor. So, keep reading ...

If you wish, under Linux [some other OSes have something similar, e.g. BSD implements a portion of Linux's /proc], in /proc/cpuinfo, you'll see fields like this:

bogomips    :  5306.71
flags       :  blah blah2 constant_tsc
processor   :  blah

If you read this file, bogomips gives a rough indication of the CPU frequency [sort of], calculated during system boot. Prefer it over cpu MHz if your machine has speed step.

To use bogomips, note that on typical x86 Linux systems each processor entry reports its own value, which is usually about twice the clock frequency in MHz (so the 5306.71 above suggests roughly 2653 MHz). To stay in integer math, strip out the "." and treat the result as units of 10 kHz.

If you've got constant_tsc, the TSC will always run at this [maximum] frequency and will never vary, regardless of whether a particular core is slowed down by speed step.

If reading /proc/cpuinfo makes you squeamish, there is an alternate way to calibrate/determine the TSC frequency.

Do the following:

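struct timespec clk1, clk2;

uint64_t tsc1 = rdtsc();                  // any wrapper returning the 64-bit TSC value
clock_gettime(CLOCK_MONOTONIC, &clk1);    // CLOCK_MONOTONIC is an assumed choice

// delay for a while
for (int i = 1;  i < 1000000;  ++i)
    asm volatile ("" ::: "memory");

clock_gettime(CLOCK_MONOTONIC, &clk2);
uint64_t tsc2 = rdtsc();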

With these values you can compute the TSC frequency. Do the above a few thousand times. Take the minimum delta--this guards against those measurements where the OS time sliced you out.

Use the largest loop count that doesn't cause a time slice. Actually, you could replace the loop with a nanosleep with tv_sec = 0, tv_nsec = 500000 (500 us). nanosleep is much better than the equivalent usleep. Actually, you could nanosleep for 2-3 seconds if you wanted.

The clk2 - clk1 value, converted to fractional seconds, gives you the calibration for tsc2 - tsc1 and hence the conversion between TSC ticks and seconds.
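Put together, the calibration could be sketched roughly as below. This is one reading of the procedure, not a definitive implementation: it uses the nanosleep variant, assumes CLOCK_MONOTONIC and the __rdtsc() intrinsic from x86intrin.h, and interprets "take the minimum delta" as keeping the round with the smallest clock delta. The names seconds_between and estimate_tsc_hz are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(), x86 with GCC or Clang */

static double seconds_between(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Estimate TSC ticks per second. Each round sleeps for sleep_ns and records
   the TSC and clock deltas; the round with the smallest clock delta is taken
   to be the one least disturbed by the scheduler. */
static double estimate_tsc_hz(long sleep_ns, int rounds)
{
    assert(rounds > 0 && sleep_ns > 0 && sleep_ns < 1000000000L);
    double best_secs = 0.0, best_hz = 0.0;
    for (int i = 0; i < rounds; ++i) {
        struct timespec clk1, clk2, delay = { 0, sleep_ns };
        uint64_t tsc1 = __rdtsc();
        int rc1 = clock_gettime(CLOCK_MONOTONIC, &clk1);
        nanosleep(&delay, NULL);
        int rc2 = clock_gettime(CLOCK_MONOTONIC, &clk2);
        uint64_t tsc2 = __rdtsc();
        assert(rc1 == 0 && rc2 == 0);
        (void)rc1; (void)rc2;
        double secs = seconds_between(&clk1, &clk2);
        if (secs > 0.0 && tsc2 > tsc1 && (best_secs == 0.0 || secs < best_secs)) {
            best_secs = secs;
            best_hz = (double)(tsc2 - tsc1) / secs;
        }
    }
    return best_hz;   /* 0.0 if no round produced a usable sample */
}

/* Usage: double hz = estimate_tsc_hz(500000L, 1000);
          double secs = (double)(tsc_b - tsc_a) / hz; */
```

Longer sleeps reduce the relative error from the time spent between the rdtsc and clock_gettime calls, at the cost of a slower calibration.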

GCC has the "=A" constraint for 32-bit platforms. This creates the 64-bit result from eax and edx. Sadly, on 64-bit platforms, it just means a single register (either rax or rdx), which is no help.

Instead, and much better, you could use the "__builtin_ia32_rdtsc()" intrinsic which returns a 64-bit unsigned integer directly. Similarly for rdtscp (which also returns the IA32_TSC_AUX value, which the OS typically sets to identify the current core). See the gcc manual. These do emit slightly better code than doing it by hand with inline asm and are portable between 32 and 64 bits.
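A minimal sketch of the builtin approach, assuming GCC or Clang targeting x86. The wrapper names read_tsc and read_tsc_aux are hypothetical; only the two __builtin_ia32_* calls come from the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSC value, with no inline asm and no 32/64-bit special cases. */
static inline uint64_t read_tsc(void)
{
    return __builtin_ia32_rdtsc();
}

/* RDTSCP variant: also stores the IA32_TSC_AUX value in *aux. How that
   value encodes the core is up to the OS. */
static inline uint64_t read_tsc_aux(unsigned int *aux)
{
    assert(aux != NULL);
    return __builtin_ia32_rdtscp(aux);
}

/* Usage: unsigned int aux;
          uint64_t start = read_tsc();
          ... code to measure ...
          uint64_t end = read_tsc_aux(&aux); */
```

If the aux values from two RDTSCP reads differ, the thread has probably moved to another core between them and the difference between the two timestamps should be treated with suspicion.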

If "constant_tsc" is set in the /proc/cpuinfo flags, the TSC runs at a constant rate regardless of any CPU frequency scaling. If "nonstop_tsc" is set, the TSC continues to run in C (sleep) states. If both are set, the counters "should" also be synchronised across cores (at least on recent CPUs, Core i7 or later). I'm not too sure about the last, perhaps someone could correct me?
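Checking for these flags from a program could be sketched as follows, assuming a Linux-style /proc/cpuinfo with a "flags" line of space-separated words. The function names line_has_flag and cpuinfo_has_flag are made up, and only the first "flags" line is examined.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole space-separated word in `line`. */
static int line_has_flag(const char *line, const char *flag)
{
    assert(line != NULL && flag != NULL && flag[0] != '\0');
    size_t len = strlen(flag);
    for (const char *p = strstr(line, flag); p != NULL; p = strstr(p + 1, flag)) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}

/* Return 1 if the first "flags" line of `path` lists `flag`, 0 if it does
   not, and -1 if the file cannot be opened or has no "flags" line. */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[8192];   /* assumes the flags line fits in 8 KiB */
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int result = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            const char *colon = strchr(line, ':');
            result = colon ? line_has_flag(colon + 1, flag) : 0;
            break;
        }
    }
    fclose(f);
    return result;
}

/* Usage: if (cpuinfo_has_flag("/proc/cpuinfo", "constant_tsc") == 1 &&
              cpuinfo_has_flag("/proc/cpuinfo", "nonstop_tsc") == 1) { ... } */
```

Matching whole words matters here because "tsc" on its own is also a flag and is a substring of both "constant_tsc" and "nonstop_tsc".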
