
Why does my CPU suddenly work twice as fast?

I've been trying to use a simple profiler to measure the efficiency of some C code on a school server, and I'm hitting an odd situation. After a short amount of time (half a second-ish), the processor suddenly starts executing instructions twice as fast. I've tested for just about every possible reason I could think of (caching, load balancing on cores, CPU frequency being altered due to coming out of sleep), but everything seems normal.

For what it's worth, I'm doing this testing on a school Linux server, so it's possible there's an unusual configuration I don't know about, but the processor ID being used doesn't change, and (via top) the server was completely idle while I tested.

Test code:

#include <time.h>
#include <stdio.h>

#define MY_CLOCK CLOCK_MONOTONIC_RAW
// no difference if set to CLOCK_THREAD_CPUTIME_ID

typedef struct {
        unsigned int tsc;
        unsigned int proc;
} ans_t;

static ans_t rdtscp(void){
        ans_t ans;
        __asm__ __volatile__ ("rdtscp" : "=a"(ans.tsc), "=c"(ans.proc) : : "edx");
        return ans;
}

static void nop(void){
        __asm__ __volatile__ ("");
}

void test(){
        for(int i=0; i<100000000; i++) nop();
}

int main(){
        int c=10;
        while(c-->0){
                struct timespec tstart,tend;
                ans_t start = rdtscp();
                clock_gettime(MY_CLOCK,&tstart);
                test();
                ans_t end = rdtscp();
                clock_gettime(MY_CLOCK,&tend);
                unsigned int tdiff = (tend.tv_sec-tstart.tv_sec)*1000000000+tend.tv_nsec-tstart.tv_nsec;
                unsigned int cdiff = end.tsc-start.tsc;
                printf("%u cycles and %u ns (%lf GHz) start proc %u end proc %u\n",cdiff,tdiff,(double)cdiff/tdiff,start.proc,end.proc);
        }
}

Output I see:

351038093 cycles and 125680883 ns (2.793091 GHz) start proc 14 end proc 14
350911246 cycles and 125639359 ns (2.793004 GHz) start proc 14 end proc 14
350959546 cycles and 125656776 ns (2.793001 GHz) start proc 14 end proc 14
351533280 cycles and 125862608 ns (2.792992 GHz) start proc 14 end proc 14
350903833 cycles and 125636787 ns (2.793002 GHz) start proc 14 end proc 14
350924336 cycles and 125644157 ns (2.793002 GHz) start proc 14 end proc 14
349827908 cycles and 125251782 ns (2.792997 GHz) start proc 14 end proc 14
175289886 cycles and 62760404 ns (2.793001 GHz) start proc 14 end proc 14
175283424 cycles and 62758093 ns (2.793001 GHz) start proc 14 end proc 14
175267026 cycles and 62752232 ns (2.793001 GHz) start proc 14 end proc 14

I get similar output (with the doubling happening after a different number of iterations) at different optimization levels (-O0 to -O3).

Could it perhaps have something to do with hyper-threading, where two logical cores in a physical core (the server uses Xeon X5560s, which support hyper-threading) can somehow "merge" to form one twice-as-fast processor?

Some systems scale the processor speed depending on the system load. As you rightly note, this is particularly annoying when benchmarking.

If your server is running Linux, please type

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 

If this outputs ondemand, powersave, or userspace, then CPU frequency scaling is active, and you're going to find it very difficult to do benchmarks. If it says performance, then CPU frequency scaling is disabled.
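
If a scaling governor is active, one way to see the effect directly is to sample the frequency that cpufreq reports around each run. The following is a minimal C sketch, not part of your benchmark: it reads the standard scaling_cur_freq sysfs file; the choice of CPU 14 (matching the "proc 14" in your output) is an assumption you may need to adjust.

#include <stdio.h>

/* Read the current frequency (in kHz) of one CPU from the cpufreq
 * sysfs interface. Returns 0 if the file is missing or unreadable. */
static unsigned long cur_freq_khz(int cpu){
        char path[128];
        unsigned long khz = 0;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if(!f) return 0;
        if(fscanf(f, "%lu", &khz) != 1) khz = 0;
        fclose(f);
        return khz;
}

int main(void){
        /* Print the frequency of cpu14; call this (or cur_freq_khz(14)
         * inside the benchmark loop) before and after test() to see
         * whether the clock jumps mid-run. */
        printf("cpu14: %lu kHz\n", cur_freq_khz(14));
        return 0;
}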

Some CPUs have on-chip branch prediction, which learns the path your code usually takes. When it successfully forecasts what the next if statement will do, the instruction queue does not have to be discarded and refilled with new operations from scratch. Depending on the chip and the algorithm, it might take 5 to 10 cycles until the if statements are forecast successfully. However, there are also reasons that speak against this being the cause of the behaviour here.

Looking at your output, I would say this might also just be the scheduling of the OS and/or the CPU frequency governor in use there. Are you sure the CPU frequency doesn't change during the execution of your code? No CPU boost? Linux tools like cpufreq are often used to regulate the CPU frequency.

Hyper-threading means replicating the register state, not the actual decode/execution units - so this is not the explanation.

To test the accuracy of the micro-benchmark method I would do the following:

  1. Run the program with high priority
  2. Count the number of instructions executed to see if it matches expectations. I would do that using perf stat ./binary, which means you need to have perf installed. I would do this multiple times and look at the cycles and instructions counters to see how many instructions execute per cycle (see the in-process sketch after this list).
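
If you would rather embed the instruction count in the benchmark itself than run it under perf stat, the same hardware counters can be read in-process with the perf_event_open(2) system call. This is a sketch following the usual pattern from the perf_event_open man page, not something from the code above; it assumes a Linux kernel with perf events available and counts user-space instructions of the calling process.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Thin wrapper: glibc provides no perf_event_open() stub. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags){
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void){
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof attr;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* or PERF_COUNT_HW_CPU_CYCLES */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
        if(fd < 0){ perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... code under test, e.g. test() from the question ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        if(read(fd, &count, sizeof count) == sizeof count)
                printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
}

Counting both instructions and cycles this way (one counter per fd) lets you see directly whether the instructions-per-cycle ratio changes between the slow and fast runs, or whether the cycle count itself drops.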

I have some additional remarks:

For each nop you also do a comparison and a conditional jump in the for loop. If you really want to execute NOPs, I'd write something like this:

#define NOP5 __asm__ __volatile__ ("nop\n\tnop\n\tnop\n\tnop\n\tnop");
#define NOP25 NOP5 NOP5 NOP5 NOP5 NOP5
#define NOP100 NOP25 NOP25 NOP25 NOP25
#define NOP500 NOP100 NOP100 NOP100 NOP100 NOP100
...
for(int i=0; i<100000000; i++)
{
   NOP500 NOP500 NOP500 NOP500
}

This construct will allow you to actually execute NOPs instead of spending most of the loop comparing i with 100M.
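
An alternative, if you are using the GNU assembler (which gcc's inline asm goes through), is to let the assembler expand the repetition with a .rept directive instead of chaining preprocessor macros. A small sketch; the repeat count of 1000 is arbitrary:

static void nops(void){
        /* GNU as repeats everything between .rept and .endr; here 1000 nops. */
        __asm__ __volatile__ (".rept 1000\n\tnop\n\t.endr");
}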
