
Why does my program run faster when I overload the system with other arbitrary work?

I was running some timing and efficiency tests and came across some unexpected behavior. I found that my program actually ran faster if I ran other background processes that pegged all of the system's CPU cores at 100%. Here is a simplified example program:

#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

void vadd(const float *u, const float *v, float *y, int n) {
    int  i;

    for (i = 0; i < n; i++) {
        y[i] = u[i] + v[i];
    }
}

int main(int argc, char *argv[]) {
    int i, its = 100000, n = 16384;
    float *a, *b, *c;
    clock_t start, end;
    double cpu_time;

    /* Make sure alignment is the same on each run. */
    posix_memalign((void**)&a, 16, sizeof(float) * n);
    posix_memalign((void**)&b, 16, sizeof(float) * n);
    posix_memalign((void**)&c, 16, sizeof(float) * n);

    /* Some arbitrary initialization */
    for (i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 4;
        c[i] = 0;
    }

    /* Now the real work */
    start = clock();
    for (i = 0; i < its; i++) {
        vadd(a, b, c, n);
    }
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("Done, cpu time: %f\n", cpu_time);

    return 0;
}

I'm running on a (rather old) Pentium 4 @ 2.8GHz with Hyper Threading turned on, which shows up as two processors in /proc/cpuinfo.

Output with the system relatively idle:

$ ./test
Done, cpu time: 11.450000

And now loading all cores:

$ md5sum /dev/zero& ./test; killall md5sum
Done, cpu time: 8.930000

This result is consistent. I'm guessing that I've somehow improved cache efficiency by reducing the number of times the program gets moved to the other CPU, but this is just a shot in the dark. Can anyone confirm or refute this?
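
One way to test this guess would be to pin the process to a single logical CPU so the scheduler can't migrate it between the two hyper-threaded siblings. Below is a minimal sketch assuming Linux's GNU-specific sched_setaffinity(2); pinning to CPU 0 is arbitrary, and running under taskset 0x1 ./test should have the same effect:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cpu_set_t mask;

    /* Restrict this process to logical CPU 0 so it can never be
     * migrated to the other hyper-threaded sibling. */
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* ... run the vadd() timing loop from the test program here ... */
    return 0;
}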

Secondary question: I was surprised to find that cpu_time could vary so much from run to run. The method used above is taken right out of the GNU C manual, and I thought that using clock() would protect me from timing fluctuations due to other processes using the CPU. Clearly, based on the above results, this isn't the case. So my secondary question is, is the clock() method really the proper way to measure performance?

Update: I've looked into the suggestions in the comments about the CPU frequency scaling governor, and I don't think that's what is going on here. I've attempted to monitor the CPU speed in real time via watch grep \"cpu MHz\" /proc/cpuinfo (as suggested here) and I don't see a frequency change while the programs are running. I should also have mentioned that I'm running a fairly old kernel: 2.6.25.

Update 2: I started using the script below to play around with the number of md5sum processes that are started. Even when I start more processes than there are logical CPUs, it's still faster than running standalone.

Update 3: If I turn off Hyper Threading in the BIOS, this strange behavior goes away and the run always takes around 11 seconds of CPU time. It looks like Hyper Threading has something to do with it.

Update 4: I just ran this on a dual quad-core Intel Xeon @ 2.5GHz and didn't see any of the strange behavior above. This "issue" may be fairly specific to my particular hardware setup.

#!/bin/bash
declare -i num=$1

for (( num; num; num-- )); do
  md5sum /dev/zero &
done

time ./test
killall md5sum

--

$ ./run_test.sh 5
Done, cpu time: 9.070000

real    0m27.738s
user    0m9.021s
sys 0m0.052s

$ ./run_test.sh 2
Done, cpu time: 9.240000

real    0m15.297s
user    0m9.169s
sys 0m0.080s

$ ./run_test.sh 0
Done, cpu time: 11.040000

real    0m11.041s
user    0m11.041s
sys 0m0.004s

So my secondary question is, is the clock() method really the proper way to measure performance?

You could prefer using clock_gettime(2) and friends. Also read time(7).

The details could be hardware-specific (i.e. CPU + motherboard) and kernel-specific.
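
For instance, here is a minimal sketch (assuming a Linux system; older glibc may need linking with -lrt) that reads CLOCK_PROCESS_CPUTIME_ID for per-process CPU time and CLOCK_MONOTONIC for wall-clock time around the same kind of loop, so the two can be compared directly:

#include <stdio.h>
#include <time.h>

/* Difference between two timespecs, in seconds. */
static double diff_sec(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec cpu0, cpu1, wall0, wall1;
    volatile double x = 0.0;   /* volatile so the loop isn't optimized away */
    long i;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_MONOTONIC, &wall0);

    for (i = 0; i < 100000000L; i++)   /* stand-in for the vadd() loop */
        x += i;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &wall1);

    printf("cpu:  %f s\n", diff_sec(cpu0, cpu1));
    printf("wall: %f s\n", diff_sec(wall0, wall1));
    return 0;
}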

With a single process running on a core, clock() should return how much time that process spent running. This includes time that the core was actually executing and time that the core was waiting for things like fetching instructions and data from cache/memory, waiting for the results of one instruction that's needed by another instruction, etc. Basically, for this case, clock() returns "time spent executing plus lots of tiny little gaps".

For hyper-threading, the same core is shared by 2 "logical CPUs". The core uses all those tiny little gaps in one process to execute the other process, and the core does more total work in less time (due to less time wasted waiting). In this case what should the clock() function measure?

For example, if 2 processes both run on the same core for 10 seconds, should clock() say that both processes used 10 seconds each, or should clock() say that both processes used half of 10 seconds each?

My theory is that on your system clock() returns "core time consumed / processes consuming core's time". With one process running for 10 seconds clock() returns "10 seconds", and with 2 of these processes sharing the core they might run for 16 seconds instead of 20 seconds (due to the core wasting less time on "gaps") and clock() returns "16/2 = 8 seconds for each process"; making it seem like the process ran 2 seconds faster when there was more load (even though it took 16 seconds instead of 10 seconds).
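
A way to probe this theory is to time the same loop with both clock() and a CLOCK_MONOTONIC wall-clock reading in a single run (a rough sketch, assuming Linux): if clock() really returns the core's time divided among the processes sharing it, rerunning this under the md5sum load should make the clock() figure drop while the wall-clock figure grows.

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec w0, w1;
    clock_t c0, c1;
    volatile double x = 0.0;   /* volatile so the loop isn't optimized away */
    long i;

    c0 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w0);

    for (i = 0; i < 200000000L; i++)   /* stand-in for the vadd() loop */
        x += i;

    clock_gettime(CLOCK_MONOTONIC, &w1);
    c1 = clock();

    printf("clock(): %f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("wall:    %f s\n",
           (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9);
    return 0;
}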
