According to http://igoro.com/archive/gallery-of-processor-cache-effects/ , when running Example 2, the time should keep dropping until the step equals the cache line size.
However, on my machine it doesn't.
The code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE 1024*1024*64

int main()
{
    struct timeval start, end;
    int k;
    int i;
    for (k = 1; k <= 1024; k *= 2)
    {
        int *arr = (int*)malloc(SIZE * sizeof(int));
        gettimeofday(&start, NULL);
        for (i = 0; i < SIZE; i += k)
            arr[i] *= 3;
        gettimeofday(&end, NULL);
        printf("K = %d, time = %ld\n", k,
               (long)((end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec)));
        free(arr);
    }
    return 0;
}
The result comes out as:
K = 1, time = 410278
K = 2, time = 265313
K = 4, time = 201540
K = 8, time = 169800
K = 16, time = 155123
K = 32, time = 142496
K = 64, time = 137967
K = 128, time = 135818
K = 256, time = 135128
K = 512, time = 135167
K = 1024, time = 135462
It depends upon the compiler (and its version), the optimization level, and the CPU. Apparently, most of the time is spent in malloc, so I moved it out of the loop and increased SIZE.
I'm trying this on Debian/Sid with GCC 4.8.1 on an i3770K processor with 16 GB of RAM, with:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define SIZE 1024*1024*1024
int main ()
{
    struct timeval start, end;
    clock_t startcl, endcl;
    int k, i;
    int *arr = (int *) malloc (SIZE * sizeof (int));
    if (!arr) { perror("malloc"); exit(EXIT_FAILURE); }
    for (k = 1; k <= 1024; k *= 2) {
        gettimeofday (&start, NULL);
        startcl = clock();
        for (i = 0; i < SIZE; i += k)
            arr[i] *= 3;
        gettimeofday (&end, NULL);
        endcl = clock();
        printf ("K = %d, time = %ld, cpu clock=%ld microsec\n", k,
                (end.tv_sec - start.tv_sec) * 1000000
                + (end.tv_usec - start.tv_usec),
                (long) (endcl - startcl));
    }
    free (arr);
    return 0;
}
and compiling with gcc -Wall -mtune=native -O3 ./wilsonwen.c -o ./wilsonwen-O3
then running it:
K = 1, time = 696074, cpu clock=680000 microsec
K = 2, time = 361173, cpu clock=360000 microsec
K = 4, time = 341920, cpu clock=340000 microsec
K = 8, time = 341767, cpu clock=340000 microsec
K = 16, time = 342065, cpu clock=340000 microsec
K = 32, time = 224502, cpu clock=230000 microsec
K = 64, time = 119544, cpu clock=120000 microsec
K = 128, time = 51089, cpu clock=50000 microsec
K = 256, time = 26447, cpu clock=20000 microsec
K = 512, time = 14104, cpu clock=20000 microsec
K = 1024, time = 8385, cpu clock=10000 microsec
which is more consistent with the blog you mentioned. Moving the malloc out of the outer loop on k is really important (if you don't, you won't see the cache effect, apparently because malloc and the underlying mmap syscall eat quite a lot of time).
I cannot explain why k=1 takes more time (perhaps because the malloc-ed memory is brought into RAM by page faults?). Even after adding a for (i=0; i<SIZE/1024; i++) arr[i] = i; loop to "pre-fetch the pages" before your for (k loop, the time for k=1 is still nearly twice as big as for k=2. We do see the plateau from k=2 to k=16 mentioned in Igor Ostrovsky's blog. Replacing malloc with calloc makes no significant difference. Using clang (3.2) instead of gcc (4.8) for compilation gives very similar timing results.
Optimizing is very important: compiling with gcc -Wall -O0 ./wilsonwen.c -o ./wilsonwen-O0 and running that, I don't see any plateau at all (which you will see even with -O1). It is well known that gcc without any optimization flags emits quite poor machine code.
A general rule when benchmarking is to enable compiler optimizations.
I get the same result as yours.
Someone got the same result in the discussion under that article, and said maybe it is just the real CPU time for arr[i] *= 3.
When K = 1, the loop body runs SIZE times, but when K = 2 it only runs SIZE/2 times.
So you could rewrite the code to run the same number of iterations regardless of K, and then check whether the timings stay the same. I just had this thought while typing; I will try it later. If you have tried it, please add a comment.
Here is my code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#define SIZE 64*1024*1024
int32_t arr [SIZE];
struct timespec ts;
int main(int argc, char *argv[])
{
    long i, j = 0;
    long start;
    long start_sec;
    int count = 1;
    int k = 0;
    // init the arr
    for (i = 0; i < SIZE; ++i) {
        arr[i] = 0;
    }
    for (j = 1; j < 1025;) {
        clock_gettime(CLOCK_REALTIME, &ts);
        start = ts.tv_nsec;
        start_sec = ts.tv_sec;
        // always SIZE iterations, whatever the stride j is
        for (i = 0, k = 0; i < SIZE; i++, k += j) {
            k = k & (SIZE - 1);  // wrap the index so it stays in bounds
            arr[k] *= 3;
            arr[k] = 1;
        }
        clock_gettime(CLOCK_REALTIME, &ts);
        printf("%d, %ld, %ld\n", count,
               (ts.tv_sec - start_sec)*1000000000 + (ts.tv_nsec - start), j);
        count++;
        j *= 2;
    }
    return 0;
}
and the output is below:
1, 352236657, 1
2, 356920027, 2
3, 375986006, 4
4, 494875602, 8
5, 957796009, 16
6, 1397285233, 32
7, 1784398514, 64
8, 1070586859, 128
9, 1130548756, 256
10, 1169113810, 512
11, 1312605482, 1024
If I comment out the arr-init loop, K=1 takes more time than K=2.
And we can see that the time increases as expected (before K = 128), because we always have a loop of 64*1024*1024 iterations regardless of K: the greater K is, the more often a cache line has to be flushed.
But I can't explain the decrease from K = 64 to K = 128.
And @Mysticial talked about lazy malloc, so I also ran an experiment with the original code from the article, but with the arr-init loop added to avoid the lazy-malloc problem. The cost of K = 1 did decrease, but it is still greater than K = 2's cost, and closer to twice K = 2's cost than in the original version. The data is below:
1, 212882204, 1
2, 111660951, 2
3, 67843457, 4
4, 62980310, 8
5, 62092973, 16
6, 42531407, 32
7, 27686909, 64
8, 9142755, 128
9, 4064936, 256
10, 2342842, 512
11, 1130305, 1024
So I think the reason for the decrease from K = 1 to K = 2, and from K = 2 to K = 4, is the drop in the number of iterations from SIZE to SIZE/2.
This is what I think, but I'm not sure.
======================================================
I compiled the code with -Ox and the decrease disappeared (but I had to add the arr-init loop). Thanks to @Basile. I will check the differences in the asm code later.
These are the differences between the asm code:
Without -O1:
movq $0, -32(%rbp)
jmp .L5
.L6:
movq -32(%rbp), %rax
movl arr(,%rax,4), %edx
movl %edx, %eax
addl %eax, %eax
addl %eax, %edx
movq -32(%rbp), %rax
movl %edx, arr(,%rax,4)
movq -24(%rbp), %rax
addq %rax, -32(%rbp)
.L5:
cmpq $67108863, -32(%rbp)
jle .L6
And with -O1:
movl $0, %eax
.L3:
movl arr(,%rax,4), %ecx # ecx = a[i]
leal (%rcx,%rcx,2), %edx # edx = 3* rcx
movl %edx, arr(,%rax,4) # a[i] = edx
addq %rbx, %rax # rax += rbx
cmpq $67108863, %rax
jle .L3
Then I changed the non-optimized asm code to this:
movq $0, -32(%rbp)
movl $0, %eax
movq -24(%rbp), %rbx
jmp .L5
.L6:
movl arr(,%rax,4), %edx
movl %edx, %ecx
addl %ecx, %ecx
addl %ecx, %edx
movl %edx, arr(,%rax,4)
addq %rbx, %rax
.L5:
cmpq $67108863, %rax
jle .L6
Then I get the result:
1, 64119476, 1
2, 63417463, 2
3, 63732534, 4
4, 66703562, 8
5, 65740635, 16
6, 47743618, 32
7, 28402013, 64
8, 9444894, 128
9, 4544371, 256
10, 2991025, 512
11, 1242882, 1024
It's almost the same as with -O1. It seems that the movq -32(%rbp), %rax loads cost too much, but I don't know why.
Maybe I'd better ask a new question about it.