According to http://igoro.com/archive/gallery-of-processor-cache-effects/ , when running Example 2, the time should keep dropping until the step equals the cache line size.
However, on my machine it doesn't.
The code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE 1024*1024*64

int main()
{
    struct timeval start, end;
    int k;
    int i;
    for (k = 1; k <= 1024; k *= 2)
    {
        int *arr = (int*)malloc(SIZE * sizeof(int));
        gettimeofday(&start, NULL);
        for (i = 0; i < SIZE; i += k)
            arr[i] *= 3;
        gettimeofday(&end, NULL);
        printf("K = %d, time = %ld\n", k,
               (long)((end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec)));
        free(arr);
    }
    return 0;
}
The result comes out as:
K = 1, time = 410278
K = 2, time = 265313
K = 4, time = 201540
K = 8, time = 169800
K = 16, time = 155123
K = 32, time = 142496
K = 64, time = 137967
K = 128, time = 135818
K = 256, time = 135128
K = 512, time = 135167
K = 1024, time = 135462
It depends upon the compiler (and its version), the optimization level, and the CPU. Apparently, most of the time is spent in malloc, so I moved it out of the loop and increased SIZE.
I'm trying this on Debian/Sid with GCC 4.8.1 on an i3770K processor with 16 GB of RAM, with:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define SIZE 1024*1024*1024
int main ()
{
    struct timeval start, end;
    clock_t startcl, endcl;
    int k, i;
    int *arr = (int *) malloc (SIZE * sizeof (int));
    if (!arr) { perror("malloc"); exit(EXIT_FAILURE); }
    for (k = 1; k <= 1024; k *= 2) {
        gettimeofday (&start, NULL);
        startcl = clock();
        for (i = 0; i < SIZE; i += k)
            arr[i] *= 3;
        gettimeofday (&end, NULL);
        endcl = clock();
        printf ("K = %d, time = %ld, cpu clock=%ld microsec\n", k,
                (end.tv_sec - start.tv_sec) * 1000000
                + (end.tv_usec - start.tv_usec),
                (long) (endcl - startcl));
    }
    free (arr);
    return 0;
}
and compiling with gcc -Wall -mtune=native -O3 ./wilsonwen.c -o ./wilsonwen-O3
then running it:
K = 1, time = 696074, cpu clock=680000 microsec
K = 2, time = 361173, cpu clock=360000 microsec
K = 4, time = 341920, cpu clock=340000 microsec
K = 8, time = 341767, cpu clock=340000 microsec
K = 16, time = 342065, cpu clock=340000 microsec
K = 32, time = 224502, cpu clock=230000 microsec
K = 64, time = 119544, cpu clock=120000 microsec
K = 128, time = 51089, cpu clock=50000 microsec
K = 256, time = 26447, cpu clock=20000 microsec
K = 512, time = 14104, cpu clock=20000 microsec
K = 1024, time = 8385, cpu clock=10000 microsec
which is more consistent with the blog you mentioned. Moving the malloc out of the outer loop on k is really important (if you don't, you won't see the cache effect, apparently because malloc and the underlying mmap syscall eat quite a lot of time).
I cannot explain why k=1 takes more time (perhaps because the malloc-ed memory is brought into RAM by page faults?). Even after adding a for (i=0; i<SIZE/1024; i++) arr[i] = i; loop to "pre-fetch the pages" before your for (k loop, the time for k=1 is still nearly twice as big as for k=2. We do see the plateau from k=2 to k=16 mentioned in Igor Ostrovsky's blog. Replacing malloc with calloc makes no significant difference. Using clang (3.2) instead of gcc (4.8) for compilation gives very similar timing results.
Optimizing is very important: compiling with gcc -Wall -O0 ./wilsonwen.c -o ./wilsonwen-O0 and running that, I don't see any plateau at all (which you will see even with -O1). It is well known that gcc without any optimization flags emits quite poor machine code.
A general rule when benchmarking is to enable compiler optimizations.
I get the same result as yours.
Someone got the same result in the discussion under that article, and said maybe it is just the real CPU time for arr[i] *= 3.
When K = 1, the loop body runs SIZE times, but when K = 2 it only runs SIZE/2 times.
So you could rewrite the code to run the same number of iterations regardless of K, and then check whether the timings stay the same. I just had this thought while typing; I will try it later. If you have tried it, please add a comment.
Here is my code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#define SIZE 64*1024*1024
int32_t arr [SIZE];
struct timespec ts;
int main(int argc, char *argv[])
{
    long i, j = 0;
    long start;
    long start_sec;
    int count = 1;
    int k = 0;
    // init the arr
    for (i = 0; i < SIZE; ++i) {
        arr[i] = 0;
    }
    for (j = 1; j < 1025;) {
        clock_gettime(CLOCK_REALTIME, &ts);
        start = ts.tv_nsec;
        start_sec = ts.tv_sec;
        // always SIZE iterations, whatever the stride j is
        for (i = 0, k = 0; i < SIZE; i++, k += j) {
            k = k & (SIZE - 1);  // wrap the index so it stays in bounds
            arr[k] *= 3;
            arr[k] = 1;
        }
        clock_gettime(CLOCK_REALTIME, &ts);
        printf("%d, %ld, %ld\n", count,
               (ts.tv_sec - start_sec)*1000000000 + (ts.tv_nsec - start), j);
        count++;
        j *= 2;
    }
    return 0;
}
and the output is below:
1, 352236657, 1
2, 356920027, 2
3, 375986006, 4
4, 494875602, 8
5, 957796009, 16
6, 1397285233, 32
7, 1784398514, 64
8, 1070586859, 128
9, 1130548756, 256
10, 1169113810, 512
11, 1312605482, 1024
If I comment out the arr-init loop, K=1 takes more time than K=2.
And we can see that the time increases as expected (before K = 128), because we always have a loop of 64*1024*1024 iterations regardless of K: the greater K is, the more often a cache line has to be flushed.
But I can't explain the decrease from K = 64 to K = 128.
And @Mysticial talked about lazy malloc, so I also ran an experiment with the original code from the article, but with the arr-init loop added to avoid the lazy-malloc problem. The cost of K = 1 did decrease, but it is still greater than K = 2's cost, and closer to twice K = 2's cost than in the original version. The data is below:
1, 212882204, 1
2, 111660951, 2
3, 67843457, 4
4, 62980310, 8
5, 62092973, 16
6, 42531407, 32
7, 27686909, 64
8, 9142755, 128
9, 4064936, 256
10, 2342842, 512
11, 1130305, 1024
So I think the reason for the decrease from K = 1 to K = 2, and from K = 2 to K = 4, is the drop in the number of iterations from SIZE to SIZE/2.
This is what I think, but I'm not sure.
======================================================
I compiled the code with -Ox and the decrease disappeared (but I had to add the arr-init loop). Thanks to @Basile. I will check the differences in the asm code later.
These are the differences between the asm code:
Without -O1:
movq $0, -32(%rbp)
jmp .L5
.L6:
movq -32(%rbp), %rax
movl arr(,%rax,4), %edx
movl %edx, %eax
addl %eax, %eax
addl %eax, %edx
movq -32(%rbp), %rax
movl %edx, arr(,%rax,4)
movq -24(%rbp), %rax
addq %rax, -32(%rbp)
.L5:
cmpq $67108863, -32(%rbp)
jle .L6
And with -O1:
movl $0, %eax
.L3:
movl arr(,%rax,4), %ecx # ecx = a[i]
leal (%rcx,%rcx,2), %edx # edx = 3* rcx
movl %edx, arr(,%rax,4) # a[i] = edx
addq %rbx, %rax # rax += rbx
cmpq $67108863, %rax
jle .L3
Then I changed the non-optimized asm code to this:
movq $0, -32(%rbp)
movl $0, %eax
movq -24(%rbp), %rbx
jmp .L5
.L6:
movl arr(,%rax,4), %edx
movl %edx, %ecx
addl %ecx, %ecx
addl %ecx, %edx
movl %edx, arr(,%rax,4)
addq %rbx, %rax
.L5:
cmpq $67108863, %rax
jle .L6
Then I get the result:
1, 64119476, 1
2, 63417463, 2
3, 63732534, 4
4, 66703562, 8
5, 65740635, 16
6, 47743618, 32
7, 28402013, 64
8, 9444894, 128
9, 4544371, 256
10, 2991025, 512
11, 1242882, 1024
It's almost the same as with -O1. It seems that the movq -32(%rbp), %rax loads cost too much, but I don't know why.
Maybe I'd better ask a new question about it.