Why is this sequential array loop slower than a loop that uses a "lookup" array?
I've been studying cache locality recently and I'm trying to understand how CPUs access memory. I wrote an experiment to see whether there is a performance difference between looping over an array sequentially and using a lookup table of some sort to index into the data array. I was surprised to find the lookup method slightly faster. My code is below. I compiled with GCC on Windows (MinGW).
#include <stdlib.h>
#include <stdio.h>
#include <windows.h>
int main()
{
DWORD dwElapsed, dwStartTime;
//random arrangement of keys to lookup
int lookup_arr[] = {0, 3, 8, 7, 2, 1, 4, 5, 6, 9};
//data for both loops
int data_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int data_arr2[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
//first loop, sequential access
dwStartTime = GetTickCount();
for (int n = 0; n < 9000000; n++) {
for (int i = 0; i < 10; i++)
data_arr1[i]++;
}
dwElapsed = GetTickCount() - dwStartTime;
printf("Normal loop completed: %lu\n", (unsigned long)dwElapsed); // DWORD is unsigned, so don't print it with %d
//second loop, indexes into data_arr2 using the lookup array
dwStartTime = GetTickCount();
for (int n = 0; n < 9000000; n++) {
for (int i = 0; i < 10; i++)
data_arr2[lookup_arr[i]]++;
}
dwElapsed = GetTickCount() - dwStartTime;
printf("Lookup loop completed: %lu\n", (unsigned long)dwElapsed); // DWORD is unsigned, so don't print it with %d
return 0;
}
Running this, I get:
Normal loop completed: 375
Lookup loop completed: 297
Following up on my earlier comments, here is how you do this kind of thing.
The result is a significant difference in speed with the "randomized indexing".
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>
#define N 1000000
int main(void) {
int *rArr;
int *rInd; // randomized indices
int *lInd; // linear indices
int ii;
rArr = malloc(N * sizeof(int) );
rInd = malloc(N * sizeof(int) );
lInd = malloc(N * sizeof(int) );
for(ii = 0; ii < N; ii++) {
lInd[ii] = ii;
rArr[ii] = rand();
rInd[ii] = rand()%N;
}
int loopCount;
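int sum = 0; // must be initialized; reading an uninitialized int is undefined behavior
clock_t startT, stopT; // clock() returns clock_t, not time_t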
double dt, totalT=0, tt2=0;
startT = clock();
for(loopCount = 0; loopCount < 100; loopCount++) {
for(ii = 0; ii < N; ii++) {
sum += rArr[lInd[ii]];
}
stopT = clock();
dt = stopT - startT;
totalT += dt;
tt2 += dt * dt;
startT = stopT;
}
printf("sum is %d\n", sum);
printf("total time: %lf += %lf\n", totalT/(double)(CLOCKS_PER_SEC), sqrt((tt2 - totalT * totalT / 100.0)/100.0) / (double)(CLOCKS_PER_SEC));
totalT = 0; tt2 = 0;
startT = clock();
for(loopCount = 0; loopCount < 100; loopCount++) {
for(ii = 0; ii < N; ii++) {
sum += rArr[rInd[ii]];
}
stopT = clock();
dt = stopT - startT;
totalT += dt;
tt2 += dt * dt;
startT = stopT;
}
printf("sum is %d\n", sum);
printf("total time: %lf += %lf\n", totalT/(double)(CLOCKS_PER_SEC), sqrt((tt2 - totalT * totalT / 100.0)/100.0) / (double)(CLOCKS_PER_SEC));
}
Result - the sequential access is about 2x faster (on my machine):
sum is -1444272372
total time: 0.396539 += 0.000219
sum is 546230204
total time: 0.756407 += 0.001165
With -O3 optimization, the difference is even starker - a full 3x faster:
sum is -318372465
total time: 0.142444 += 0.013230
sum is 1672130111
total time: 0.455804 += 0.000402
I believe you are compiling without optimizations turned on. With -O2, g++ optimizes away everything so the run time is 0, and without the flag I get similar results.
After modifying the program so that the values in data_arr1 and data_arr2 are actually used for something, I get 78 ms for both.