
Why is this sequential array loop slower than a loop that uses a "lookup" array?

I've been studying cache locality recently and I'm trying to understand how CPUs access memory. I wrote an experiment to see whether there is a performance difference between looping over an array sequentially and indexing into the data array through a lookup table. I was surprised to find the lookup method slightly faster. My code is below. I compiled with GCC on Windows (MinGW).

#include <stdlib.h>
#include <stdio.h>
#include <windows.h>

int main()
{
    DWORD dwElapsed, dwStartTime;

    //random arrangement of keys to lookup
    int lookup_arr[] = {0, 3, 8, 7, 2, 1, 4, 5, 6, 9};

    //data for both loops
    int data_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int data_arr2[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    //first loop, sequential access
    dwStartTime = GetTickCount();
    for (int n = 0; n < 9000000; n++) {
        for (int i = 0; i < 10; i++)
            data_arr1[i]++;
    }
    dwElapsed = GetTickCount() - dwStartTime;
    printf("Normal loop completed: %lu\n", dwElapsed); // DWORD is unsigned long

    //second loop, indexes into data_arr2 using the lookup array
    dwStartTime = GetTickCount();
    for (int n = 0; n < 9000000; n++) {
        for (int i = 0; i < 10; i++)
            data_arr2[lookup_arr[i]]++;
    }
    dwElapsed = GetTickCount() - dwStartTime;
    printf("Lookup loop completed: %lu\n", dwElapsed); // DWORD is unsigned long

    return 0;
}

Running this, I get (times in milliseconds):

Normal loop completed: 375
Lookup loop completed: 297

Following up on my earlier comments, here is how to do this kind of measurement:

  1. Repeated measurements
  2. Estimate the error
  3. Large memory block
  4. Randomized vs. linear indices (so either way you have an indirection)

The result is a significant difference in speed: the "randomized indexing" loop is much slower.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>

#define N 1000000

int main(void) {
  int *rArr;
  int *rInd; // randomized indices
  int *lInd; // linear indices
  int ii;

  rArr = malloc(N * sizeof(int) );
  rInd = malloc(N * sizeof(int) );
  lInd = malloc(N * sizeof(int) );

  for(ii = 0; ii < N; ii++) {
    lInd[ii] = ii;
    rArr[ii] = rand();
    rInd[ii] = rand()%N;
  }

  int loopCount;
  int sum = 0;            // must be initialized before it is accumulated into
  clock_t startT, stopT;  // clock() returns clock_t, not time_t
  double dt, totalT=0, tt2=0;

  startT = clock();
  for(loopCount = 0; loopCount < 100; loopCount++) {
    for(ii = 0; ii < N; ii++) {
      sum += rArr[lInd[ii]];
    }
    stopT = clock();
    dt = stopT - startT;
    totalT += dt;
    tt2 += dt * dt;
    startT = stopT;
  }
  printf("sum is %d\n", sum);
  printf("total time: %lf += %lf\n", totalT/(double)(CLOCKS_PER_SEC), sqrt((tt2 - totalT * totalT / 100.0)/100.0) / (double)(CLOCKS_PER_SEC));

  totalT = 0; tt2 = 0;
  startT = clock();
  for(loopCount = 0; loopCount < 100; loopCount++) {
    for(ii = 0; ii < N; ii++) {
      sum += rArr[rInd[ii]];
    }
    stopT = clock();
    dt = stopT - startT;
    totalT += dt;
    tt2 += dt * dt;
    startT = stopT;
  }
  printf("sum is %d\n", sum);
  printf("total time: %lf += %lf\n", totalT/(double)(CLOCKS_PER_SEC), sqrt((tt2 - totalT * totalT / 100.0)/100.0) / (double)(CLOCKS_PER_SEC));
}

Result: sequential access is almost 2x faster (on my machine):

sum is -1444272372
total time: 0.396539 += 0.000219
sum is 546230204
total time: 0.756407 += 0.001165
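
The number printed after `+=` is meant as an estimate of the spread of the per-pass timings. As a sketch of the arithmetic, write $t_k$ for the time of pass $k$ (`dt` in the code), $n = 100$ for the number of passes, $S_1$ for `totalT` and $S_2$ for `tt2`; these symbols are introduced here for illustration and do not appear in the original answer. The mean and the (population) standard deviation of a single pass then follow from the two running sums:

```latex
\[
S_1 = \sum_{k=1}^{n} t_k, \qquad
S_2 = \sum_{k=1}^{n} t_k^2, \qquad
\bar{t} = \frac{S_1}{n}, \qquad
\sigma = \sqrt{\frac{S_2}{n} - \left(\frac{S_1}{n}\right)^2}
       = \sqrt{\frac{S_2 - S_1^2/n}{n}} .
\]
```

Both $\bar{t}$ and $\sigma$ come out in clock ticks, so each is divided by `CLOCKS_PER_SEC` to get seconds. Note that $\sigma$ describes one pass of the inner loop, whereas the first number on each output line is the total over all 100 passes.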

With -O3 optimization, the difference is even starker, a full 3x faster:

sum is -318372465
total time: 0.142444 += 0.013230
sum is 1672130111
total time: 0.455804 += 0.000402
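
One caveat when reproducing this on the asker's setup: `RAND_MAX` is only guaranteed to be at least 32767, and that is its value in the Microsoft C runtime that MinGW normally links against. There, `rand()%N` with `N` of 1000000 would only ever produce indices below 32768, so the "random" loop would touch a small, cache-friendly part of the array. A minimal sketch of one way around this is to shuffle the linear indices instead. The helper name `fill_shuffled_indices` is hypothetical and not part of the original answer:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original answer).
 * Fills idx[0..n-1] with 0..n-1 and shuffles it in place (Fisher-Yates),
 * so every element of the data array is visited exactly once per pass,
 * in a random order. */
void fill_shuffled_indices(int *idx, int n)
{
    assert(idx != NULL && n > 0);

    for (int i = 0; i < n; i++)
        idx[i] = i;

    for (int i = n - 1; i > 0; i--) {
        /* Combine two rand() calls so that j can exceed 32767 even when
         * RAND_MAX is that small. The modulo introduces a slight bias,
         * which should not matter for a cache benchmark. */
        unsigned long r = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
        int j = (int)(r % (unsigned long)(i + 1));

        int tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }
}
```

In the benchmark above this would stand in for the `rInd[ii] = rand()%N;` assignment, called once as `fill_shuffled_indices(rInd, N)` after the allocation. Whether it changes the measured ratio depends on the platform's `RAND_MAX` and cache sizes.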

I believe you are compiling without optimizations turned on. With -O2, g++ optimizes everything away so the run time is 0, and without the flag I get results similar to yours.

After modifying the program so that the values in data_arr1 and data_arr2 are actually used for something, I get 78 ms for both.
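
The answer does not show the modification itself. One plausible way to "use" the values, sketched here as an assumption, is to fold each array into a checksum and print it after the timer stops. The names `checksum` and `run_lookup_loop` are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: folds an array into a single value so that the
 * stores made by the timed loop have an observable effect. */
long checksum(const int *arr, size_t len)
{
    assert(arr != NULL);

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += arr[i];
    return sum;
}

/* The question's lookup loop, with the repetition count as a parameter.
 * Returning the checksum keeps the compiler from discarding the loop
 * as dead code. */
long run_lookup_loop(int reps)
{
    const int lookup_arr[10] = {0, 3, 8, 7, 2, 1, 4, 5, 6, 9};
    int data_arr[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    for (int n = 0; n < reps; n++)
        for (int i = 0; i < 10; i++)
            data_arr[lookup_arr[i]]++;

    return checksum(data_arr, 10);
}
```

Calling `run_lookup_loop(9000000)` between the two timer reads and then printing the returned value with `printf` makes the result observable. An optimizer may still simplify the loop rather than run every iteration, so this prevents outright removal but does not guarantee that the timing reflects the memory access pattern.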
