Why does loop fission make sense in this case?

Question

The code without fission looks like this:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[hash(keys[i])];
    }
    return ret;
}

With fission:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    // tmp is a scratch buffer of at least n entries, assumed to be declared elsewhere
    for(int i = 0; i < n; ++i){
        tmp[i] = map[hash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

Notes:

The bottleneck is map[hash(keys[i])], which accesses memory randomly.
Normally, it would be if(tmp[i]) res[ret++] = i; to avoid the if, I'm using ret += tmp[i].
map[..] is always 0 or 1.
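
The second note can be illustrated with a minimal sketch. The function names compact_branch and compact_branchless and the flags array are hypothetical, introduced only for this illustration. Both forms keep the same indices as long as every flag is 0 or 1; the branchless form stores on every iteration and relies on the next store overwriting the slot when the flag was 0.

```c
/* Branching form: store the index only when the flag is set. */
int compact_branch(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) res[ret++] = i;
    }
    return ret;
}

/* Branchless form: always store, then advance the cursor by the flag (0 or 1). */
int compact_branchless(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;       /* overwritten by a later store if flags[i] == 0 */
        ret += flags[i];
    }
    return ret;
}
```

Note that the branchless form may leave one unused value in res just past the kept entries, so only the first ret entries are meaningful and res should have room for n entries.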

The fission version is usually significantly faster and I am trying to explain why. My best guess is that ret += map[..] still introduces some dependency and that prevents speculative execution.

I would like to hear if anyone has a better explanation.

Answer 1

In my tests, I get roughly a 2x speed difference between the fused and split loops. This difference is very consistent no matter how I tweak the loop.

Fused: 1.096258 seconds
Split: 0.562272 seconds

(Refer to the bottom for the full test code.)

Although I'm not 100% sure, I suspect that this is due to a combination of two things:

Saturation of the load-store buffer used for memory disambiguation, due to the cache misses from map[gethash(keys[i])].
An added dependency in the fused loop version.

It's obvious that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.

Now let's look at the added dependency. The issue is the ret variable:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

The ret variable is needed for address resolution of the store res[ret] = i;.

In the fused loop, ret is coming from a sure cache miss.
In the split loop, ret is coming from tmp[i], which is much faster.

This delay in address resolution in the fused loop case likely causes the store res[ret] = i to clog up the load-store buffer along with map[gethash(keys[i])].

Since the load-store buffer has a fixed size and now holds twice as much junk, you are only able to overlap the cache misses half as much as before. Hence the 2x slowdown.

Suppose we changed the fused loop to this:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[i] = i;    //  Change "ret" to "i"
        ret += map[gethash(keys[i])];
    }
    return ret;
}

This will break the address resolution dependency.

(Note that this no longer computes the same result; it only serves to demonstrate the performance difference.)

Then we get similar timings:

Fused: 0.487477 seconds
Split: 0.574585 seconds

Here's the complete test code:

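#include <stdio.h>
#include <stdlib.h>
#include <omp.h>    //  omp_get_wtime(); build with OpenMP enabled

#define SIZE 67108864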

unsigned gethash(int key){
    return key & (SIZE - 1);
}

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[gethash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}


int main()
{
    char *map = (char*)calloc(SIZE,sizeof(char));
    int *keys =  (int*)calloc(SIZE,sizeof(int));
    int *res  =  (int*)calloc(SIZE,sizeof(int));
    int *tmp  =  (int*)calloc(SIZE,sizeof(int));
    if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
        printf("Memory allocation failed.\n");
        system("pause");
        return 1;
    }

    //  Generate Random Data
    for (int i = 0; i < SIZE; i++){
        keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
    }

    printf("Start...\n");

    double start = omp_get_wtime();
    int ret;

    ret = check_fused(res,map,SIZE,keys);
//    ret = check_split(res,map,SIZE,keys,tmp);

    double end = omp_get_wtime();

    printf("ret = %d",ret);
    printf("\n\nseconds = %f\n",end - start);

    system("pause");
}
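
The test program above times one variant at a time, so it does not show that the two loops agree. The following is a small sanity check, sketched under assumptions: TABLE_SIZE, small_hash, fused_small, split_small and variants_agree are hypothetical names, and the table is shrunk to 16 entries so the check runs instantly. The loop bodies are copied from check_fused and check_split.

```c
#include <stdlib.h>

#define TABLE_SIZE 16   /* hypothetical small table; must be a power of two */

static unsigned small_hash(int key) {
    return (unsigned)key & (TABLE_SIZE - 1);
}

/* Same loop body as check_fused, using small_hash. */
int fused_small(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[small_hash(keys[i])];
    }
    return ret;
}

/* Same loop bodies as check_split, using small_hash. */
int split_small(int *res, const char *map, int n, const int *keys, int *tmp) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        tmp[i] = map[small_hash(keys[i])];
    }
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

/* Returns 1 if both variants yield the same count and the same kept indices. */
int variants_agree(const char *map, int n, const int *keys) {
    int *res_a = malloc((size_t)n * sizeof *res_a);
    int *res_b = malloc((size_t)n * sizeof *res_b);
    int *tmp   = malloc((size_t)n * sizeof *tmp);
    int ok = 0;
    if (res_a && res_b && tmp) {
        int a = fused_small(res_a, map, n, keys);
        int b = split_small(res_b, map, n, keys, tmp);
        ok = (a == b);
        /* Only the first `a` entries are meaningful in either output. */
        for (int i = 0; ok && i < a; ++i) {
            ok = (res_a[i] == res_b[i]);
        }
    }
    free(res_a);
    free(res_b);
    free(tmp);
    return ok;
}
```

Calling variants_agree with a map whose entries are all 0 or 1 should return 1. It says nothing about speed, only that fission does not change the result.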

Answer 2

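I don't think the array indexing is the cause; rather, calling the function hash() may cause a pipeline stall and prevent optimal instruction reordering.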
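
One way to probe this hypothesis is to make sure the hash is a candidate for inlining and then time the loop again. The sketch below assumes the mask-based gethash from Answer 1's test code; gethash_inline and check_fused_inline are hypothetical names. An optimizing compiler may well inline the original small function on its own, in which case this change would make no measurable difference.

```c
#define SIZE 67108864

/* Same mask as gethash in Answer 1; `static inline` is only a hint to the compiler. */
static inline unsigned gethash_inline(int key) {
    return (unsigned)key & (SIZE - 1);
}

/* Fused loop from Answer 1, calling the inline-hinted hash instead. */
int check_fused_inline(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[gethash_inline(keys[i])];
    }
    return ret;
}
```

If the timing of check_fused_inline matches that of check_fused, the function call is unlikely to be the bottleneck.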

Answer 1: 8, accepted, 2012-06-20 17:47:17

Answer 2: 1, 2012-06-20 16:13:36