Why does loop fission make sense in this case?
The code without fission looks like this:
int check(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[hash(keys[i])];
}
return ret;
}
With fission:
int check(int * res, char * map, int n, int * keys, int * tmp){
int ret = 0;
for(int i = 0; i < n; ++i){
tmp[i] = map[hash(keys[i])];
}
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += tmp[i];
}
return ret;
}
Notes:
The bottleneck is map[hash(keys[i])], which accesses memory randomly.
Normally, it would be if(tmp[i]) res[ret++] = i; to avoid the if, I'm using ret += tmp[i].
map[..] is always 0 or 1.
The fission version is usually significantly faster and I am trying to explain why. My best guess is that ret += map[..] still introduces some dependency and that prevents speculative execution.
I would like to hear if anyone has a better explanation.
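As a quick illustration of the branch-free trick mentioned in the notes, here is a minimal sketch comparing it with the plain if version. The function names filter_branch and filter_branchless are made up for this example, and it assumes every flag in tmp is exactly 0 or 1, as the question states for map[..].

```c
#include <assert.h>

/* Straightforward version: store i only when its flag is set. */
int filter_branch(int *res, const int *tmp, int n) {
    assert(n >= 0);
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        if (tmp[i]) res[ret++] = i;
    }
    return ret;
}

/* Branch-free version: always store i, advance ret only when the flag is 1. */
int filter_branchless(int *res, const int *tmp, int n) {
    assert(n >= 0);
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;   /* overwritten by a later i unless ret advances */
        ret += tmp[i];  /* relies on tmp[i] being exactly 0 or 1 */
    }
    return ret;
}
```

Both versions return the same count and agree on the first ret entries of res. The branch-free one may also write one extra slot at res[ret], but since ret never exceeds i at the time of the store, it stays within the first n elements.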
From my tests, I get roughly a 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.
Fused: 1.096258 seconds
Split: 0.562272 seconds
(Refer to the bottom for the full test code.)
Although I'm not 100% sure, I suspect that this is due to a combination of two things:
1. Saturation of the load-store buffer for memory disambiguation, caused by the cache misses from map[gethash(keys[i])].
2. An added dependency in the fused loop.
It's obvious that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.
Now let's look at the added dependency. The issue is the ret variable:
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[gethash(keys[i])];
}
return ret;
}
The ret variable is needed for address resolution of the store res[ret] = i;.
In the fused loop, ret is coming from a sure cache miss.
In the split loop, ret is coming from tmp[i], which is much faster.
This delay in address resolution in the fused loop likely causes the store res[ret] = i to clog up the load-store buffer along with map[gethash(keys[i])].
Since the load-store buffer has a fixed size but now holds double the junk, you are only able to overlap the cache misses half as much as before. Thus the 2x slowdown.
Suppose we changed the fused loop to this:
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[i] = i; // Changed the index from "ret" to "i"
ret += map[gethash(keys[i])];
}
return ret;
}
This will break the address resolution dependency.
(Note that the result is not the same anymore; this is just to demonstrate the performance difference.)
Then we get similar timings:
Fused: 0.487477 seconds
Split: 0.574585 seconds
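To make the "not the same anymore" caveat concrete, here is a small sketch that runs both fused variants on a tiny table. The names gethash_small, check_fused_small, check_fused_indep and the 8-entry TABLE_SIZE are assumptions made for this illustration only; they are not part of the test code below.

```c
#include <assert.h>

/* Hypothetical stand-in for gethash(): mask the key to a tiny table. */
enum { TABLE_SIZE = 8 };
static unsigned gethash_small(int key) {
    return (unsigned)key & (TABLE_SIZE - 1);
}

/* Original fused loop: the store address depends on ret. */
int check_fused_small(int *res, const char *map, int n, const int *keys) {
    assert(n >= 0);
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[gethash_small(keys[i])];
    }
    return ret;
}

/* Modified fused loop: the store address depends only on i. */
int check_fused_indep(int *res, const char *map, int n, const int *keys) {
    assert(n >= 0);
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[i] = i;
        ret += map[gethash_small(keys[i])];
    }
    return ret;
}
```

Both functions return the same count, but only the first one leaves the matching indices packed at the front of res. The second just writes 0..n-1, which is why it serves as a timing experiment and not as a replacement.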
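Here's the complete test code:
#include <stdio.h>   // printf
#include <stdlib.h>  // calloc, rand, system
#include <omp.h>     // omp_get_wtime (build with OpenMP enabled)
#define SIZE 67108864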
unsigned gethash(int key){
return key & (SIZE - 1);
}
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[gethash(keys[i])];
}
return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
int ret = 0;
for(int i = 0; i < n; ++i){
tmp[i] = map[gethash(keys[i])];
}
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += tmp[i];
}
return ret;
}
int main()
{
char *map = (char*)calloc(SIZE,sizeof(char));
int *keys = (int*)calloc(SIZE,sizeof(int));
int *res = (int*)calloc(SIZE,sizeof(int));
int *tmp = (int*)calloc(SIZE,sizeof(int));
if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
printf("Memory allocation failed.\n");
system("pause");
return 1;
}
// Generate Random Data
for (int i = 0; i < SIZE; i++){
keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
}
printf("Start...\n");
double start = omp_get_wtime();
int ret;
ret = check_fused(res,map,SIZE,keys);
// ret = check_split(res,map,SIZE,keys,tmp);
double end = omp_get_wtime();
printf("ret = %d",ret);
printf("\n\nseconds = %f\n",end - start);
system("pause");
}
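The test code above depends on OpenMP for omp_get_wtime() and on the Windows-only system("pause"). If neither is available, a rough substitute can be sketched with the standard clock() function. The helper time_call and the example workload run_sum_job are hypothetical names for this sketch. Note that clock() reports processor time, not wall-clock time, so it is only a reasonable proxy for a single-threaded test like this one.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: CPU seconds consumed by one call to fn(arg). */
double time_call(void (*fn)(void *), void *arg) {
    assert(fn != NULL);
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Example workload: sums 0..n-1 and stores the result in the job. */
struct sum_job {
    long n;
    long total;
};

void run_sum_job(void *arg) {
    struct sum_job *job = arg;
    long total = 0;
    for (long i = 0; i < job->n; ++i) {
        total += i;
    }
    job->total = total;
}
```

To time check_fused or check_split this way, one would wrap the call and its arguments in a similar job struct and pass it to time_call.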
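I don't think it's the array indexing; rather, the call to the function hash() may cause a pipeline stall and prevent optimal instruction reordering.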