Why does loop fission make sense in this case?

Question

The code without fission looks like this:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[hash(keys[i])];
    }
    return ret;
}

With fission:

int check(int * res, char * map, int n, int * keys, int * tmp){  // tmp: scratch buffer of at least n ints
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[hash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

Notes:

The bottleneck is map[hash(keys[i])], which accesses memory randomly.
Normally it would be if(tmp[i]) res[ret++] = i; to avoid the if, I use ret += tmp[i].
map[..] is always 0 or 1.
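
The second note can be illustrated with a minimal sketch. The function names compact_branch and compact_branchless and the flags array are hypothetical, introduced only for this illustration; flags plays the role of tmp and is assumed to hold only 0 or 1.

```c
/* Branching form: store the index only when the flag is set. */
int compact_branch(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) res[ret++] = i;
    }
    return ret;
}

/* Branchless form: always store, but advance the write position only
   when the flag is 1. A store made while the flag is 0 is overwritten
   by the next iteration, so only the first `ret` entries are meaningful. */
int compact_branchless(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += flags[i];
    }
    return ret;
}
```

Both forms should return the same count and agree on the first ret entries of res. The branchless form may also write one entry past that prefix, so res needs room for n ints in either case.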

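The fission version is usually significantly faster, and I am trying to explain why. My best guess is that ret += map[..] still introduces some dependency and prevents speculative execution.

I would like to hear whether anyone has a better explanation.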

Answer 1

In my tests, I get roughly a 2x speed difference between the fused and split loops. This difference is very consistent no matter how I tweak the loop.

Fused: 1.096258 seconds
Split: 0.562272 seconds

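(See the bottom for the full test code.)

While I'm not 100% sure, I suspect that this is due to a combination of two things:

Saturation of the load-store buffers used for memory disambiguation, due to the cache misses from map[gethash(keys[i])].
An added dependency in the fused loop version.

It's obvious that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.

Now let's look at the added dependency. The issue is the ret variable: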

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

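The store res[ret] = i; needs the ret variable for its address resolution.

In the fused loop, ret comes from a near-certain cache miss.
In the split loop, ret comes from tmp[i], which is much faster.

This delay in address resolution in the fused loop likely causes the res[ret] = i store to clog up the load-store buffer along with map[gethash(keys[i])].

The load-store buffers have a fixed size, but they now hold twice as many pending entries.
As a result, you can only overlap about half as many cache misses as before. Hence the 2x slowdown.

Suppose we change the fused loop to this: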

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[i] = i;    //  Index changed from "ret" to "i"
        ret += map[gethash(keys[i])];
    }
    return ret;
}

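This breaks the address resolution dependency.

(Note that it is no longer equivalent to the original; it is only meant to demonstrate the performance difference.)

Then we get similar timings: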

Fused: 0.487477 seconds
Split: 0.574585 seconds

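Here is the full test code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>    //  omp_get_wtime(); compile with OpenMP enabled

#define SIZE 67108864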

unsigned gethash(int key){
    return key & (SIZE - 1);
}

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[gethash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}


int main()
{
    char *map = (char*)calloc(SIZE,sizeof(char));
    int *keys =  (int*)calloc(SIZE,sizeof(int));
    int *res  =  (int*)calloc(SIZE,sizeof(int));
    int *tmp  =  (int*)calloc(SIZE,sizeof(int));
    if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
        printf("Memory allocation failed.\n");
        system("pause");
        return 1;
    }

    //  Generate Random Data
    for (int i = 0; i < SIZE; i++){
        keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
    }

    printf("Start...\n");

    double start = omp_get_wtime();
    int ret;

    ret = check_fused(res,map,SIZE,keys);
//    ret = check_split(res,map,SIZE,keys,tmp);

    double end = omp_get_wtime();

    printf("ret = %d",ret);
    printf("\n\nseconds = %f\n",end - start);

    system("pause");
}
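
Since the benchmark only times one variant per run, it can help to confirm separately that the two variants compute the same thing. The following is a minimal sketch of such a check, not part of the original test: SMALL_SIZE, variants_agree and the random 0/1 map contents are assumptions made for illustration, and the loops are copied from the code above with const added.

```c
#include <stdlib.h>

#define SMALL_SIZE 1024   /* hypothetical small power of two, for checking only */

static unsigned gethash(int key) {
    return key & (SMALL_SIZE - 1);
}

static int check_fused(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

static int check_split(int *res, const char *map, int n, const int *keys, int *tmp) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        tmp[i] = map[gethash(keys[i])];
    }
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

/* Returns 1 if both variants yield the same count and the same
   first `count` entries of res, 0 otherwise. */
int variants_agree(unsigned seed) {
    static char map[SMALL_SIZE];
    static int keys[SMALL_SIZE], res_f[SMALL_SIZE], res_s[SMALL_SIZE], tmp[SMALL_SIZE];

    srand(seed);
    for (int i = 0; i < SMALL_SIZE; i++) {
        map[i] = (char)(rand() & 1);   /* map entries are always 0 or 1 */
        keys[i] = rand();
    }

    int nf = check_fused(res_f, map, SMALL_SIZE, keys);
    int ns = check_split(res_s, map, SMALL_SIZE, keys, tmp);
    if (nf != ns) return 0;
    for (int i = 0; i < nf; i++) {
        if (res_f[i] != res_s[i]) return 0;
    }
    return 1;
}
```

Calling variants_agree with a few different seeds before timing gives some confidence that any speed difference is not caused by the two loops doing different work.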

Answer 2

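I don't think it is the array indexing; rather, the call to the hash() function may cause a pipeline stall and prevent optimal instruction reordering.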
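
One way to probe this hypothesis would be to compare the fused loop against a variant in which the hash is expanded by hand, so that no call can remain. The sketch below assumes the mask-based hash from Answer 1's test code; MAP_SIZE, hash_call, fused_call and fused_inline are hypothetical names. A one-line hash defined in the same file may well be inlined by an optimizing compiler already, in which case the two variants would be expected to perform alike.

```c
#define MAP_SIZE 1024   /* hypothetical power-of-two map size */

/* Ordinary function: the compiler may or may not inline this call. */
unsigned hash_call(int key) {
    return key & (MAP_SIZE - 1);
}

/* Fused loop that goes through the function call. */
int fused_call(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[hash_call(keys[i])];
    }
    return ret;
}

/* Fused loop with the mask written directly in the loop body. */
int fused_inline(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[(unsigned)keys[i] & (MAP_SIZE - 1)];
    }
    return ret;
}
```

Timing both under the same conditions as the fused and split loops would show whether the call itself contributes to the gap.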

Solution 1
8 Accepted 2012-06-20 17:47:17

Solution 2
1 2012-06-20 16:13:36