Parallel summation of an array is slower than sequential summation in C++

Question

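I wrote code for the parallel summation of an array using C++ std::thread. But the parallel sum takes 0.6 s, while the sequential sum takes 0.3 s.

I don't think this code performs any synchronization on arr or ret.

Why does this happen?

My CPU is an i7-8700, which has 6 physical cores.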

#include <stdio.h>
#include <ctime>
#include <thread>

// Constants
#define THREADS 4
#define ARR_SIZE 200000000
int ret[THREADS];

// Function for thread.
void parallel_sum(int *arr, int thread_id) {
    int s = ARR_SIZE / THREADS * thread_id, e = ARR_SIZE / THREADS * (thread_id + 1);
    printf("%d, %d\n", s, e);
    for (int i = s; i < e; i++) ret[thread_id] += arr[i];
}

int main() {

    // Variable definitions
    int *arr = new int[ARR_SIZE]; // 200 million ints

    time_t t1, t2; // Start and end timestamps for timing
    std::thread *threads = new std::thread[THREADS];

    // Initialization
    for (int i = 0; i < ARR_SIZE; i++) arr[i] = 1;
    for (int i = 0; i < THREADS; i++) ret[i] = 0;
    long long int sum = 0;

    // Parallel sum start
    t1 = clock();
    for (int i = 0; i < THREADS; i++) threads[i] = std::thread(parallel_sum, arr, i);
    for (int i = 0; i < THREADS; i++) threads[i].join();
    t2 = clock();

    for (int i = 0; i < THREADS; i++) sum += ret[i];
    printf("[%lf] Parallel sum %lld \n", (float)(t2 - t1) / (float)CLOCKS_PER_SEC, sum);
    // Parallel sum end


    sum = 0; // Initialization


    // Sequential sum start
    t1 = clock();
    for (int i = 0; i < ARR_SIZE; i++) sum += arr[i];
    t2 = clock();

    printf("[%lf] Sequential sum %lld \n", (float)(t2 - t1) / (float)CLOCKS_PER_SEC, sum);
    // Sequential sum end


    return 0;
}

Answer 1

for (int i = s; i < e; i++) ret[thread_id] += arr[i];

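This causes a lot of cache contention, because the elements of the ret array most likely share the same cache line. Every write by one thread invalidates that line in the other cores' caches, even though each thread only touches its own element. This is commonly called false sharing.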
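To make the layout concrete, here is a minimal sketch that assumes 64-byte cache lines and a 4-byte int (both typical on x86-64, neither guaranteed). The names `kCacheLine`, `PaddedCounter`, `plain_ret` and `padded_ret` are hypothetical and do not appear in the question. It shows that four adjacent ints fit inside a single line, and that padding each counter to its own line is one possible way to keep the threads apart.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines (common on x86-64, not guaranteed elsewhere).
constexpr std::size_t kCacheLine = 64;

// Hypothetical wrapper: alignas forces each counter onto its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    int value = 0;
};

constexpr int kThreads = 4;

// Layout as in the question: with 4-byte ints these are 16 contiguous bytes,
// so all four counters can sit in one cache line.
int plain_ret[kThreads];

// Padded layout: one full cache line per counter.
PaddedCounter padded_ret[kThreads];

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each padded counter occupies exactly one assumed cache line");
static_assert(alignof(PaddedCounter) == kCacheLine,
              "each padded counter starts on an assumed cache-line boundary");
```

Padding trades memory for isolation. It leaves the loop body writing to memory on every iteration, so it addresses the sharing but not the repeated stores.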

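A simple fix is to accumulate into a helper variable that is local to the thread inside the loop, and to update the shared counter only once at the end, for example: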

int temp = 0;
for (int i = s; i < e; i++) temp += arr[i];
ret[thread_id] += temp;

Alternatively, and preferably, use a single global ret of type std::atomic<int> to hold the multithreaded sum. Then you can simply write:

int temp = 0;
for (int i = s; i < e; i++) temp += arr[i];
ret += temp;

Or, more efficiently:

ret.fetch_add(temp, std::memory_order_relaxed);
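Putting the pieces together, the following is a sketch of how the local accumulator and the single atomic could be combined, not a definitive implementation. The function name `parallel_sum_local`, its parameters and the use of `std::vector` are assumptions made for illustration. It also uses `std::atomic<long long>` rather than `std::atomic<int>` so that large totals do not overflow, and it assumes `num_threads` is at least 1.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_threads` workers (num_threads >= 1).
// Each worker accumulates into a local variable and touches the shared atomic
// exactly once, so the hot loop never writes to shared memory.
long long parallel_sum_local(const std::vector<int>& data, unsigned num_threads) {
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        // The last worker also takes the remainder of the division.
        const std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &total, begin, end] {
            long long local = 0;  // thread-local accumulator
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // Relaxed ordering suffices: join() below provides the synchronization.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Calling `parallel_sum_local(values, 4)` on a vector of ones returns the vector's length. On GCC and Clang the program typically has to be built with `-pthread`.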

Answer 2

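With compiler optimizations enabled (there is no point in benchmarking any other way), I get the following results:

[0.093481] Parallel sum 200000000
[0.073333] Sequential sum 200000000

Note that in both cases the code records total CPU consumption, not elapsed time. It is not surprising that the parallel sum uses more total CPU: it has more work to do, because it must start the threads and aggregate their results.

You did not record wall-clock time. Since four cores contribute to the work, the wall-clock time of the parallel case is probably lower. Adding code to record elapsed time shows that the parallel version takes roughly half as long as the serial version, at least on my machine with reasonable compiler optimization settings.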
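The answer does not show the code it added to record elapsed time. One way to do it, sketched here with the hypothetical names `Timing` and `measure`, is to read `std::chrono::steady_clock` alongside `std::clock()`. This assumes a platform where `std::clock()` reports CPU time for the process, as on POSIX systems; some implementations report elapsed time instead.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical result type: CPU time and wall-clock time for one run.
struct Timing {
    double cpu_seconds;   // from std::clock(): processor time used by the process
    double wall_seconds;  // from steady_clock: real elapsed time
};

// Hypothetical helper: runs `work` once and reports both kinds of time.
template <typename F>
Timing measure(F&& work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    work();

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    const double cpu = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    const double wall = std::chrono::duration<double>(wall_end - wall_start).count();
    return Timing{cpu, wall};
}
```

Wrapping the thread launch and join in one call to `measure`, and the sequential loop in another, reports both figures side by side. For a multithreaded region the CPU figure can exceed the wall-clock figure, which is the effect described above.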


2 answers

Answer 1: 6, accepted, 2019-01-14 06:50:00
Answer 2: 4, 2019-01-14 06:48:08
