Are lock-free multithreaded programs slower than single-threaded ones?

Question

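I have been considering parallelizing a program so that, in a first phase, it sorts items into buckets modulo the number of parallel workers, which avoids collisions in the second phase. Each thread of the parallel program uses std::atomic::fetch_add to reserve a slot in the output array, and then uses std::atomic::compare_exchange_weak to update the head pointer of the current bucket. So it is lock-free. However, I had doubts about the performance of multiple threads fighting over a single atomic (the one we fetch_add on; since the number of bucket heads equals the number of threads, there is on average little contention on those), so I decided to measure it. Here is the code: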

#include <atomic>
#include <chrono>
#include <cstdint>  // int64_t, uint32_t
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

void BenchmarkAtomic() {
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs.emplace_back(CountingThread);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads, (nThreads * gnAtomicIterations) / nSec,
      (long long)gCounter.load(std::memory_order_acquire));

    thrs.clear();
    gCounter.store(0, std::memory_order_release);
  }
}

int main() {  // the original used the MSVC-only `__cdecl` qualifier here
  BenchmarkAtomic();
  return 0;
}

Here is the output:

1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000

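The CPU has 8 cores and 16 threads (Ryzen 1800X @ 3.9 GHz). So the total number of operations per second across all threads drops sharply until 4 threads are in use. After that it declines slowly and fluctuates a little.

So is this phenomenon common to other CPUs and compilers? Is there any workaround (other than falling back to a single thread)?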
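
For readers who want something concrete, the reservation scheme described at the top of the question might be sketched roughly as follows. This is only an illustration under assumptions: `Node`, `BucketTable`, `Insert` and `FillConcurrently` are hypothetical names, the buckets are modelled as index-linked lists inside the output array, and values are assumed to be non-negative. It is not the asker's actual code.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical node type: `next` is the index of the next node in the same
// bucket, or -1 at the end of the list.
struct Node {
  int value = 0;
  int next = -1;
};

struct BucketTable {
  std::vector<Node> out;                // preallocated output array
  std::atomic<int> nextSlot{0};         // the single counter all threads share
  std::vector<std::atomic<int>> heads;  // one list head per bucket

  BucketTable(std::size_t capacity, std::size_t nBuckets)
      : out(capacity), heads(nBuckets) {
    for (auto& h : heads) h.store(-1, std::memory_order_relaxed);
  }

  // Assumes value >= 0 and at most `capacity` inserts in total.
  void Insert(int value) {
    // Step 1: reserve a private slot in the output array.
    const int slot = nextSlot.fetch_add(1, std::memory_order_relaxed);
    const std::size_t b = static_cast<std::size_t>(value) % heads.size();
    out[slot].value = value;
    // Step 2: push the slot onto its bucket's list with a CAS loop.
    int oldHead = heads[b].load(std::memory_order_relaxed);
    do {
      out[slot].next = oldHead;  // a failed CAS refreshes oldHead
    } while (!heads[b].compare_exchange_weak(oldHead, slot,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
  }
};

// Each of nThreads workers inserts perThread distinct values.
void FillConcurrently(BucketTable& table, int nThreads, int perThread) {
  std::vector<std::thread> workers;
  for (int t = 0; t < nThreads; ++t) {
    workers.emplace_back([&table, t, perThread] {
      for (int i = 0; i < perThread; ++i) table.Insert(t * perThread + i);
    });
  }
  for (auto& w : workers) w.join();
}
```

In this sketch `nextSlot` plays the role of the single atomic that every thread hits, which is the part the benchmark above isolates.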

Answer 1

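Lock-free multithreaded programs are not inherently slower than single-threaded programs. What makes them slow is data contention. The example you provided is in fact a highly contended, artificial program. In a real program you would do a lot of work between accesses to shared data, so there would be fewer cache invalidations and so on. This CppCon talk by Jeff Preshing can explain some of your questions better than I can.

Addition: try modifying CountingThread to sleep once in a while, pretending you are busy with something other than incrementing the atomic variable gCounter. Then play with the value in the if statement to see how it influences the results of your program.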

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    // take a nap every 10000th iteration to simulate work on something
    // unrelated to access to shared resource
    if (i%10000 == 0) {
        std::chrono::milliseconds timespan(1);
        std::this_thread::sleep_for(timespan);
    }
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

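In general, every call to gCounter.fetch_add marks that data as invalid in the caches of the other cores. That forces them to fetch the data from a cache further away from the core. This effect is the main contributor to the slowdown in your program.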

local  L1 CACHE hit,                                  ~4 cycles   (   2.1 -  1.2 ns )
local  L2 CACHE hit,                                 ~10 cycles   (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared                   ~40 cycles   (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core     ~65 cycles   (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core        ~75 cycles   (  40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5])            ~100-300 cycles   ( 160.7 - 30.0 ns )
local  DRAM                                                         ~60 ns
remote DRAM                                                        ~100 ns

The table above is taken from "Approximate cost to access various caches and main memory?"

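Lock-free does not mean that you can exchange data between threads at no cost. Lock-free means that you do not have to wait for another thread to unlock a mutex before you can read shared data. In fact, even lock-free programs rely on locking mechanisms to prevent data corruption.

Just follow one simple rule: access shared data as little as possible to get more out of multicore programming.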
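
One way to apply that rule to the counter benchmark is to give every thread its own counter and only combine them after the threads have finished. The sketch below is an illustration, not a drop-in fix: `PaddedCounter` and `CountSharded` are hypothetical names, the 64-byte alignment is an assumed cache-line size, and storing over-aligned elements in a `std::vector` assumes C++17 or later.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. The alignment keeps each thread's counter
// on its own line, so increments do not invalidate other cores' caches.
struct alignas(64) PaddedCounter {
  std::int64_t value = 0;
};

// Each thread increments only its own slot; the slots are summed after join().
std::int64_t CountSharded(unsigned nThreads, std::int64_t itersPerThread) {
  std::vector<PaddedCounter> slots(nThreads);
  std::vector<std::thread> workers;
  workers.reserve(nThreads);
  for (unsigned t = 0; t < nThreads; ++t) {
    workers.emplace_back([&slots, t, itersPerThread] {
      for (std::int64_t i = 0; i < itersPerThread; ++i) {
        ++slots[t].value;  // no other thread touches slots[t]
      }
    });
  }
  for (auto& w : workers) w.join();  // join() makes the writes visible here

  std::int64_t total = 0;
  for (const auto& s : slots) total += s.value;
  return total;
}
```

No atomics are needed in this sketch because no memory location is written by more than one thread, and the main thread reads the slots only after joining the workers.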

Answer 2

It depends on the concrete workload.

See Amdahl's law:

                          100 % (whole workload in percentage)
speedup = ----------------------------------------------------------------------------
           (sequential workload in %) + (parallel workload in %) / (count of workers)

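The parallel workload in your program is 0 %, so the speedup is 1, a.k.a. no speedup. (You are synchronizing on the increment of a single memory cell, so only one thread can increment the cell at any given time.)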
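
To make the arithmetic concrete, the formula above can be written as a small function. `AmdahlSpeedup` is a hypothetical helper name, and the sketch assumes the sequential share is given as a percentage between 0 and 100 and that there is at least one worker.

```cpp
// Amdahl's law with workloads expressed in percent, as in the formula above.
// Assumes 0 <= sequentialPercent <= 100 and workers >= 1.
double AmdahlSpeedup(double sequentialPercent, unsigned workers) {
  const double parallelPercent = 100.0 - sequentialPercent;
  return 100.0 / (sequentialPercent + parallelPercent / workers);
}
```

With a fully sequential workload, `AmdahlSpeedup(100.0, 16)` evaluates to 100 / (100 + 0 / 16) = 1, which is the "no speedup" case described for the benchmark. A fully parallel workload on 4 workers gives 100 / (0 + 100 / 4) = 4.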

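A rough explanation of why it performs even worse than speedup=1:

With only one thread, the cache line containing gCounter stays in that CPU's cache.

With multiple threads scheduled onto different CPUs or cores, the cache line containing gCounter bounces between the caches of those CPUs or cores.

So the difference is somewhat comparable to incrementing a register with a single thread versus accessing memory for every increment. (Sometimes it is faster than a memory access, because modern CPU architectures support cache-to-cache transfers.)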

Answer 3

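As with most very broad "which is faster" questions, the only completely general answer is: it depends.

A good mental model is that when you parallelize an existing task, the runtime of the parallel version on N threads is made up of roughly three contributions:

A still-serial part common to both the serial and the parallel algorithm, i.e. work that was not parallelized, such as setup or teardown work, or work that did not run in parallel because the task was partitioned inexactly¹.
A parallel part that was effectively parallelized among the N workers.
An overhead component representing extra work done in the parallel algorithm that does not exist in the serial version. There is almost always some small overhead to partition the work, delegate it to worker threads and combine the results, but in some cases the overhead can swamp the real work.

So in general you have these three contributions; let us call them T1p, T2p and T3p respectively. The T1p component exists and takes the same time in both the serial and the parallel algorithm, so we can ignore it: it cancels out when determining which one is slower.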

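Of course, if you used coarser-grained synchronization, e.g. incrementing a local variable on each thread and only periodically (perhaps only once at the very end) updating the shared variable, the situation would reverse.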
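
A minimal sketch of that coarser-grained variant could look like the following. The names `CountBatched` and `RunBatched` and the `flushEvery` parameter are hypothetical and not part of the original benchmark; `flushEvery` is assumed to be positive.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Counts locally and publishes to the shared atomic once per `flushEvery`
// increments, plus one final flush for any remainder.
void CountBatched(std::atomic<std::int64_t>& shared, std::int64_t iterations,
                  std::int64_t flushEvery) {
  std::int64_t local = 0;
  for (std::int64_t i = 0; i < iterations; ++i) {
    ++local;
    if (local == flushEvery) {
      shared.fetch_add(local, std::memory_order_relaxed);
      local = 0;
    }
  }
  if (local != 0) {
    shared.fetch_add(local, std::memory_order_relaxed);
  }
}

// Runs nThreads workers and returns the final value of the shared counter.
std::int64_t RunBatched(unsigned nThreads, std::int64_t iterations,
                        std::int64_t flushEvery) {
  std::atomic<std::int64_t> shared{0};
  std::vector<std::thread> workers;
  workers.reserve(nThreads);
  for (unsigned t = 0; t < nThreads; ++t) {
    workers.emplace_back(CountBatched, std::ref(shared), iterations,
                         flushEvery);
  }
  for (auto& w : workers) w.join();
  return shared.load(std::memory_order_relaxed);
}
```

Setting `flushEvery` equal to `iterations` corresponds to updating the shared variable only once at the very end; smaller values trade some contention for a fresher shared total.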

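¹ This also includes the case where the workload was partitioned well but some threads got more work done per unit of time, which is common on modern CPUs and modern operating systems.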
