Using non-atomic and atomic operations at the same time

Question

I have a thread pool, and each thread holds its own counter (essentially TLS).

The main thread frequently needs to update a total by summing all of the thread-local counters.

Most of the time, each thread only increments its own counter, so no synchronization is needed.

But while the main thread is computing the total, I do of course need some kind of synchronization.

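I came up with the MSVC intrinsics (the _InterlockedXXX functions), which performed well (about 0.8 s in my test). However, they tie my code to the MSVC compiler and to the x86/AMD64 platforms. Is there a portable C++ way to do this?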

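I tried changing the counter type from int to std::atomic<int> and incrementing with std::memory_order_relaxed, but this solution is very slow (~4 s).
When I use the underlying member std::atomic<T>::_My_val, I can access the value non-atomically, the way I want, but that is not portable either, so the problem is the same...
Using a single std::atomic<int> shared by all threads is even slower because of the high contention (~10 s).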

Do you have any ideas? Should I use a library (Boost)? Or write my own class?

Answer 1

std::atomic<int>::fetch_add(1, std::memory_order_relaxed) is just as fast as _InterlockedIncrement.

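Visual Studio compiles the former to lock add $1 (or an equivalent) and the latter to lock inc, but there is no difference in execution time: on my system (Core i5 @ 3.30 GHz), each takes about 5630 ps/op, roughly 18.6 cycles.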
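The cycle figure follows from the measured time per operation and the clock rate (assuming the core runs at its nominal 3.30 GHz during the benchmark):

```latex
5630\,\text{ps/op} \times 3.30\,\text{GHz} = 5.63\,\text{ns/op} \times 3.30\,\text{cycles/ns} \approx 18.6\,\text{cycles/op}
```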

Microbenchmark using Benchpress:

#define BENCHPRESS_CONFIG_MAIN
#include "benchpress/benchpress.hpp"
#include <atomic>
#include <intrin.h>

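// Shared counter used by both benchmarks.
std::atomic<long> counter;
// Portable: relaxed atomic increment.
void f1(std::atomic<long>& counter) { counter.fetch_add(1, std::memory_order_relaxed); }
// MSVC intrinsic; the cast assumes std::atomic<long> has the same layout as long.
void f2(std::atomic<long>& counter) { _InterlockedIncrement((long*)&counter); }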
BENCHMARK("fetch_add_1", [](benchpress::context* ctx) {
    auto& c = counter; for (size_t i = 0; i < ctx->num_iterations(); ++i) { f1(c); }
})
BENCHMARK("intrin", [](benchpress::context* ctx) {
    auto& c = counter; for (size_t i = 0; i < ctx->num_iterations(); ++i) { f2(c); }
})

Output:

fetch_add_1                           200000000        5634 ps/op
intrin                                200000000        5637 ps/op
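Relaxed ordering only drops ordering guarantees, not atomicity: concurrent fetch_add calls with std::memory_order_relaxed never lose an increment. The portable sketch below checks this with std::thread instead of Benchpress; the helper count_with_relaxed_fetch_add is a hypothetical name, and the sketch says nothing about speed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Shared counter, as in the benchmark above.
std::atomic<long> counter{0};

// Hypothetical helper: starts num_threads workers that each perform
// per_thread relaxed increments on the shared counter, then returns the total.
long count_with_relaxed_fetch_add(int num_threads, long per_thread) {
    counter.store(0, std::memory_order_relaxed);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([per_thread] {
            for (long i = 0; i < per_thread; ++i) {
                // Atomic read-modify-write with no ordering constraints:
                // no increment is lost, but no other memory is synchronized.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with each worker's completion, so the load below
    // observes every increment.
    for (auto& w : workers) w.join();
    return counter.load(std::memory_order_relaxed);
}
```

For example, count_with_relaxed_fetch_add(4, 10000) returns 40000 on every run.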

Answer 2

I came up with this implementation, which works for me. However, I could not figure out how to write semi_atomic<T>::Set() portably:

#include <atomic>
#include <intrin.h> // MSVC-only, declares _InterlockedExchange

template <class T>
class semi_atomic {
    T Val;                    // plain value, incremented by the owning thread
    std::atomic<T> AtomicVal; // atomically published copy of Val
public:
    semi_atomic() : Val(0), AtomicVal(0) {}
    // Increment has no need for synchronization.
    T Increment() {
        return ++Val;
    }
    // Store the non-atomic value atomically and return it.
    // Note: the read of Val is itself non-atomic, so calling Get() from another
    // thread while Increment() runs is still a data race in ISO C++.
    T Get() {
        AtomicVal.store(Val, std::memory_order_release);
        return AtomicVal.load(std::memory_order_relaxed);
    }
    // Load NewVal into Val, but in an atomic way (?)
    void Set(T NewVal) {
        _InterlockedExchange((volatile long*)&Val, NewVal); // And with C++11 ??
    }
};

Thanks, and let me know if anything is wrong with it!
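On the open question of a C++11 Set(): one portable option, sketched here under the assumption that only the owning thread ever writes the counter, is to make the value itself a std::atomic<T> and have the owner increment it with a relaxed load followed by a relaxed store instead of an atomic read-modify-write. This is a different technique from semi_atomic<T> above, not a completion of it, and the class name single_writer_counter is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-writer counter: exactly ONE thread (the owner) modifies
// the value; any thread may read it.
template <class T>
class single_writer_counter {
    std::atomic<T> Val{0};
public:
    // Owner thread only: a relaxed load + store, not a read-modify-write.
    // With a single writer no update can be lost, and on x86/AMD64 this
    // typically compiles to plain (non-locked) instructions.
    T Increment() {
        T next = Val.load(std::memory_order_relaxed) + 1;
        Val.store(next, std::memory_order_relaxed);
        return next;
    }
    // Any thread: an atomic read, so there is no data race with Increment().
    T Get() const {
        return Val.load(std::memory_order_relaxed);
    }
    // Owner thread only (or while the owner is not incrementing): a store from
    // another thread could be overwritten by a concurrent Increment().
    void Set(T NewVal) {
        Val.store(NewVal, std::memory_order_relaxed);
    }
};
```

If the main thread must reset counters while workers are running, this pattern is not enough on its own; that case needs a read-modify-write such as exchange, or a hand-off back to the owner thread.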

Answer 3

You are definitely right: for portability, each thread needs a std::atomic<int>, even if it is somewhat slow.
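A portable sketch of that per-thread layout, assuming C++17 (so that std::vector honors the over-alignment) and a 64-byte cache line: each worker owns one std::atomic<long> padded to its own cache line and increments it with relaxed ordering, while the main thread sums relaxed loads. PaddedCounter and PerThreadCounters are hypothetical names, and whether false sharing contributed to the ~4 s measured in the question is not established here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines (typical for x86/AMD64). Giving each counter
// its own line keeps neighbouring counters from false sharing.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

class PerThreadCounters {
    std::vector<PaddedCounter> slots;  // one slot per worker thread
public:
    explicit PerThreadCounters(std::size_t num_threads) : slots(num_threads) {}

    // Called only by the worker that owns slot `index`.
    void increment(std::size_t index) {
        slots[index].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Called by the main thread at any time. Each load is atomic, so the sum
    // is race-free, but it is not one consistent snapshot of all slots.
    long sum() const {
        long total = 0;
        for (const auto& s : slots) total += s.value.load(std::memory_order_relaxed);
        return total;
    }
};
```

Typical use: worker t calls counters.increment(t) in its loop, and the main thread calls counters.sum() whenever it needs the total.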

However, on the x86 and AMD64 architectures, it can be optimized (a lot).

Here is what I came up with; sInt is a signed 32-bit or 64-bit integer.

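// Here's the magic: a volatile read. On x86/AMD64 an aligned load of this
// width is atomic in hardware, but ISO C++ does not guarantee it.
inline sInt MyInt::GetValue() {
    return *(volatile sInt*)&Value;
}

// Interlocked intrinsic is atomic
// (assumes sInt matches the intrinsic's parameter type: long on x86, __int64 on x64)
inline void MyInt::SetValue(sInt NewValue) {
#ifdef _M_IX86
    _InterlockedExchange((volatile sInt *)&Value, NewValue);
#else
    _InterlockedExchange64((volatile sInt *)&Value, NewValue);
#endif
}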

This code works in MSVS on the x86 architecture family (GetValue() requires it).
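For comparison, a portable counterpart of MyInt, sketched under the assumption that sInt is std::int64_t (MyIntPortable is a hypothetical name): a relaxed atomic load replaces the volatile read, and std::atomic::exchange replaces the Interlocked exchange. On AMD64, mainstream compilers typically emit a plain mov for the relaxed load and an xchg for the exchange, so the generated code should be close to the MSVC version; checking the disassembly for your compiler is the way to confirm it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical portable counterpart of MyInt; assumes sInt is std::int64_t.
class MyIntPortable {
    std::atomic<std::int64_t> Value{0};
public:
    // Replaces the volatile read: a relaxed atomic load, well-defined on every
    // platform.
    std::int64_t GetValue() const {
        return Value.load(std::memory_order_relaxed);
    }
    // Replaces _InterlockedExchange/_InterlockedExchange64: an atomic exchange
    // (sequentially consistent by default); the old value is discarded.
    void SetValue(std::int64_t NewValue) {
        Value.exchange(NewValue);
    }
};
```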

3 answers

Answer 1
2 2016-04-11 11:27:48

Answer 2
0 2016-04-11 11:24:00

Answer 3
0 2016-04-12 08:05:06