
Use non-atomic and atomic operations at the same time

I have a pool of threads, and each thread holds its own counter (basically thread-local storage).

A master thread frequently needs an up-to-date total, computed as the sum of all the thread-local counters.

Most of the time, each thread increments its own counter, so no synchronization is needed.

But while the master thread is reading the counters, I of course need some kind of synchronization.

I came up with the MSVC intrinsics (the _InterlockedXXX functions), and they showed great performance (~0.8 s in my test). However, they tie my code to MSVC compilers and x86/AMD64 platforms. Is there a portable C++ way to do this?

  • I tried changing the counter's int type to std::atomic<int>, using std::memory_order_relaxed for the increments, but this solution is very slow! (~4 s; a minimal sketch of this approach is shown below the list)

  • When accessing the base member std::atomic<T>::_My_val directly, the value is read non-atomically, as I want, but that is not portable either, so the problem is the same...

  • Using a single std::atomic<int> shared by all threads is even slower, due to high contention (~10 s)

Do you have any ideas? Should I use a library (Boost)? Or write my own class?
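
For reference, here is a minimal sketch of that per-thread std::atomic approach (the names PaddedCounter, kNumThreads, Increment and Sum are mine, for illustration only): each worker bumps its own counter with a relaxed fetch_add, and the master sums all slots with relaxed loads.

#include <atomic>
#include <cstddef>

constexpr std::size_t kNumThreads = 4; // assumption: a fixed number of workers

// One atomic counter per worker, padded to avoid false sharing.
struct PaddedCounter {
    std::atomic<long> value{0};
    char pad[64 - sizeof(std::atomic<long>)];
};

PaddedCounter counters[kNumThreads];

// Worker thread: relaxed increment of its own slot; no ordering needed.
void Increment(std::size_t index) {
    counters[index].value.fetch_add(1, std::memory_order_relaxed);
}

// Master thread: atomic relaxed loads of every slot; the total is an
// approximate snapshot while the workers keep running.
long Sum() {
    long total = 0;
    for (std::size_t i = 0; i < kNumThreads; ++i)
        total += counters[i].value.load(std::memory_order_relaxed);
    return total;
}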

std::atomic<int>::fetch_add(1, std::memory_order_relaxed) is just as fast as _InterlockedIncrement.

Visual Studio compiles the former to lock add $1 (or equivalent) and the latter to lock inc, but there is no difference in execution time; on my system (Core i5 @ 3.30 GHz) each takes about 5630 ps/op, around 18.5 cycles.

Microbenchmark using Benchpress:

#define BENCHPRESS_CONFIG_MAIN
#include "benchpress/benchpress.hpp"
#include <atomic>
#include <intrin.h>

std::atomic<long> counter;
void f1(std::atomic<long>& counter) { counter.fetch_add(1, std::memory_order_relaxed); }
void f2(std::atomic<long>& counter) { _InterlockedIncrement((long*)&counter); }
BENCHMARK("fetch_add_1", [](benchpress::context* ctx) {
    auto& c = counter; for (size_t i = 0; i < ctx->num_iterations(); ++i) { f1(c); }
})
BENCHMARK("intrin", [](benchpress::context* ctx) {
    auto& c = counter; for (size_t i = 0; i < ctx->num_iterations(); ++i) { f2(c); }
})

Output:

fetch_add_1                           200000000        5634 ps/op
intrin                                200000000        5637 ps/op

I came up with the following implementation, which suits me. However, I can't find a portable way to write semi_atomic<T>::Set().

#include <atomic>
#include <intrin.h> // for _InterlockedExchange

template <class T>
class semi_atomic {
    T Val;
    std::atomic<T> AtomicVal;

public:
    semi_atomic() : Val(0), AtomicVal(0) {}

    // Increment has no need for synchronization.
    inline T Increment() {
        return ++Val;
    }

    // Store the non-atomic value atomically and return it.
    inline T Get() {
        AtomicVal.store(Val, std::memory_order_release);
        return AtomicVal.load(std::memory_order_relaxed);
    }

    // Load _Val into Val, but in an atomic way (?)
    inline void Set(T _Val) {
        _InterlockedExchange((volatile long*)&Val, _Val); // And with C++11 ??
    }
};
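
(For comparison, one possible portable sketch of Set(), assuming Val itself were declared as std::atomic<T>: std::atomic<T>::exchange() is the C++11 counterpart of _InterlockedExchange, and a relaxed store would do if the full barrier is not needed. Whether the relaxed fetch_add is fast enough is exactly what is being debated above.)

// Portable sketch, assuming the member were declared as: std::atomic<T> Val;
// exchange() is the C++11 counterpart of _InterlockedExchange (full barrier).
inline void Set(T _Val) {
    Val.exchange(_Val, std::memory_order_seq_cst);
}

// Increment() and Get() would then become:
inline T Increment() { return Val.fetch_add(1, std::memory_order_relaxed) + 1; }
inline T Get()       { return Val.load(std::memory_order_relaxed); }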

Thank you, and tell me if something is wrong!

You are definitely right: a std::atomic<int> per thread is needed for portability, even if it is somewhat slower.

However, it can be optimized quite a bit on x86 and AMD64 architectures.

Here's what I got, with sInt being a signed 32- or 64-bit integer:

// Here's the magic: on x86/AMD64, an aligned load of a native-width integer
// is already atomic, so a volatile read is enough here.
inline sInt MyInt::GetValue() {
    return *(volatile sInt*)&Value;
}

// The Interlocked intrinsic makes the store atomic (with a full barrier).
inline void MyInt::SetValue(sInt _Value) {
#ifdef _M_IX86
    _InterlockedExchange((volatile long*)&Value, _Value);
#else
    _InterlockedExchange64((volatile __int64*)&Value, _Value);
#endif
}

This code will work with MSVC on an x86/AMD64 architecture (which is what makes GetValue() safe).
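
For what it's worth, a portable C++11 sketch with the same intent (an assumption on my part, not the code above) would declare Value as std::atomic<sInt>: on x86/AMD64 a relaxed load compiles to a plain mov, just like the volatile cast, and a sequentially consistent store is emitted with an implicit full barrier, comparable to _InterlockedExchange.

// Portable sketch, assuming the member were declared as: std::atomic<sInt> Value;
inline sInt MyInt::GetValue() {
    // Relaxed atomic load: a plain mov on x86/AMD64, no lock prefix.
    return Value.load(std::memory_order_relaxed);
}

inline void MyInt::SetValue(sInt _Value) {
    // Sequentially consistent store: full-barrier behavior on every platform.
    Value.store(_Value, std::memory_order_seq_cst);
}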
