简体   繁体   中英

c++11 atomic<int>++ much slower than std::mutex protected int++, why?

To compare the performance difference between std::atomic<int> ++ and std::mutex protected int ++, I have this test program:

#include <iostream>
#include <atomic>
#include <mutex>
#include <thread>
#include <chrono>
#include <limits>
using namespace std;
#ifndef INT_MAX
const int INT_MAX = numeric_limits<std::int32_t>::max();
const int INT_MIN = numeric_limits<std::int32_t>::min();
#endif
using std::chrono::steady_clock;
const size_t LOOP_COUNT = 12500000;
const size_t THREAD_COUNT = 8;
int intArray[2] = { 0, INT_MAX };
atomic<int> atomicArray[2];
void atomic_tf() {//3.19s
    for (size_t i = 0; i < LOOP_COUNT; ++i) {
        atomicArray[0]++;
        atomicArray[1]--;
    }
}
mutex m;
void mutex_tf() {//0.25s
    m.lock();
    for (size_t i = 0; i < LOOP_COUNT; ++i) {
        intArray[0]++;
        intArray[1]--;
    }
    m.unlock();
}
int main() {
    {
        atomicArray[0] = 0;
        atomicArray[1] = INT_MAX;
        thread tp[THREAD_COUNT];
        steady_clock::time_point t1 = steady_clock::now();
        for (size_t t = 0; t < THREAD_COUNT; ++t) {
            tp[t] = thread(atomic_tf);
        }
        for (size_t t = 0; t < THREAD_COUNT; ++t) {
            tp[t].join();
        }
        steady_clock::time_point t2 = steady_clock::now();
        cout << (float)((t2 - t1).count()) / 1000000000 << endl;
    }
    {
        thread tp[THREAD_COUNT];
        steady_clock::time_point t1 = steady_clock::now();
        for (size_t t = 0; t < THREAD_COUNT; ++t) {
            tp[t] = thread(mutex_tf);
        }
        for (size_t t = 0; t < THREAD_COUNT; ++t) {
            tp[t].join();
        }
        steady_clock::time_point t2 = steady_clock::now();
        cout << (float)((t2 - t1).count()) / 1000000000 << endl;
    }
    return 0;
}

I ran this program on windows/linux many times (compiled with clang++14, g++12), basically same result.

  1. atomic_tf will take 3+ seconds

  2. mutex_tf will take 0.25+ seconds.

Almost 10 times of performance difference.

My question is, if my test program is valid, then does it indicate that using atomic variable is much more expensive compared with using mutex + normal variables?

How does this performance difference come from? Thanks!

Your mutex version locks the mutex once, then does 12500000 iterations without paying any additional cost for thread synchronization mechanism.

In your atomic version you pay the cost of the atomic synchronization for every increment, and every decrement of the atomic value (each happens 12500000 times).

Therefore your test does not really compare the performance of mutex vs atomic .

If you want to do that, you can try to lock and unlock the mutex for every increment or decrement of the value.

Something like:

void mutex_tf() 
{
    for (size_t i = 0; i < LOOP_COUNT; ++i) 
    {
        m.lock();
        intArray[0]++;
        m.unlock(); 

        m.lock();
        intArray[1]--;
        m.unlock(); 
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM