OpenMP critical performance better than atomic

I'm trying the code from https://github.com/joeladams/patternlets/blob/master/patternlets/openMP/14.mutualExclusion-critical2/critical2.c to demonstrate that critical is more time-expensive than atomic, but I keep getting results in which critical has a faster execution time than atomic. Does anyone know how this happens?

// simulate many deposits using atomic
// (the loop variable i and the print() helper are declared earlier in the linked critical2.c)
startTime = omp_get_wtime();
#pragma omp parallel for 
for (i = 0; i < REPS; i++) {
    #pragma omp atomic
    balance += 1.0;
}
stopTime = omp_get_wtime();
atomicTime = stopTime - startTime;
print("atomic", REPS, balance, atomicTime, atomicTime/REPS);


// simulate the same number of deposits using critical
balance = 0.0;
startTime = omp_get_wtime();
#pragma omp parallel for 
for (i = 0; i < REPS; i++) {
     #pragma omp critical
     {
         balance += 1.0;
     }
}
stopTime = omp_get_wtime();
criticalTime = stopTime - startTime;
print("critical", REPS, balance, criticalTime, criticalTime/REPS);

My result is:

After 1000000 $1 deposits using 'atomic':
        - balance = 1000000.00,
        - total time = 0.421999931335,
        - average time per deposit = 0.000000422000

After 1000000 $1 deposits using 'critical':
        - balance = 0.00,
        - total time = 0.265000104904,
        - average time per deposit = 0.000000265000

Thanks!

Posting an answer since this scenario can indeed exist (it strongly depends on the number of threads you're using).

Consider three cases of the OP's computation (updating balance with 10^7 increments of one each, starting from a value of 0) to time and compare: one without any form of parallelization (i.e. without OpenMP directives), one using omp critical for the updates on balance, and one using omp atomic for the same:

#include <omp.h>
#include <iostream>

int main()
{   const int REPS = 1e+7;
    double balance = 0.0;

    std::cout << "Running without any explicit parallelization from openmp:" << std::endl;
    auto startTime = omp_get_wtime();
    for (int i = 0; i < REPS; i++) {
        balance += 1.0;
    }
    auto stopTime = omp_get_wtime();
    std::cout << "Balance: " << balance << ", Total time taken: " << stopTime - startTime << std::endl;
    // Reset balance:
    balance = 0.0;

    std::cout << "Running with omp critical:" << std::endl;
    startTime = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < REPS; i++) {
        #pragma omp critical
        balance += 1.0;
    }
    stopTime = omp_get_wtime();
    std::cout << "Balance: " << balance << ", Total time taken: " << stopTime - startTime << std::endl;
    // Reset balance:
    balance = 0.0;

    std::cout << "Running with omp atomic:" << std::endl;
    startTime = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < REPS; i++) {
        #pragma omp atomic
        balance += 1.0;
    }
    stopTime = omp_get_wtime();
    std::cout << "Balance: " << balance << ", Total time taken: " << stopTime - startTime << std::endl;
}

Here are two runs on my machine:

[Run output (0): two runs with the default thread count]

For single operations where omp atomic is applicable, it is indeed expected to run faster than omp critical; however, the opposite can happen, and atomic can occasionally be slightly more time-consuming. For instance, the expectation didn't hold for my second run above. I'm assuming this is one case of 'what should be faster appears to be a bit slower', which can occur when you're not specifying the number of threads (the rare case). It is highly unlikely to get this result (atomic time > critical time) every time, as the OP describes, with more threads, or even when just going by the default value on modern computers, given that they have a handful of cores (the default thread count is usually the number of cores, and can be twice that or more thanks to hyperthreading).
In fact, it took me a few tries to replicate such a run at all. (I wanted to show two runs where both outcomes occur, while going with the default number of threads, i.e. without specifying the thread count.)
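
As a side note, a quick way to see the default thread count OpenMP picks on a given machine is a small check like the following (a standalone sketch, not part of the benchmark):

#include <omp.h>
#include <iostream>

int main()
{   // omp_get_max_threads() reports how many threads a parallel region
    // will use when no thread count is specified explicitly.
    std::cout << "Default thread count: " << omp_get_max_threads() << std::endl;
}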

Before I talk about the case where such behaviour can happen every time, I would like to mention that the reason I incorporated the first case (without OpenMP directives) is to show that it is faster than the cases where omp critical or omp atomic is used, given that the code segment we are running is inherently sequential. By going with critical/atomic, we are just introducing additional overhead through locking and unlocking, the same as what you would expect when comparing such a loop without and with mutex locks when using pthreads, as sketched below.
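
For illustration, here is a minimal sketch of that pthreads analogy (the names deposit and balanceLock are mine, not from the patternlet): the mutex-guarded loop produces the correct total, but pays a lock/unlock on every iteration, just as omp critical does.

#include <pthread.h>
#include <iostream>

const int REPS = 1e+7;
double balance = 0.0;
pthread_mutex_t balanceLock = PTHREAD_MUTEX_INITIALIZER;

void* deposit(void*)
{   for (int i = 0; i < REPS / 2; i++) {
        pthread_mutex_lock(&balanceLock);    // overhead on every iteration
        balance += 1.0;
        pthread_mutex_unlock(&balanceLock);
    }
    return nullptr;
}

int main()
{   pthread_t t1, t2;
    pthread_create(&t1, nullptr, deposit, nullptr);
    pthread_create(&t2, nullptr, deposit, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    // Correct result (1e+07), but slower than the plain sequential loop.
    std::cout << "Balance: " << balance << std::endl;
}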

Now, if I were to replicate the situation the OP faced (atomic consistently taking more time than critical), I would set my thread count to the lowest value that still keeps the program multi-threaded, i.e. 2:

omp_set_num_threads(2);
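
(As with any omp_set_num_threads call, this has to appear before the parallel regions it is meant to affect, e.g. near the top of main.)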

Now one can observe that the time taken by the omp atomic block is always greater than the time taken by the omp critical block. For instance, two runs with this modification (no longer a rare case):

[Run output (1): two runs with omp_set_num_threads(2)]

Fix

In order to make the atomic variant always run faster than the critical equivalent in both scenarios, a reduction clause can be used with the appropriate parameters: here the '+' operator, since we are performing addition, followed by the variable we are aggregating:

#pragma omp parallel for reduction(+:balance)
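
Applied to the atomic case (case three), the loop becomes the sketch below (variables as in the listing above). With the reduction clause, each thread accumulates into its own private copy of balance, so the atomic update no longer contends with other threads, and the private copies are summed into the shared variable once at the end:

    #pragma omp parallel for reduction(+:balance)
    for (int i = 0; i < REPS; i++) {
        #pragma omp atomic    // now updates a thread-private copy
        balance += 1.0;
    }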

Two runs after making this change (to both cases two and three), while keeping the thread count at 2:

[Run output (2): two runs with reduction(+:balance) and 2 threads]

Finally, omp atomic now runs much faster than the critical equivalent, in line with the general expectation.

Another set of two runs with this change, but without setting the number of threads:

[Run output (3): two runs with reduction(+:balance) and the default thread count]

Not only does omp atomic outperform omp critical, it now always runs even faster than case one (without OpenMP directives): the parallelization finally pays off.

I guess incrementing a floating-point number is different from incrementing an integer; it depends on the CPU architecture. When I test with integers, the result is as expected.

See my results: atomic is more than twice as fast as critical, but both are still much slower than using neither atomic nor critical, even though the unsynchronized result is incorrect.

So, try your best to avoid locks, critical, and atomic if possible.

Test results (final value, elapsed time in seconds):

without atomic/critical: 6666667, 0.000113381

atomic: 10000000, 0.399095

critical: 10000000, 0.999381
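
The exact code behind these numbers was not posted, so here is a hypothetical reconstruction of that kind of integer benchmark (names and structure are my assumption):

#include <omp.h>
#include <iostream>

int main()
{   const int REPS = 1e+7;
    long count = 0;

    // No synchronization: fastest, but a deliberate data race, so the
    // final count is wrong (updates are lost).
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < REPS; i++)
        count++;
    std::cout << "without atomic, critical: " << count << ", "
              << omp_get_wtime() - start << std::endl;

    // atomic: correct; an integer increment typically maps to a single
    // locked hardware instruction.
    count = 0;
    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < REPS; i++) {
        #pragma omp atomic
        count++;
    }
    std::cout << "atomic: " << count << ", "
              << omp_get_wtime() - start << std::endl;

    // critical: correct, but pays a full lock/unlock on every iteration.
    count = 0;
    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < REPS; i++) {
        #pragma omp critical
        count++;
    }
    std::cout << "critical: " << count << ", "
              << omp_get_wtime() - start << std::endl;
}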
