Spinning thread barrier using Atomic Builtins

Question

I'm trying to implement a spinning thread barrier using atomics, specifically __sync_fetch_and_add. https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html

I basically want an alternative to the pthread barrier. I'm using Ubuntu on a system that can run about a hundred threads in parallel.

int bar = 0;                      //global variable
 int P = MAX_THREADS;              //number of threads

 __sync_fetch_and_add(&bar,1);     //each thread comes and adds atomically
 while(bar<P){}                    //threads spin until bar increments to P
 bar=0;                            //a thread sets bar=0 to be used in the next spinning barrier

This does not work for obvious reasons (a thread may set bar=0, and another thread gets stuck in an infinite while loop etc). I saw an implementation here: Writing a (spinning) thread barrier using c++11 atomics, however it seems too complex and I think its performance might be worse than a pthread barrier.

This implementation is also expected to produce more traffic within the memory hierarchy due to bar's cache line being ping-ponged among threads.

Any ideas on how to use these atomic instructions to make a simple barrier? A communication-optimal scheme would also be helpful additionally.

Answer 1

Instead of spinning on the counter of the threads , it is better to spin on the number of the barries passed , which will be incremented only by the last thread, faced the barrier. Such way you also reduce memory cache pressure, as spinning variable is now updated only by single thread.

int P = MAX_THREADS;
int bar = 0; // Counter of threads, faced barrier.
volatile int passed = 0; // Number of barriers, passed by all threads.

void barrier_wait()
{
    int passed_old = passed; // Should be evaluated before incrementing *bar*!

    if(__sync_fetch_and_add(&bar,1) == (P - 1))
    {
        // The last thread, faced barrier.
        bar = 0;
        // *bar* should be reseted strictly before updating of barriers counter.
        __sync_synchronize(); 
        passed++; // Mark barrier as passed.
    }
    else
    {
        // Not the last thread. Wait others.
        while(passed == passed_old) {};
        // Need to synchronize cache with other threads, passed barrier.
        __sync_synchronize();
    }
}

Note, that you need to use volatile modificator for spinning variable.

C++ code could be somewhat faster than C one, as it can use acquire / release memory barriers instead of the full one, which is the only barrier available from __sync functions:

int P = MAX_THREADS;
std::atomic<int> bar = 0; // Counter of threads, faced barrier.
std::atomic<int> passed = 0; // Number of barriers, passed by all threads.

void barrier_wait()
{
    int passed_old = passed.load(std::memory_order_relaxed);

    if(bar.fetch_add(1) == (P - 1))
    {
        // The last thread, faced barrier.
        bar = 0;
        // Synchronize and store in one operation.
        passed.store(passed_old + 1, std::memory_order_release);
    }
    else
    {
        // Not the last thread. Wait others.
        while(passed.load(std::memory_order_relaxed) == passed_old) {};
        // Need to synchronize cache with other threads, passed barrier.
        std::atomic_thread_fence(std::memory_order_acquire);
    }
}

Spinning thread barrier using Atomic Builtins

Question

1 answers

solution1
2 ACCPTED 2015-11-09 12:47:53

Spinning thread barrier using Atomic Builtins

Question

1 answers

solution1 2 ACCPTED 2015-11-09 12:47:53

solution1
2 ACCPTED 2015-11-09 12:47:53