Understanding the method for OpenCL reduction on float

Question

Following this link, I am trying to understand how the kernel code works. There are two versions of this kernel, one taking volatile local float *source and the other volatile global float *source (i.e. local and global versions). Below is the local version:

float sum=0;
void atomic_add_local(volatile local float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;

    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;

    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}

If I understand correctly, each work-item shares access to the source variable thanks to the " volatile " qualifier, doesn't it?

Then, for a given work-item, the code adds operand to the value it read and stores the result in newVal.floatVal . After this operation, I call the atomic_cmpxchg function, which checks whether the previous assignments ( prevVal.floatVal = *source; and newVal.floatVal = prevVal.floatVal + operand; ) have been done, i.e. by comparing the value stored at address source with prevVal.intVal .

During this atomic operation (which is uninterruptible by definition), as the value stored at source is different from prevVal.intVal , the new value stored at source is newVal.intVal , which is actually a float (because it is stored on 4 bytes, like the integer).

Can we say that each work-item has mutex access (I mean locked access) to the value located at the source address?

But for each work-item, is there only one iteration of the while loop?

I think there will be only one iteration, because the comparison " *source == prevVal.intVal ? newVal.intVal : newVal.intVal " will always assign newVal.intVal to the value stored at the source address, won't it?

Any help is welcome, because I have not understood all the subtleties of the trick used in this kernel code.

UPDATE 1 :

Sorry, I now understand almost all the subtleties, especially in the while loop:

First case: for a given thread, if *source is still equal to prevVal.floatVal when atomic_cmpxchg is called, then atomic_cmpxchg changes the value that source points to and returns the old value, which is equal to prevVal.intVal , so we break out of the while loop.

Second case: if, between the prevVal.floatVal = *source; instruction and the call to atomic_cmpxchg , the value *source has changed (because of another thread?), then atomic_cmpxchg returns an old value which is no longer equal to prevVal.intVal , so the while condition is true and we stay in the loop until that condition no longer holds.

Is my interpretation correct?

Thanks

Answer 1

If I understand correctly, each work-item shares access to the source variable thanks to the " volatile " qualifier, doesn't it?

volatile is a keyword of the C language that prevents the compiler from optimizing accesses to a specific location in memory (in other words, it forces a load/store at each read/write of that memory location). It has no impact on the ownership of the underlying storage. Here, it is used to force the compiler to re-read source from memory at each loop iteration (otherwise the compiler would be allowed to hoist that load out of the loop, which would break the algorithm).

do {
    prevVal.floatVal = *source; // Force read, prevent hoisting outside loop.
    newVal.floatVal = prevVal.floatVal + operand;
} while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);

After removing qualifiers (for simplicity) and renaming parameters, the signature of atomic_cmpxchg is the following:

int atomic_cmpxchg(int *ptr, int expected, int new)

What it does is:

atomically {
    int old = *ptr;

    if (old == expected) {
        *ptr = new;
    }

    return old;
}
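The same semantics can be tried out on the host. Below is a minimal sketch in C11, assuming <stdatomic.h> as a stand-in for the OpenCL built-in; cmpxchg_u32 and cmpxchg_demo are illustrative names, not part of OpenCL. The C11 call returns a success flag and writes the observed value back through its second argument, so the wrapper adapts it to return the old value the way atomic_cmpxchg does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL's atomic_cmpxchg:
 * stores `desired` only if *ptr == expected, and always returns
 * the value that was in *ptr before the call. */
uint32_t cmpxchg_u32(_Atomic uint32_t *ptr, uint32_t expected, uint32_t desired)
{
    uint32_t old = expected;
    /* On failure, C11 overwrites `old` with the value it actually found.
     * On success, `old` is left equal to `expected`, which was the old value. */
    atomic_compare_exchange_strong(ptr, &old, desired);
    return old;
}

/* Usage: the first exchange succeeds, the second one is refused. */
void cmpxchg_demo(void)
{
    _Atomic uint32_t cell;
    atomic_init(&cell, 5);

    assert(cmpxchg_u32(&cell, 5, 9) == 5);  /* matched: cell becomes 9 */
    assert(cmpxchg_u32(&cell, 5, 7) == 9);  /* stale expectation: cell stays 9 */
    assert(atomic_load(&cell) == 9);
}
```

A caller detects success by comparing the return value with the expected value it passed in, which is exactly what the while condition in the kernel does.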

To summarize, each thread, individually, does:

1. Load the current value of *source from memory into prevVal.floatVal
2. Compute the desired value of *source in newVal.floatVal
3. Execute the atomic compare-exchange described above (using the type-punned values)
4. If the result of atomic_cmpxchg == prevVal.intVal , the compare-exchange was successful, so break. Otherwise the exchange didn't happen, so go to 1 and try again.

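The above loop eventually terminates: a compare-exchange only fails when another thread's update went through in the meantime, so some thread always makes progress, and as long as each thread performs a finite number of additions, every thread eventually succeeds in its atomic_cmpxchg .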
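The retry behaviour can be reproduced deterministically on the host. This is a minimal sketch in C11, assuming <stdatomic.h> in place of the OpenCL built-ins and memcpy in place of the union for type punning; float_bits, bits_float, try_add_once and atomic_add_float are illustrative names, not part of OpenCL or of the original kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Reinterpret the 32 bits of a float as an integer and back, unchanged. */
uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* One pass through the loop body, given the snapshot `prev_bits` that the
 * thread read earlier. Returns 1 if the store happened, 0 if the cell had
 * changed since the snapshot, in which case the caller must retry. */
int try_add_once(_Atomic uint32_t *cell, uint32_t prev_bits, float operand)
{
    uint32_t new_bits = float_bits(bits_float(prev_bits) + operand);
    return atomic_compare_exchange_strong(cell, &prev_bits, new_bits);
}

/* The full retry loop: re-read, recompute, attempt, repeat on failure. */
void atomic_add_float(_Atomic uint32_t *cell, float operand)
{
    while (!try_add_once(cell, atomic_load(cell), operand)) {
        /* another thread won the race; go around with a fresh snapshot */
    }
}
```

To see the failing case without real threads, take a snapshot of the cell, let a second call to atomic_add_float change the cell, and then call try_add_once with the stale snapshot: it returns 0 and leaves the cell untouched, which is the situation that sends a work-item around the loop again.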

Can we say that each work-item has mutex access (I mean locked access) to the value located at the source address?

Mutexes are locks, while this is a lock-free algorithm. OpenCL can simulate mutexes with spinlocks (also implemented with atomics) but this is not one.
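For contrast, here is what a lock-based version could look like. It is a minimal host-side sketch in C11, assuming atomic_flag from <stdatomic.h>; demo_spinlock, spin_lock, spin_unlock and locked_add are hypothetical names, and this is not code from the kernel in the question. The difference is that a thread holding the lock makes every other thread wait, whereas in the compare-exchange loop a failed attempt always means some other thread has already completed its update.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical spinlock type, shown for comparison only. */
typedef struct { atomic_flag held; } demo_spinlock;

void spin_lock(demo_spinlock *l)
{
    /* test_and_set returns the previous state: spin while it was already set */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        /* busy-wait until the holder clears the flag */
    }
}

void spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Float addition guarded by the lock instead of a compare-exchange loop. */
void locked_add(demo_spinlock *l, float *sum, float operand)
{
    assert(l && sum);
    spin_lock(l);
    *sum += operand;   /* plain, non-atomic update inside the critical section */
    spin_unlock(l);
}
```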

solution1
1 ACCPTED 2017-01-31 06:21:37
