
C++ AMP atomics

I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes - more specifically with atomic_fetch_add, which is apparently only available for integer types?

I need to add a double_4 (or if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?

Is the best/only solution really to have a lock variable which my code can use to control the writes? I actually need to do atomic writes for a long list of output doubles, so I would essentially need a lock for every output.

I have already considered tiling this for better performance, but right now I am just in the first iteration.

EDIT: Thanks for the quick answers already given. I have a quick update to my question though.

I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tag along. I was expecting only the first warp thread to get the lock, but I must be missing something (note that it has been quite a few years since my CUDA days, so I am rusty).

parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
   .....
   for (int j = 0; j < attracted.extent.size(); j++)
   {
      ...
      int lock = 0; //the expected lock value
      while (!atomic_compare_exchange(&locks[j], &lock, 1));
      //when one warp thread gets the lock, ALL threads continue on
      ...
      acceleration[j] += ...; //locked write
      locks[j] = 0; //leaving the lock again
   }
});

In practice this is not a big problem, since I should accumulate into a tile-shared variable first and only write it to global memory after all threads in a tile have completed, but I just don't understand this behavior.
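For reference, here is a minimal sketch of that tile-local idea in isolation. Everything in it - the names, the tile size, and the one-output-slot-per-tile layout - is an assumption for illustration, not the code from the question: each thread writes its contribution into tile_static memory, the tile synchronizes, and a single thread per tile performs the one global write.

#include <amp.h>
using namespace concurrency;

static const int TILE_SIZE = 256;   // assumed tile size; the extent must be a multiple of it

// Hypothetical example: per-thread contributions are reduced inside each tile
// and only one thread per tile touches global memory, so no per-element lock
// is needed for that final write.
void accumulate_per_tile(array_view<const float, 1> contributions,
                         array_view<float, 1> perTileSums)
{
    parallel_for_each(contributions.extent.tile<TILE_SIZE>(),
        [=](tiled_index<TILE_SIZE> tidx) restrict(amp)
    {
        tile_static float partial[TILE_SIZE];
        partial[tidx.local[0]] = contributions[tidx.global];
        tidx.barrier.wait();                      // everyone in the tile has written

        if (tidx.local[0] == 0)                   // one thread per tile
        {
            float sum = 0.0f;
            for (int i = 0; i < TILE_SIZE; ++i)
                sum += partial[i];
            perTileSums[tidx.tile[0]] = sum;      // single uncontended global write
        }
    });
}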

All of the atomic add operations are only for integer types. You can do what you want without locks using 128-bit CAS (compare-and-swap) operations for float_4 (I'm assuming this is 4 floats), but there are no 256-bit CAS operations, which is what you would need for double_4. What you have to do is have a loop which atomically reads the float_4 from memory, performs the float add in the regular way, and then uses CAS to swap in the new value if the original is still there (and loops if not, i.e. some other thread changed the value between the read and the write). Note that 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
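The C++ AMP library itself only exposes 32-bit compare-and-exchange (atomic_compare_exchange on int and unsigned int), so one way to sketch the read-modify-CAS loop described above is per component, treating each float's bit pattern as an unsigned int. The helper below is an assumption-laden illustration rather than library functionality: the function name is made up, it assumes reinterpret_cast between float* and unsigned int* is accepted by the restrict(amp) compiler, and for a float_4 you would call it once per component.

#include <amp.h>

// Hypothetical helper, not part of C++ AMP: add 'value' to the float at
// 'dest' by looping on the 32-bit atomic_compare_exchange applied to the
// float's bit pattern. The reinterpret_casts are an assumption about what
// the restrict(amp) compiler accepts.
inline void atomic_add_float(float* dest, float value) restrict(amp)
{
   unsigned int* destBits = reinterpret_cast<unsigned int*>(dest);
   unsigned int expectedBits = *destBits;                  // current bit pattern
   while (true)
   {
      float expected = *reinterpret_cast<float*>(&expectedBits);
      float desired = expected + value;                    // the regular float add
      unsigned int desiredBits = *reinterpret_cast<unsigned int*>(&desired);

      // Swap only if nobody changed *dest since we read it. On failure,
      // atomic_compare_exchange copies the value it found into expectedBits,
      // so the next iteration retries against fresh data.
      if (concurrency::atomic_compare_exchange(destBits, &expectedBits, desiredBits))
         break;
   }
}

A double would need a 64-bit exchange, which the C++ AMP library does not provide, so for double output a lock or tile-local accumulation remains necessary.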

If the critical section is short, you can create your own lock using atomic operations:

int lock = 1; // shared flag: 1 = free, 0 = held

// try to acquire the lock: test_and_set stores 0 and returns the old value,
// so keep spinning while another thread already holds it (old value 0)
while (__sync_lock_test_and_set(&lock, 0) == 0)
{
    // yield the thread or go to sleep
}

// critical section, do the work

// release the lock (in real code this store should have release semantics)
lock = 1;

The advantage is that you avoid the overhead of OS-level locks.
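Note that __sync_lock_test_and_set is a CPU-side GCC builtin and is not available inside restrict(amp) code, but the same test-and-set idea can be written with C++ AMP's own atomics. A minimal sketch with made-up names, assuming an int lock word where 0 means free and 1 means held:

#include <amp.h>
using namespace concurrency;

// Hypothetical helpers: a test-and-set spin lock built on C++ AMP's
// atomic_exchange, which returns the previous value. 0 = free, 1 = held.
inline void lock_acquire(int& flag) restrict(amp)
{
   // keep trying until this thread is the one that flips 0 -> 1
   while (atomic_exchange(&flag, 1) == 1) {}
}

inline void lock_release(int& flag) restrict(amp)
{
   // release by atomically writing 0 back
   atomic_exchange(&flag, 0);
}

Inside the kernel this would wrap the guarded write as lock_acquire(locks[j]); ... lock_release(locks[j]);. Be aware that spinning inside a kernel interacts badly with lock-step (warp/wavefront) execution, which is related to the behavior described in the question, so treat this as a sketch rather than a recommendation.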

The question has essentially been answered by others: you need to handle double atomics yourself, as there is no function for it in the library.

I would also like to elaborate on my own edit, in case others come here with the same failing lock.

In the following example, my error was in not realizing that when the exchange fails, it actually changes the expected value! The first thread expects the lock to be zero and writes a 1 into it. The next thread also expects 0 and fails to write a 1 - but the failed exchange then stores the observed 1 into the variable holding the expected value. This means that the next time that thread tries the exchange, it expects a 1 in the lock! It finds one, and so it thinks it has acquired the lock.

I was simply not aware that the variable passed as &lock would be overwritten with a 1 when the exchange comparison fails!

parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
   .....
   for (int j = 0; j < attracted.extent.size(); j++)
   {
      ...
      int lock = 0; //the expected lock value

      //note that, if locks[j] != lock, then lock is set to 1,
      //meaning the exchange will "succeed" the next time if locks[j] == 1,
      //meaning the while loop terminates even though someone else holds the lock
      while (!atomic_compare_exchange(&locks[j], &lock, 1));
      //when one warp thread gets the lock, ALL threads continue on
      ...
      acceleration[j] += ...; //locked write
      locks[j] = 0; //leaving the lock again
   }
});
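As a standalone illustration of that failure path, here is a made-up minimal demo (not the kernel above): a single lock word that is already held, probed twice with the same expected variable.

#include <amp.h>
#include <vector>
using namespace concurrency;

int main()
{
    std::vector<int> flagData(1, 1);        // the lock word, already "held" (1)
    std::vector<int> resultData(2, -1);
    array_view<int, 1> flag(1, flagData);
    array_view<int, 1> result(2, resultData);

    parallel_for_each(extent<1>(1), [=](index<1>) restrict(amp)
    {
        int expected = 0;                                   // we assume the lock is free
        atomic_compare_exchange(&flag[0], &expected, 1);    // fails: flag[0] is 1
        result[0] = expected;                               // expected is now 1

        // second attempt with the *same* expected variable now "succeeds",
        // even though the lock was never released by its real owner
        result[1] = atomic_compare_exchange(&flag[0], &expected, 1) ? 1 : 0;
    });

    result.synchronize();
    // resultData[0] == 1 and resultData[1] == 1
    return 0;
}

Both results come back as 1: the first failed exchange overwrote expected, and the second call then matched and reported success.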

A fix is to reset the expected value after each failed exchange:

parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
   .....
   for (int j = 0; j < attracted.extent.size(); j++)
   {
      ...
      int lock = 0; //the expected lock value

      while (!atomic_compare_exchange(&locks[j], &lock, 1))
      {
          lock = 0; //reset the expected value after a failed exchange
      }
      //when one warp thread gets the lock, ALL threads continue on
      ...
      acceleration[j] += ...; //locked write
      locks[j] = 0; //leaving the lock again
   }
});
