CUDA atomics and conditional branches
I am attempting to write a CUDA version of a serial code as part of implementing a periodic boundary condition in a molecular dynamics algorithm. The idea is that a tiny fraction of particles whose positions lie outside the box need to be put back in using one of two ways, with a limit on the number of times I use the first way.
Essentially, it boils down to the following MWE. I have an array x[N], where N is large, and the following serial code.
#include <cstdlib>
int main()
{
    int N = 30000;
    double x[30000];
    int Nmax = 10, count = 0;
    for(int i = 0; i < N; i++)
        x[i] = 1.0*(rand()%3);
    for(int i = 0; i < N; i++)
    {
        if(x[i] > 2.9)
        {
            if(count < Nmax)
            {
                x[i] += 0.1; //first way
                count++;
            }
            else
                x[i] -= 0.2; //second way
        }
    }
}
Please assume that x[i] > 2.9 holds only for a small fraction (about 12-15) of the 30000 elements of x.
Note that the order of i is not important, i.e. it does not have to be the 10 lowest values of i that use x[i] += 0.1, which makes the algorithm potentially parallelizable. I thought of the following CUDA version of the MWE, which compiles with nvcc -arch sm_35 main.cu, where main.cu reads as:
#include <cstdlib>
__global__ void PeriodicCondition(double *x, int *N, int *Nmax, int *count)
{
    int i = threadIdx.x+blockIdx.x*blockDim.x;
    if(i < N[0])
    {
        if(x[i] > 2.9)
        {
            if(count[0] < Nmax[0]) //===============(line a)
            {
                x[i] += 0.1; //first way
                atomicAdd(&count[0],1); //========(line b)
            }
            else
                x[i] -= 0.2; //second way
        }
    }
}
int main()
{
    int N = 30000;
    double x[30000];
    int Nmax = 10, count = 0;
    srand(128512);
    for(int i = 0; i < N; i++)
        x[i] = 1.0*(rand()%3);
    double *xD;
    cudaMalloc( (void**) &xD, N*sizeof(double) );
    cudaMemcpy( xD, &x, N*sizeof(double),cudaMemcpyHostToDevice );
    int *countD;
    cudaMalloc( (void**) &countD, sizeof(int) );
    cudaMemcpy( countD, &count, sizeof(int),cudaMemcpyHostToDevice );
    int *ND;
    cudaMalloc( (void**) &ND, sizeof(int) );
    cudaMemcpy( ND, &N, sizeof(int),cudaMemcpyHostToDevice );
    int *NmaxD;
    cudaMalloc( (void**) &NmaxD, sizeof(int) );
    cudaMemcpy( NmaxD, &Nmax, sizeof(int),cudaMemcpyHostToDevice );
    PeriodicCondition<<<938,32>>>(xD, ND, NmaxD, countD);
    cudaFree(NmaxD);
    cudaFree(ND);
    cudaFree(countD);
    cudaFree(xD);
}
Of course, this is not correct, because the if condition on (line a) reads a variable that is updated on (line b), so the value a thread sees might not be current. This is somewhat similar to the question "Cuda atomics change flag"; however, I am not sure if and how using critical sections would help.
Is there a way to make sure count[0] is up to date when every thread checks the if condition on (line a), without making the code too serial?
Just increment the atomic counter every time, and use its return value in your test:
...
if(x[i] > 2.9)
{
    int oldCount = atomicAdd(&count[0],1);
    if(oldCount < Nmax[0])
        x[i] += 0.1; //first way
    else
        x[i] -= 0.2; //second way
}
...
If, as you say, around 15 items exceed 2.9 and Nmax is around 10, there will be a small number of "extra" atomic operations, the overhead of which is probably minimal (and I can't see how to do it more efficiently, which isn't to say it isn't possible...).
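For completeness, here is a minimal sketch of how the whole thing might fit together. It is not the asker's exact program: it assumes N and Nmax are passed to the kernel by value rather than through device pointers, and the host helper applyPeriodicCondition is a hypothetical name introduced only for illustration. Error checking is omitted for brevity. The key point is that atomicAdd returns the value the counter held before the increment, so every thread that needs a correction gets a unique "ticket" and no thread ever tests a stale count.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each element above 2.9 takes a ticket from the counter. Tickets
// 0 .. nMax-1 use the first way, all later tickets use the second way.
__global__ void periodicCondition(double *x, int n, int nMax, int *count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && x[i] > 2.9)
    {
        // atomicAdd returns the OLD value, which is unique per thread.
        int ticket = atomicAdd(count, 1);
        if (ticket < nMax)
            x[i] += 0.1;   // first way
        else
            x[i] -= 0.2;   // second way
    }
}

// Hypothetical host helper (not from the question): applies the kernel to x
// in place and returns how many elements were corrected the first way.
int applyPeriodicCondition(std::vector<double> &x, int nMax)
{
    int n = static_cast<int>(x.size());
    if (n == 0)
        return 0;   // a launch with zero blocks would be invalid

    double *xD = nullptr;
    int *countD = nullptr;
    cudaMalloc((void **)&xD, n * sizeof(double));
    cudaMalloc((void **)&countD, sizeof(int));
    cudaMemcpy(xD, x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(countD, 0, sizeof(int));   // counter starts at zero

    int threads = 128;                            // assumed block size
    int blocks = (n + threads - 1) / threads;     // enough blocks to cover n
    periodicCondition<<<blocks, threads>>>(xD, n, nMax, countD);

    // These blocking copies wait for the kernel to finish before reading.
    int tickets = 0;
    cudaMemcpy(&tickets, countD, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(x.data(), xD, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(countD);
    cudaFree(xD);

    // The counter holds the total number of elements above 2.9, so the
    // number of first-way corrections is that total capped at nMax.
    return tickets < nMax ? tickets : nMax;
}
```

Note that with this approach the counter keeps growing past Nmax, so after the kernel it holds the total number of elements that exceeded 2.9 rather than the number of times the first way was used. If you need the latter on the host, cap the value at Nmax as the helper does.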