Iterating over the second loop with a sum reduction in CUDA

Question

I have to port this C++ code to CUDA C and parallelize it:

for (ihist = 0; ihist < numhist; ihist++) {
    for (iwin = 0; iwin < numwin; iwin++) {
        denwham[ihist] += (numbinwin[iwin] / g[iwin]) * exp(F[iwin] - U[ihist]);
    }
    Punnorm[ihist] = numwham[ihist] / denwham[ihist];
}

In CUDA C, using a sum reduction:

extern __shared__ float sdata[];
int tx = threadIdx.x;
int i = blockIdx.x;
int j = blockIdx.y;
float sum = 0.0;
float temp = 0.0;
temp = U[j];

if (tx < numwin)
{
  sum = (numbinwin[tx] / g[tx]) * exp(F[tx] - temp);
  sdata[tx] = sum;
  __syncthreads();
}

for (int offset = blockDim.x / 2; offset > 0; offset >>= 1)
{
  if (tx < offset)
  {
    // add a partial sum upstream to our own
    sdata[tx] += sdata[tx + offset];
  }
  __syncthreads();
}

// finally, thread 0 writes the result
if (threadIdx.x == 0)
{
  // note that the result is per-block
  // not per-thread
  denwham[i] = sdata[0];

  for (int k = 0; k < numhist; k++)
    Punnorm[k] = numwham[k] / denwham[k];
}

And I launch it this way:

 int smem_sz = (256)*sizeof(float);
  dim3 Block(numhist,numhist,1);
  NewProbabilitiesKernel<<<Block,256,smem_sz>>>(...);

My problem is that I cannot iterate over U inside the exp. I have tried the following:

a) a for/while loop inside the kernel that iterates over U
b) iterating per thread
c) iterating per block

All of these attempts give different results between the C++ code and the CUDA code. The code works fine if I put a constant in place of U[i]!

Do you have any idea how to help me?

Thanks.

Answer 1

It looks like you could move U out of the inner loop, since exp(F[iwin] - U[ihist]) is exp(F[iwin]) times a factor that does not depend on iwin:

// denwham is now a single scalar; float as in the question's kernel
float denwham = 0.0f;
for (iwin = 0; iwin < numwin; iwin++) {
    denwham += numbinwin[iwin] / g[iwin] * exp(F[iwin]);
}
for (ihist = 0; ihist < numhist; ihist++) {
    Punnorm[ihist] = numwham[ihist] / denwham * exp(U[ihist]);
}
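
The rewrite above can be checked with a short derivation. As a sketch, write h for ihist and w for iwin (shortened names used only here), and assume denwham[ihist] is zero before the original loop starts, which the `+=` in the question suggests but does not state:

```latex
\begin{align*}
\mathrm{denwham}[h]
  &= \sum_{w} \frac{\mathrm{numbinwin}[w]}{g[w]}\, e^{F[w] - U[h]}
   = e^{-U[h]} \underbrace{\sum_{w} \frac{\mathrm{numbinwin}[w]}{g[w]}\, e^{F[w]}}_{S}, \\
\mathrm{Punnorm}[h]
  &= \frac{\mathrm{numwham}[h]}{\mathrm{denwham}[h]}
   = \frac{\mathrm{numwham}[h]}{S}\, e^{U[h]}.
\end{align*}
```

The scalar denwham in the rewritten loop plays the role of S, so the sum over iwin only has to be computed once rather than once per ihist.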

Update

After that, you could use two simple kernels instead of one complex kernel to finish the task:

a reduction kernel to compute denwham;
a 1-D transform kernel to compute Punnorm.
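
A minimal sketch of what those two kernels might look like follows. It is not the asker's code: the kernel names, the `toDevice` and `computePunnorm` helpers, and the `kThreads` constant are hypothetical, every array is assumed to be `float`, and the reduction assumes a single block with a power-of-two thread count is enough for numwin. Error checking on the CUDA calls is left out for brevity.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // power of two, required by the tree reduction

// Kernel 1: one block reduces S = sum_w numbinwin[w] / g[w] * exp(F[w]).
__global__ void denwhamKernel(const float* numbinwin, const float* g,
                              const float* F, int numwin, float* denwham)
{
    __shared__ float sdata[kThreads];
    int tx = threadIdx.x;

    // Strided loop so numwin may be smaller or larger than the block size.
    float sum = 0.0f;
    for (int w = tx; w < numwin; w += blockDim.x)
        sum += numbinwin[w] / g[w] * expf(F[w]);
    sdata[tx] = sum;   // every thread stores a value (0 if it had no work)
    __syncthreads();   // outside any branch, so all threads reach it

    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (tx < offset)
            sdata[tx] += sdata[tx + offset];
        __syncthreads();
    }
    if (tx == 0)
        *denwham = sdata[0];
}

// Kernel 2: one thread per histogram bin, no communication between threads.
__global__ void punnormKernel(const float* numwham, const float* U,
                              const float* denwham, int numhist, float* Punnorm)
{
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h < numhist)
        Punnorm[h] = numwham[h] / *denwham * expf(U[h]);
}

// Hypothetical helper: copy a host vector into a new device buffer.
static float* toDevice(const std::vector<float>& v)
{
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper that runs both kernels and returns Punnorm.
std::vector<float> computePunnorm(const std::vector<float>& numbinwin,
                                  const std::vector<float>& g,
                                  const std::vector<float>& F,
                                  const std::vector<float>& numwham,
                                  const std::vector<float>& U)
{
    assert(numbinwin.size() == g.size() && g.size() == F.size());
    assert(numwham.size() == U.size());
    int numwin = (int)F.size(), numhist = (int)U.size();

    float *dNb = toDevice(numbinwin), *dG = toDevice(g), *dF = toDevice(F);
    float *dNw = toDevice(numwham), *dU = toDevice(U);
    float *dDen = nullptr, *dP = nullptr;
    cudaMalloc(&dDen, sizeof(float));
    cudaMalloc(&dP, numhist * sizeof(float));

    // Both launches use the default stream, so kernel 2 runs after kernel 1.
    denwhamKernel<<<1, kThreads>>>(dNb, dG, dF, numwin, dDen);
    int blocks = (numhist + kThreads - 1) / kThreads;
    punnormKernel<<<blocks, kThreads>>>(dNw, dU, dDen, numhist, dP);

    std::vector<float> Punnorm(numhist);
    cudaMemcpy(Punnorm.data(), dP, numhist * sizeof(float), cudaMemcpyDeviceToHost);
    for (float* p : {dNb, dG, dF, dNw, dU, dDen, dP}) cudaFree(p);
    return Punnorm;
}
```

Compared with the kernel in the question, each thread here writes its shared-memory slot unconditionally and `__syncthreads()` sits outside any branch, and Punnorm is only computed after the reduction has finished, so no block reads a denwham value that another block has not written yet.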
