
Parallel list reduction in CUDA

I am working through the CUDA parallel reduction whitepaper, but unfortunately my algorithm seems to repeatedly produce incorrect results, and I cannot seem to figure out why (surely a textbook example must work? Surely I'm just doing something very obviously wrong?). Here is my kernel function:

My define:

 #define BLOCK_SIZE 512

My kernel function:

 __global__ void total(float * inputList, float * outputList, int len) {
      __shared__ float sdata[2*BLOCK_SIZE];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
      sdata[tid] = inputList[i] + inputList[i+blockDim.x];
      __syncthreads();
      for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
          sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
      }
      if (tid == 0) 
        outputList[blockIdx.x] = sdata[0];
}

My memory allocation:

  outputSize = inputSize / (BLOCK_SIZE<<1);
  cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
  cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
  cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);

My device call:

 dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1);
 dim3 dimBlock(BLOCK_SIZE,1,1);

 total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
 cudaDeviceSynchronize();

My memory fetch:

 cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);

And finally my final calculation:

 for (int counter = 1; counter < outputSize; counter++) {
    hostOutput[0] += hostOutput[counter];
 }

Any help would be appreciated.

Your kernel launch configuration in the following line of your code is incorrect.

total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize); 

The first argument of the kernel configuration is the grid size and the second argument is the block size.

You should be doing this:

total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize); 

Please always perform error checking on CUDA Runtime function calls and check the returned error codes to get the reason for the failure of your program.

Your kernel launch should fail in your current code. Error checking on the cudaDeviceSynchronize call would have led you to the reason for the incorrect results.
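A minimal checking pattern looks like the following (a sketch; the `CHECK_CUDA` macro name is my own, not from your code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA Runtime call; abort with the error string on failure.
#define CHECK_CUDA(call)                                                \
  do {                                                                  \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",                      \
              __FILE__, __LINE__, cudaGetErrorString(err));             \
      exit(EXIT_FAILURE);                                               \
    }                                                                   \
  } while (0)

// Usage around the launch: launch-configuration errors surface via
// cudaGetLastError(), asynchronous execution errors via
// cudaDeviceSynchronize().
//
//   total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, inputSize);
//   CHECK_CUDA(cudaGetLastError());
//   CHECK_CUDA(cudaDeviceSynchronize());
```

With the swapped `<<<dimBlock, dimGrid>>>` arguments, `cudaGetLastError()` would typically report an invalid-configuration error here, pointing you straight at the bug.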

The code assumes the input size is a multiple of the block size. If inputSize is not a multiple of the block size, it will read off the end of the inputList array.
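One common fix is to guard each load and contribute the identity element (0 for a sum) past the end of the array; a sketch using the kernel's existing names, assuming `len` is the actual input length:

```cuda
// Inside the kernel, replace the unguarded load with:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
float sum = 0.0f;
if (i < len)              sum += inputList[i];
if (i + blockDim.x < len) sum += inputList[i + blockDim.x];
sdata[tid] = sum;
__syncthreads();
```

Note that the posted launch passes outputSize as the kernel's len argument; the guard only helps if the kernel actually receives the input length.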
