
CUDA: Reduce algorithm

I am new to C++/CUDA. I tried implementing the parallel "reduce" algorithm with the ability to handle any input size and any thread size, without increasing the asymptotic parallel runtime, by recursing over the output of the kernel (in the kernel wrapper).

For example, in the top answer to Implementing Max Reduce in Cuda, the implementation will essentially be sequential when the thread size is small enough.

However, I keep getting a "Segmentation fault" when I compile and run it:

>> nvcc -o mycode mycode.cu
>> ./mycode
Segmentation fault.

Compiled on a K40 with CUDA 6.5.

Here is the kernel; it is basically the same as in the SO post I linked, except that the "out of bounds" check is different:

#include <stdio.h>

/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
  // position and threadId
  int pos = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;

  // do reduction in global memory
  for (unsigned int s = blockDim.x / 2; s>0; s>>=1)
  {
    if (tid < s)
    {
      if (pos+s < size) // Handling out of bounds
      {
        d_in[pos] = d_in[pos] + d_in[pos+s];
      }
    }
  }

  // only thread 0 writes the result for this block
  if (tid==0)
  {
    d_out[blockIdx.x] = d_in[pos];
  }
}

Here is the kernel wrapper I mentioned, which handles the case where one block will not contain all of the data:

/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, const int size, int num_threads)
{
  // setting up blocks and intermediate result holder
  int num_blocks = ((size) / num_threads) + 1;
  float * d_intermediate;
  cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);

  // recursively solving, will run approximately log base num_threads times.
  do
  {
    reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);

    // updating input to intermediate
    cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);

    // Updating num_blocks to reflect how many blocks we now want to compute on
    num_blocks = num_blocks / num_threads + 1;

    // updating intermediate
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
  }
  while(num_blocks > num_threads); // if it is too small, compute rest.

  // computing rest
  reduce_kernel<<<1, num_blocks>>>(d_out, d_in, size);

}

Main program to initialize in/out and create bogus data for testing:

/* -------- MAIN -------- */
int main(int argc, char **argv)
{
  // Setting num_threads
  int num_threads = 512;
  // Making bogus data and setting it on the GPU
  const int size = 1024;
  const int size_out = 1;
  float * d_in;
  float * d_out;
  cudaMalloc(&d_in, sizeof(float)*size);
  cudaMalloc((void**)&d_out, sizeof(float)*size_out);
  const int value = 5;
  cudaMemset(d_in, value, sizeof(float)*size);

  // Running kernel wrapper
  reduce(d_out, d_in, size, num_threads);

  printf("sum is element is: %.f", d_out[0]);
}

There are a few things I would point out with your code.

  1. As a general rule/boilerplate, I always recommend using proper cuda error checking and running your code with cuda-memcheck any time you are having trouble with a cuda code. Those methods wouldn't help much with the seg fault here, although they may help later (see below).
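
    For example, a minimal sketch of such an error-checking macro (the name cudaCheck and its exact form are my own illustration, not from your code) could look like this:

     // requires <stdio.h> and <stdlib.h>
     #define cudaCheck(call)                                        \
       do {                                                         \
         cudaError_t err = (call);                                  \
         if (err != cudaSuccess) {                                  \
           fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                   cudaGetErrorString(err), __FILE__, __LINE__);    \
           exit(1);                                                 \
         }                                                          \
       } while (0)

     // usage:
     // cudaCheck(cudaMalloc(&d_in, sizeof(float)*size));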

  2. The actual seg fault is occurring on this line:

     printf("sum is element is: %.f", d_out[0]); 

    you've broken a cardinal CUDA programming rule: host pointers must not be dereferenced in device code, and device pointers must not be dereferenced in host code. The latter condition applies here. d_out is a device pointer (allocated via cudaMalloc). Such pointers have no meaning if you attempt to dereference them in host code, and doing so will lead to a seg fault.

    The solution is to copy the data back to the host before printing it out:

     float result;
     cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
     printf("sum is element is: %.f", result);
  3. Using cudaMalloc in a loop on the same variable, without doing any cudaFree operations, is not good practice. It may lead to out-of-memory errors in long-running loops, and may also lead to memory leaks if such a construct is used in a larger program:

     do
     {
       ...
       cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
     } while...

    In this case, I think a better approach and trivial fix would be to cudaFree d_intermediate right before you re-allocate:

     do
     {
       ...
       cudaFree(d_intermediate);
       cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
     } while...
  4. This might not be doing what you think it is:

     const int value = 5;
     cudaMemset(d_in, value, sizeof(float)*size);

    probably you are aware of this, but cudaMemset, like memset, operates on byte quantities. So you are filling the d_in array with a value corresponding to the bit pattern 0x05050505 (and I have no idea what that bit pattern corresponds to when interpreted as a float quantity). Since you refer to bogus values, you may be cognizant of this already. But it's a common error (e.g. if you were actually trying to initialize the array with the value of 5 in every float location), so I thought I would point it out.
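
    If you're curious, a small host-side sketch (illustrative only, not part of the fix) shows what that bit pattern decodes to on a typical IEEE-754 machine:

     #include <stdio.h>
     #include <string.h>

     int main(void)
     {
       unsigned int bits = 0x05050505;
       float f;
       memcpy(&f, &bits, sizeof(f));  // safe type-punning via memcpy
       printf("0x05050505 as float: %g\n", f);  // prints roughly 6.1e-36, nowhere near 5.0f
       return 0;
     }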

Your code has other issues as well (which you will discover if you make the above fixes and then run your code with cuda-memcheck). To learn how to do good parallel reductions, I would recommend studying the CUDA parallel reduction sample code and presentation. Parallel reductions in global memory are not recommended for performance reasons.
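
For reference, here is a minimal sketch of a per-block shared-memory reduction along the lines of the CUDA sample (my own illustration, assuming blockDim.x is a power of two; it produces one partial sum per block):

__global__ void reduce_smem(float * d_out, const float * d_in, const int size)
{
  extern __shared__ float sdata[];  // sized at launch: num_threads*sizeof(float)
  int tid = threadIdx.x;
  int pos = blockIdx.x * blockDim.x + threadIdx.x;

  // each thread loads one element into shared memory (0 if out of bounds)
  sdata[tid] = (pos < size) ? d_in[pos] : 0.0f;
  __syncthreads();

  // tree-style reduction entirely within shared memory
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
  {
    if (tid < s)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  // thread 0 writes this block's partial sum to global memory
  if (tid == 0)
    d_out[blockIdx.x] = sdata[0];
}

Such a kernel would be launched as reduce_smem<<<num_blocks, num_threads, num_threads*sizeof(float)>>>(d_out, d_in, size), and the partial sums reduced again, much as your wrapper already does.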

For completeness, here are some of the additional issues I found:

  1. Your kernel code needs an appropriate __syncthreads() statement to ensure that the work of all threads in a block is complete before any thread goes on to the next iteration of the for-loop.

  2. Your final write to global memory in the kernel also needs to be conditioned on the read location being in-bounds. Otherwise, your strategy of always launching an extra block would allow the read on this line to be out-of-bounds (cuda-memcheck will show this).

  3. The reduction logic in the loop in your reduce function is generally messed up and needs to be reworked in several ways.

I'm not saying this code is defect-free, but it seems to work for the given test case and produces the correct answer (1024):

#include <stdio.h>

/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
  // position and threadId
  int pos = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;

  // do reduction in global memory
  for (unsigned int s = blockDim.x / 2; s>0; s>>=1)
  {
    if (tid < s)
    {
      if (pos+s < size) // Handling out of bounds
      {
        d_in[pos] = d_in[pos] + d_in[pos+s];
      }
    }
    __syncthreads();
  }

  // only thread 0 writes the result for this block
  if ((tid==0) && (pos < size))
  {
    d_out[blockIdx.x] = d_in[pos];
  }
}

/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, int size, int num_threads)
{
  // setting up blocks and intermediate result holder
  int num_blocks = ((size) / num_threads) + 1;
  float * d_intermediate;
  cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
  cudaMemset(d_intermediate, 0, sizeof(float)*num_blocks);
  int prev_num_blocks;
  // recursively solving, will run approximately log base num_threads times.
  do
  {
    reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);

    // updating input to intermediate
    cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);

    // Updating num_blocks to reflect how many blocks we now want to compute on
    prev_num_blocks = num_blocks;
    num_blocks = num_blocks / num_threads + 1;

    // updating intermediate
    cudaFree(d_intermediate);
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
    size = num_blocks*num_threads;
  }
  while(num_blocks > num_threads); // if it is too small, compute rest.

  // computing rest
  reduce_kernel<<<1, prev_num_blocks>>>(d_out, d_in, prev_num_blocks);

}

/* -------- MAIN -------- */
int main(int argc, char **argv)
{
  // Setting num_threads
  int num_threads = 512;
  // Making non-bogus data and setting it on the GPU
  const int size = 1024;
  const int size_out = 1;
  float * d_in;
  float * d_out;
  cudaMalloc(&d_in, sizeof(float)*size);
  cudaMalloc((void**)&d_out, sizeof(float)*size_out);
  //const int value = 5;
  //cudaMemset(d_in, value, sizeof(float)*size);
  float * h_in = (float *)malloc(size*sizeof(float));
  for (int i = 0; i <  size; i++) h_in[i] = 1.0f;
  cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);

  // Running kernel wrapper
  reduce(d_out, d_in, size, num_threads);
  float result;
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("sum is element is: %.f\n", result);
}
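
Compiled and run the same way as before, this should produce the expected output:

>> nvcc -o mycode mycode.cu
>> ./mycode
sum is element is: 1024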
