CUDA kernel: performance drops by 10x when the loop count is increased by 10%
I wrote a simple CUDA kernel to test loop unrolling and then discovered something odd: when the loop count is 10, the kernel takes 34 milliseconds; when the loop count is 90, it takes 59 milliseconds; but when the loop count is 100, it takes 423 milliseconds. The launch configuration is the same; only the loop count changed. So my question is: what could be the reason for this performance drop?
Here is the code. The input is an array of 128x1024x1024 elements, and I'm using PyCUDA:
__global__ void copy(float *input, float *output) {
    int tidx = blockIdx.y * blockDim.x + threadIdx.x;
    int stride = 1024 * 1024;
    for (int i = 0; i < 128; i++) {
        int idx = i * stride + tidx;
        float x = input[idx];
        float y = 0;
        for (int j = 0; j < 100; j += 10) {
            x = x + sqrt(float(j));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+1));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+2));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+3));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+4));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+5));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+6));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+7));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+8));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+9));
            y = sqrt(abs(x)) + sin(x) + cos(x);
        }
        output[idx] = y;
    }
}
The loop count I mentioned is this line:
for (int j = 0; j < 100; j += 10)
And sample outputs here:
10 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 34.24 milliseconds
90 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 59.33 milliseconds
100 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 422.96 milliseconds
The problem seems to come from loop unrolling. Indeed, the 10-loop case can be trivially unrolled by NVCC, since the loop is actually executed only once (the for line can be removed entirely, with j set to 0). The 90-loop case is also unrolled by NVCC (there are only 9 actual iterations). The resulting code is much bigger, but still fast, since no branches are executed (GPUs hate branches). However, the 100-loop case is not unrolled by NVCC: you hit a threshold of the compiler's optimizer. The resulting code is small, but it leads to more branches being executed at runtime: one branch per executed loop iteration (10 in total). You can see the difference in the generated assembly code.
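If you want to inspect the generated code yourself, here is a minimal sketch using PyCUDA, assuming the CUDA toolkit (nvcc and cuobjdump) is on your PATH; `kernel_src` is a placeholder for the kernel source string from the question.

```python
# Sketch: compile the kernel with PyCUDA and dump its SASS for comparison.
# Assumes nvcc and cuobjdump are installed and on PATH.
import subprocess
from pycuda.compiler import compile

kernel_src = r"""
__global__ void copy(float *input, float *output) {
    /* ... kernel body from the question ... */
}
"""

# compile() shells out to nvcc and returns the cubin as bytes;
# passing arch explicitly avoids needing an active CUDA context.
cubin = compile(kernel_src, arch="sm_61")
with open("copy.cubin", "wb") as f:
    f.write(cubin)

# Dump the SASS. In the non-unrolled (100-loop) version you should see a
# backward BRA (branch) closing the inner loop, which is absent when the
# compiler fully unrolls it.
sass = subprocess.run(["cuobjdump", "-sass", "copy.cubin"],
                      capture_output=True, text=True).stdout
print(sass)
```

Comparing the 90-loop and 100-loop dumps side by side makes the missing unroll visible directly.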
You can force unrolling using the #pragma unroll directive. However, keep in mind that increasing the size of the code can also reduce its performance.
PS: the slightly higher number of registers used in the 100-loop version may also decrease performance, but simulations show that it should be OK in this case.