CUDA kernel: performance drops by 10x when the loop count is increased by 10%

I have a simple CUDA kernel for testing loop unrolling, and while using it I noticed something else: when the loop count is 10, the kernel takes 34 milliseconds; when the loop count is 90, it takes 59 milliseconds; but when the loop count is 100, it takes 423 milliseconds. The launch configuration is the same in all three cases; only the loop count changed. So my question is: what could be the reason for this performance drop?

Here is the code. The input is an array of 128x1024x1024 elements, and I'm using PyCUDA:

__global__ void copy(float *input, float *output) {
  int tidx = blockIdx.y * blockDim.x + threadIdx.x;
  int stride = 1024 * 1024;
  for (int i = 0; i < 128; i++) {
    int idx = i * stride + tidx;
    float x = input[idx];
    float y = 0;

    for (int j = 0; j < 100; j += 10) {
      x = x + sqrt(float(j));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+1));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+2));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+3));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+4));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+5));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+6));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+7));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+8));
      y = sqrt(abs(x)) + sin(x) + cos(x);

      x = x + sqrt(float(j+9));
      y = sqrt(abs(x)) + sin(x) + cos(x);
    }

    output[idx] = y;
  }
}

The loop count I mentioned is the upper bound (100 here) in this line:

for (int j = 0; j < 100; j += 10)

And here are sample outputs:

10 loops

griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info    : 0 bytes gmem, 24 bytes cmem[3]
ptxas info    : Compiling entry function 'copy' for 'sm_61'
ptxas info    : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]

computation takes 34.24 miliseconds

90 loops

griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info    : 0 bytes gmem, 24 bytes cmem[3]
ptxas info    : Compiling entry function 'copy' for 'sm_61'
ptxas info    : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]

computation takes 59.33 miliseconds

100 loops

griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info    : 0 bytes gmem, 24 bytes cmem[3]
ptxas info    : Compiling entry function 'copy' for 'sm_61'
ptxas info    : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 336 bytes cmem[0], 52 bytes cmem[2]

computation takes 422.96 miliseconds

The problem seems to come from loop unrolling.

Indeed, the 10-loops case can be trivially unrolled by NVCC, since the loop body always executes exactly once (so the for line can be removed, with j set to 0). The 90-loops case is also unrolled by NVCC (there are only 9 actual iterations). The resulting code is much bigger but still fast, since no branches are executed (GPUs hate branches). The 100-loops case, however, is not unrolled by NVCC: you hit a threshold of the compiler's optimizer. The resulting code is small, but more branches are executed at runtime: one branch for each executed loop iteration (10 in total). You can see the assembly code difference here.

You can force unrolling using the directive #pragma unroll. However, keep in mind that increasing the size of the code can reduce its performance.
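As a rough way to try this, the sketch below compiles the same kernel twice: once with unrolling of the j loop disabled (#pragma unroll 1) and once with it forced (#pragma unroll 10, which covers all 10 iterations), then times both with CUDA events. The names copy_variant, UNROLL and time_kernel are made up for this illustration and do not appear in the question. The ten hand-written steps are also folded into an inner k loop to keep the listing short, and sqrtf/fabsf/sinf/cosf are written out as the float variants of the calls in the question. It assumes the same launch configuration as in the question and enough device memory for two 512 MiB buffers. No timings are shown because they depend on the GPU and compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// UNROLL is a hypothetical template parameter used only in this sketch:
// 1 disables unrolling of the j loop, 10 asks for all 10 iterations to be unrolled.
template <int UNROLL>
__global__ void copy_variant(const float *input, float *output, int planes, int stride) {
  int tidx = blockIdx.y * blockDim.x + threadIdx.x;
  for (int i = 0; i < planes; i++) {
    int idx = i * stride + tidx;
    float x = input[idx];
    float y = 0.0f;

    #pragma unroll (UNROLL)
    for (int j = 0; j < 100; j += 10) {
      // Stands in for the ten hand-written steps of the original kernel.
      #pragma unroll
      for (int k = 0; k < 10; k++) {
        x = x + sqrtf(float(j + k));
        y = sqrtf(fabsf(x)) + sinf(x) + cosf(x);
      }
    }
    output[idx] = y;
  }
}

// Launches one variant with the question's configuration and returns the elapsed time in ms.
template <int UNROLL>
float time_kernel(const float *d_in, float *d_out, int planes, int stride) {
  dim3 grid(1, stride / 1024, 1), block(1024, 1, 1);
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  copy_variant<UNROLL><<<grid, block>>>(d_in, d_out, planes, stride);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const int planes = 128, stride = 1024 * 1024;
  const size_t bytes = size_t(planes) * stride * sizeof(float);
  float *d_in = nullptr, *d_out = nullptr;
  if (cudaMalloc(&d_in, bytes) != cudaSuccess || cudaMalloc(&d_out, bytes) != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: not enough device memory?\n");
    return 1;
  }
  cudaMemset(d_in, 0, bytes);  // all-zero input is enough for a timing comparison

  // Warm-up launches so that one-time startup cost is not measured.
  time_kernel<1>(d_in, d_out, planes, stride);
  time_kernel<10>(d_in, d_out, planes, stride);

  float rolled = time_kernel<1>(d_in, d_out, planes, stride);
  float unrolled = time_kernel<10>(d_in, d_out, planes, stride);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("j loop not unrolled (#pragma unroll 1):  %.2f ms\n", rolled);
  printf("j loop unrolled     (#pragma unroll 10): %.2f ms\n", unrolled);

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

If the explanation above holds on your hardware, the first timing should land near the slow 100-loops figure and the second near what you would extrapolate from the 90-loops case. If the two are close, the slowdown probably has another cause and comparing the generated assembly would be the next step.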

PS: the slightly higher number of registers used in the last version may decrease performance, but simulations show that it should be OK in this case.
