Effect of the division operator in a CUDA kernel on the number of registers per thread

Question

I was writing a program that includes a CUDA kernel. I found that with #define OPERATOR * one thread uses 11 registers, but with #define OPERATOR / (the division operator) one thread uses 52 registers! What's wrong? I must decrease the register count (I don't want to set maxrregcount)! How can I decrease the number of registers when I'm using the division operator in a CUDA kernel?

#include <stdio.h>
#include <stdlib.h>
#define GRID_SIZE 1
#define BLOCK_SIZE 1
#define OPERATOR /
// Each thread walks the whole array and applies OPERATOR to every element.
__global__ void kernel(double* array){
    for (int curEl=0;curEl<BLOCK_SIZE;++curEl){
        array[curEl]=array[curEl] OPERATOR 10;
    }
}
int main(void) {
    double *devPtr=NULL,*data=(double*)malloc(sizeof(double)*BLOCK_SIZE);
    cudaFuncAttributes cudaFuncAttr;
    // Query the compiled kernel's attributes, including registers per thread.
    cudaFuncGetAttributes(&cudaFuncAttr,kernel);
    for (int curElem=0;curElem<BLOCK_SIZE;++curElem){
        data[curElem]=curElem;
    }
    cudaMalloc(&devPtr,sizeof(double)*BLOCK_SIZE);
    cudaMemcpy(devPtr,data,sizeof(double)*BLOCK_SIZE,cudaMemcpyHostToDevice);
    kernel<<<GRID_SIZE,BLOCK_SIZE>>>(devPtr);
    cudaDeviceSynchronize(); // wait for the kernel to finish before exiting
    printf("1 thread needs %d regs\n",cudaFuncAttr.numRegs);
    cudaFree(devPtr);
    free(data);
    return 0;
}

Answer 1

The increase in register use when switching from a double-precision multiplication to a double-precision division in the kernel is due to the fact that double-precision multiplication is a built-in hardware instruction, while double-precision division is a sizable software subroutine that gets called (that is, a function call of sorts). This is easily verified by inspecting the generated machine code (SASS) with cuobjdump --dump-sass.
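To make the comparison concrete, here is a minimal sketch that puts both operators in one file and asks the runtime how many registers each kernel was given. The kernel names mulKernel and divKernel and the helper regsPerThread are made up for illustration; the actual counts depend on the GPU architecture and compiler version, so none are claimed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels: identical except for the operator.
__global__ void mulKernel(double* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 10.0;  // double-precision multiply
}

__global__ void divKernel(double* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] / 10.0;  // double-precision divide
}

// Returns the per-thread register count assigned to a kernel, or -1 on error.
template <typename Kernel>
int regsPerThread(Kernel k) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, k) != cudaSuccess) return -1;
    return attr.numRegs;
}

int main() {
    printf("mulKernel: %d registers per thread\n", regsPerThread(mulKernel));
    printf("divKernel: %d registers per thread\n", regsPerThread(divKernel));
    return 0;
}
```

After building this with nvcc, running cuobjdump --dump-sass on the resulting executable should show the two kernels side by side, which makes the extra code generated for the division easy to spot.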

Double-precision divisions (and in fact all divisions, including single-precision and integer division) are emulated, either by inline code or by called subroutines, because the GPU hardware has no direct support for division operations. This keeps the individual computational cores ("CUDA cores") as simple and as small as possible, which ultimately leads to higher peak performance for a chip of a given size. It likely also improves the efficiency of the cores as measured by the GFLOPS/watt metric.

For release builds, the typical increase in register use caused by the introduction of double-precision division is around 26 registers. These additional registers are needed to store intermediate variables in the division computation, where each double-precision temporary variable requires two 32-bit registers.

As Marco13 points out in a comment above, it may be possible to manually replace the division with a multiplication by the reciprocal. However, this causes slight numerical differences in most cases, which is why the CUDA compiler does not apply this transformation automatically.
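A rough sketch of that manual replacement follows, together with a host-side check of how often the result differs from a true division. The names scaleByReciprocal and countMismatches, the array size, and the block size of 256 are assumptions chosen for illustration, not part of the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies by a precomputed reciprocal instead of dividing.
__global__ void scaleByReciprocal(double* a, int n, double recip) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * recip;
}

// Counts elements where the reciprocal result is not bit-identical
// to a true division of the same input on the host.
int countMismatches(const double* viaReciprocal, const double* input,
                    int n, double divisor) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (viaReciprocal[i] != input[i] / divisor) ++count;
    return count;
}

int main() {
    const int n = 1024;
    const double divisor = 10.0;
    const size_t bytes = n * sizeof(double);
    double* input = (double*)malloc(bytes);
    double* result = (double*)malloc(bytes);
    if (!input || !result) return 1;
    for (int i = 0; i < n; ++i) input[i] = (double)i;

    double* dev = NULL;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dev, input, bytes, cudaMemcpyHostToDevice);
    scaleByReciprocal<<<(n + 255) / 256, 256>>>(dev, n, 1.0 / divisor);
    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(result, dev, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d of %d results differ from true division\n",
           countMismatches(result, input, n, divisor), n);
    cudaFree(dev);
    free(input);
    free(result);
    return 0;
}
```

Whether the mismatches matter depends on the application; they are typically confined to the last bit of the result, but that is enough to prevent the compiler from making the substitution on its own.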

Generally speaking, register use can be controlled with compilation-unit granularity through the -maxrregcount nvcc compiler flag, or with per-function granularity using the __launch_bounds__ function attribute. However, forcing register use more than a few registers below the level determined by the compiler frequently leads to register spilling in the generated code, which usually has a negative impact on kernel performance.
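As an illustration of the per-function approach, the sketch below attaches __launch_bounds__ to a kernel and then reads back the register count and local memory size. The kernel name boundedDivKernel, the helper localBytesPerThread, and the bounds 256 and 4 are arbitrary assumptions; suitable values depend on the target GPU and the intended launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS_PER_BLOCK 256  // assumed value, for illustration only
#define MIN_BLOCKS_PER_SM 4        // assumed value, for illustration only

// Hypothetical kernel: the launch bounds let the compiler derive a register
// limit for this function alone, unlike -maxrregcount, which applies to the
// whole compilation unit.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
boundedDivKernel(double* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] / 10.0;
}

// Returns the per-thread local memory size in bytes, or -1 on error.
long localBytesPerThread() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, boundedDivKernel) != cudaSuccess) return -1;
    return (long)attr.localSizeBytes;
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, boundedDivKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    // For a kernel without local arrays, non-zero local memory suggests
    // that registers were spilled.
    printf("local memory per thread: %ld bytes\n", localBytesPerThread());
    return 0;
}
```

Compiling with nvcc -Xptxas -v additionally reports register use and spill loads and stores at build time, which is a quick way to see whether a tighter limit has started to cost performance.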

Answer 1: score 5, accepted, 2014-08-05 16:06:34
