CUDA C using single precision flop on doubles

The problem

During a project in CUDA C, I came across unexpected behaviour regarding single precision and double precision floating point operations. In the project, I first fill an array with numbers in one kernel, and in another kernel I do some computation on these numbers. All variables and arrays are double precision, so I would not expect any single precision floating point operations to happen. However, if I analyze the executable of the program using NVPROF, it shows that single precision operations are executed. How is this possible?

Minimal, Complete, and Verifiable example

Here is the smallest program that shows this behaviour on my architecture (asserts and error catching have been left out). I use an Nvidia Tesla K40 graphics card.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define Nx 10
#define Ny 10
#define RANDOM double(0.236954587566)

__global__ void test(double *array, size_t pitch){
    double rho, u;
    int x = threadIdx.x + blockDim.x*blockIdx.x;
    int y = threadIdx.y + blockDim.y*blockIdx.y;
    int idx = y*(pitch/sizeof(double)) + 2*x;

    if(x < Nx && y < Ny){
        rho = array[idx]; 
        u = array[idx+1]/rho;
        array[idx] = rho*u;
    }
}

__global__ void fill(double *array, size_t pitch){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;  
    int idx = y*(pitch/sizeof(double)) + 2*x;

    if(x < Nx && y < Ny){  // both bounds must hold, otherwise threads write outside the allocation
        array[idx] = RANDOM*idx;
        array[idx + 1] = idx*idx*RANDOM;
    }
}

int main(int argc, char* argv[]) {
    double *d_array;
    size_t pitch;
    cudaMallocPitch((void **) &d_array, &pitch, 2*Nx*sizeof(double), Ny);

    dim3 threadDistribution = dim3(8,8);
    dim3 blockDistribution = dim3( (Nx + threadDistribution.x - 1) / (threadDistribution.x), (Ny + threadDistribution.y - 1) / (threadDistribution.y));
    fill <<< blockDistribution, threadDistribution >>> (d_array, pitch);
    cudaDeviceSynchronize();
    test <<< blockDistribution, threadDistribution >>> (d_array, pitch);
    cudaDeviceSynchronize();  // wait for test to finish before the program exits
    cudaFree(d_array);

    return 0;
}

The output of NVPROF (edited to make it more readable; if you need the full output, just ask in the comments):

....
Device "Tesla K40c (0)"
Kernel: test(double*, unsigned long)
      Metric Name             Min         Max         Avg
      flop_count_sp           198         198         198
      flop_count_sp_add         0           0           0
      flop_count_sp_mul         0           0           0
      flop_count_sp_fma        99          99          99
      flop_count_sp_special   102         102         102
      flop_count_dp          1214        1214        1214
      flop_count_dp_add         0           0           0
      flop_count_dp_mul       204         204         204
      flop_count_dp_fma       505         505         505

What I've found so far

I found that if I delete the division in the test kernel:

u = array[idx+1]/rho;
==>
u = array[idx+1];

the output is as expected: zero single precision operations and exactly 100 double precision operations are executed. Does anyone know why the division causes the program to use single precision flops and 10 times more double precision floating point operations? I've also tried using intrinsics (__ddiv_rn), but this didn't solve the problem.

Many thanks in advance!

Edit - Working solution

Although I still haven't figured out why it uses single precision, I have found a 'solution' to this problem, thanks to @EOF. Replacing the division by a multiplication with the reciprocal of rho did the job:

u = array[idx+1]/rho;
==>
u = array[idx+1]*__drcp_rn(rho);
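
One caveat with this replacement: the reciprocal is rounded once and the product is rounded again, so the result can differ from a correctly rounded division in the last bit. The following is a minimal sketch for checking this on a given device, not part of the original program; the names compare_div_rcp_sketch and count_mismatches_sketch and the choice of test values (a = b = 1..n) are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: for b = 1..n and a = b, compare a / b with a * (1 / b).
__global__ void compare_div_rcp_sketch(int n, int *mismatches) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        double b = (double)(i + 1);
        double a = b;                       // a / b is exactly 1.0
        double q_div = a / b;               // division, rounded once
        double q_rcp = a * __drcp_rn(b);    // rounded reciprocal, then rounded product
        if (q_div != q_rcp) atomicAdd(mismatches, 1);
    }
}

// Hypothetical host wrapper: returns the mismatch count, or -1 on a CUDA error.
int count_mismatches_sketch(int n) {
    int *d_count = NULL;
    int h_count = 0;
    if (cudaMalloc((void **)&d_count, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_count, 0, sizeof(int));
    int threads = 128;
    compare_div_rcp_sketch<<<(n + threads - 1) / threads, threads>>>(n, d_count);
    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return err == cudaSuccess ? h_count : -1;
}

int main() {
    int n = 1000;
    printf("%d of %d quotients differ\n", count_mismatches_sketch(n), n);
    return 0;
}
```

Whether a last-bit difference matters depends on the application; if it does, the plain division is the safer choice despite the extra instructions.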

As others have pointed out, CUDA devices do not have instructions for floating point division in hardware. Instead they start from an initial approximation to the reciprocal of the denominator, provided by a single precision special function unit. Its product with the numerator is then iteratively refined until it matches the fraction to within machine precision.

Even the __ddiv_rn() intrinsic is compiled to this instruction sequence by ptxas, so its use makes no difference.
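
The approximate-then-refine scheme described above can be sketched in a few lines. This is only an illustration of the idea, not the sequence ptxas actually emits: the function name div_sketch is hypothetical, a single precision division stands in for the special function unit's starting guess, and all special cases (zero, infinity, NaN, denormals, values outside float range) are deliberately left out.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Illustrative only: assumes y is a normal, non-zero value within float range.
__host__ __device__ double div_sketch(double x, double y) {
    // Low-precision starting guess; stands in for the special function unit.
    double r = (double)(1.0f / (float)y);
    // Refine the reciprocal: with e = 1 - y*r, r*(1 + e + e*e) = (1 - e^3)/y.
    double e = fma(-y, r, 1.0);
    e = fma(e, e, e);                // e + e*e
    r = fma(e, r, r);                // r * (1 + e + e*e)
    // Form the quotient, then correct it using the remainder x - y*q,
    // which the fused multiply-add evaluates with a single rounding.
    double q = x * r;
    double rem = fma(-y, q, x);
    return fma(rem, r, q);
}

__global__ void div_sketch_kernel(double x, double y, double *z) {
    *z = div_sketch(x, y);
}

int main() {
    double *d_z = NULL, h_z = 0.0;
    double x = 1.0, y = 3.0;
    cudaMalloc((void **)&d_z, sizeof(double));
    div_sketch_kernel<<<1, 1>>>(x, y, d_z);
    cudaMemcpy(&h_z, d_z, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_z);
    printf("sketch: %.17g  x / y: %.17g\n", h_z, x / y);
    return 0;
}
```

Everything after the starting guess is double precision fused multiply-adds and one multiply, which fits the observation that a single division shows up in the profile as several double precision operations plus a small number of single precision ones.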

You can gain closer insight by inspecting the code yourself using cuobjdump -sass, although this is made difficult by the fact that no official documentation for shader assembly is available other than the bare list of instructions.

I'll use the following bare-bones division kernel as an example:

__global__ void div(double x, double y, double *z) {
    *z = x / y;
}

This is compiled to the following shader assembly for a compute capability 3.5 device:

    Function : _Z3divddPd
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                          /* 0x08a0109c10801000 */
    /*0008*/                   MOV R1, c[0x0][0x44];                                      /* 0x64c03c00089c0006 */
    /*0010*/                   MOV R0, c[0x0][0x14c];                                     /* 0x64c03c00299c0002 */
    /*0018*/                   MOV32I R2, 0x1;                                            /* 0x74000000009fc00a */
    /*0020*/                   MOV R8, c[0x0][0x148];                                     /* 0x64c03c00291c0022 */
    /*0028*/                   MOV R9, c[0x0][0x14c];                                     /* 0x64c03c00299c0026 */
    /*0030*/                   MUFU.RCP64H R3, R0;                                        /* 0x84000000031c000e */
    /*0038*/                   MOV32I R0, 0x35b7333;                                      /* 0x7401adb9999fc002 */
                                                                                          /* 0x08a080a080a4a4a4 */
    /*0048*/                   DFMA R4, -R8, R2, c[0x2][0x0];                             /* 0x9b880840001c2012 */
    /*0050*/                   DFMA R4, R4, R4, R4;                                       /* 0xdb801000021c1012 */
    /*0058*/                   DFMA R4, R4, R2, R2;                                       /* 0xdb800800011c1012 */
    /*0060*/                   DMUL R6, R4, c[0x0][0x140];                                /* 0x64000000281c101a */
    /*0068*/                   FSETP.GE.AND P0, PT, R0, |c[0x0][0x144]|, PT;              /* 0x5db09c00289c001e */
    /*0070*/                   DFMA R8, -R8, R6, c[0x0][0x140];                           /* 0x9b881800281c2022 */
    /*0078*/                   MOV R2, c[0x0][0x150];                                     /* 0x64c03c002a1c000a */
                                                                                          /* 0x0880acb0a0ac8010 */
    /*0088*/                   MOV R3, c[0x0][0x154];                                     /* 0x64c03c002a9c000e */
    /*0090*/                   DFMA R4, R8, R4, R6;                                       /* 0xdb801800021c2012 */
    /*0098*/               @P0 BRA 0xb8;                                                  /* 0x120000000c00003c */
    /*00a0*/                   FFMA R0, RZ, c[0x0][0x14c], R5;                            /* 0x4c001400299ffc02 */
    /*00a8*/                   FSETP.GT.AND P0, PT, |R0|, c[0x2][0x8], PT;                /* 0x5da01c40011c021e */
    /*00b0*/               @P0 BRA 0xe8;                                                  /* 0x120000001800003c */
    /*00b8*/                   MOV R4, c[0x0][0x140];                                     /* 0x64c03c00281c0012 */
                                                                                          /* 0x08a1b810b8008010 */
    /*00c8*/                   MOV R5, c[0x0][0x144];                                     /* 0x64c03c00289c0016 */
    /*00d0*/                   MOV R7, c[0x0][0x14c];                                     /* 0x64c03c00299c001e */
    /*00d8*/                   MOV R6, c[0x0][0x148];                                     /* 0x64c03c00291c001a */
    /*00e0*/                   CAL 0xf8;                                                  /* 0x1300000008000100 */
    /*00e8*/                   ST.E.64 [R2], R4;                                          /* 0xe5800000001c0810 */
    /*00f0*/                   EXIT;                                                      /* 0x18000000001c003c */
    /*00f8*/                   LOP32I.AND R0, R7, 0x40000000;                             /* 0x20200000001c1c00 */
                                                                                          /* 0x08a08010a010b010 */
    /*0108*/                   MOV32I R15, 0x1ff00000;                                    /* 0x740ff800001fc03e */
    /*0110*/                   ISETP.LT.U32.AND P0, PT, R0, c[0x2][0xc], PT;              /* 0x5b101c40019c001e */
    /*0118*/                   MOV R8, RZ;                                                /* 0xe4c03c007f9c0022 */
    /*0120*/                   SEL R9, R15, c[0x2][0x10], !P0;                            /* 0x65002040021c3c26 */
    /*0128*/                   MOV32I R12, 0x1;                                           /* 0x74000000009fc032 */
    /*0130*/                   DMUL R10, R8, R6;                                          /* 0xe4000000031c202a */
    /*0138*/                   LOP32I.AND R0, R5, 0x7f800000;                             /* 0x203fc000001c1400 */
                                                                                          /* 0x08a0108ca01080a0 */
    /*0148*/                   MUFU.RCP64H R13, R11;                                      /* 0x84000000031c2c36 */
    /*0150*/                   DFMA R16, -R10, R12, c[0x2][0x0];                          /* 0x9b883040001c2842 */
    /*0158*/                   ISETP.LT.U32.AND P0, PT, R0, c[0x2][0x14], PT;             /* 0x5b101c40029c001e */
    /*0160*/                   MOV R14, RZ;                                               /* 0xe4c03c007f9c003a */
    /*0168*/                   DFMA R16, R16, R16, R16;                                   /* 0xdb804000081c4042 */
    /*0170*/                   SEL R15, R15, c[0x2][0x10], !P0;                           /* 0x65002040021c3c3e */
    /*0178*/                   SSY 0x3a0;                                                 /* 0x1480000110000000 */
                                                                                          /* 0x08acb4a4a4a4a480 */
    /*0188*/                   DMUL R14, R14, R4;                                         /* 0xe4000000021c383a */
    /*0190*/                   DFMA R12, R16, R12, R12;                                   /* 0xdb803000061c4032 */
    /*0198*/                   DMUL R16, R14, R12;                                        /* 0xe4000000061c3842 */
    /*01a0*/                   DFMA R10, -R10, R16, R14;                                  /* 0xdb883800081c282a */
    /*01a8*/                   DFMA R10, R10, R12, R16;                                   /* 0xdb804000061c282a */
    /*01b0*/                   DSETP.LEU.AND P0, PT, |R10|, RZ, PT;                       /* 0xdc581c007f9c2a1e */
    /*01b8*/              @!P0 BRA 0x1e0;                                                 /* 0x120000001020003c */
                                                                                          /* 0x088010b010b8acb4 */
    /*01c8*/                   DSETP.EQ.AND P0, PT, R10, RZ, PT;                          /* 0xdc101c007f9c281e */
    /*01d0*/              @!P0 BRA 0x358;                                                 /* 0x12000000c020003c */
    /*01d8*/                   DMUL.S R8, R4, R6;                                         /* 0xe4000000035c1022 */
    /*01e0*/                   ISETP.GT.U32.AND P0, PT, R0, c[0x2][0x18], PT;             /* 0x5b401c40031c001e */
    /*01e8*/                   MOV32I R0, 0x1ff00000;                                     /* 0x740ff800001fc002 */
    /*01f0*/                   MOV R14, RZ;                                               /* 0xe4c03c007f9c003a */
    /*01f8*/                   SEL R15, R0, c[0x2][0x10], !P0;                            /* 0x65002040021c003e */
                                                                                          /* 0x08b4a49c849c849c */
    /*0208*/                   DMUL R12, R10, R8;                                         /* 0xe4000000041c2832 */
    /*0210*/                   DMUL R18, R10, R14;                                        /* 0xe4000000071c284a */
    /*0218*/                   DMUL R10, R12, R14;                                        /* 0xe4000000071c302a */
    /*0220*/                   DMUL R16, R8, R18;                                         /* 0xe4000000091c2042 */
    /*0228*/                   DFMA R8, R10, R6, -R4;                                     /* 0xdb901000031c2822 */
    /*0230*/                   DFMA R12, R16, R6, -R4;                                    /* 0xdb901000031c4032 */
    /*0238*/                   DSETP.GT.AND P0, PT, |R8|, |R12|, PT;                      /* 0xdc209c00061c221e */
                                                                                          /* 0x08b010ac10b010a0 */
    /*0248*/                   SEL R9, R17, R11, P0;                                      /* 0xe5000000059c4426 */
    /*0250*/                   FSETP.GTU.AND P1, PT, |R9|, 1.469367938527859385e-39, PT;  /* 0xb5e01c00801c263d */
    /*0258*/                   MOV R11, R9;                                               /* 0xe4c03c00049c002e */
    /*0260*/                   SEL R8, R16, R10, P0;                                      /* 0xe5000000051c4022 */
    /*0268*/               @P1 NOP.S;                                                     /* 0x8580000000443c02 */
    /*0270*/                   FSETP.LT.AND P0, PT, |R5|, 1.5046327690525280102e-36, PT;  /* 0xb5881c20001c161d */
    /*0278*/                   MOV32I R0, 0x3ff00000;                                     /* 0x741ff800001fc002 */
                                                                                          /* 0x0880a48090108c10 */
    /*0288*/                   MOV R16, RZ;                                               /* 0xe4c03c007f9c0042 */
    /*0290*/                   SEL R17, R0, c[0x2][0x1c], !P0;                            /* 0x65002040039c0046 */
    /*0298*/                   LOP.OR R10, R8, 0x1;                                       /* 0xc2001000009c2029 */
    /*02a0*/                   LOP.AND R8, R8, -0x2;                                      /* 0xca0003ffff1c2021 */
    /*02a8*/                   DMUL R4, R16, R4;                                          /* 0xe4000000021c4012 */
    /*02b0*/                   DMUL R6, R16, R6;                                          /* 0xe4000000031c401a */
    /*02b8*/                   DFMA R14, R10, R6, -R4;                                    /* 0xdb901000031c283a */
                                                                                          /* 0x08b010b010a0b4a4 */
    /*02c8*/                   DFMA R12, R8, R6, -R4;                                     /* 0xdb901000031c2032 */
    /*02d0*/                   DSETP.GT.AND P0, PT, |R12|, |R14|, PT;                     /* 0xdc209c00071c321e */
    /*02d8*/                   SEL R8, R10, R8, P0;                                       /* 0xe5000000041c2822 */
    /*02e0*/                   LOP.AND R0, R8, 0x1;                                       /* 0xc2000000009c2001 */
    /*02e8*/                   IADD R11.CC, R8, -0x1;                                     /* 0xc88403ffff9c202d */
    /*02f0*/                   ISETP.EQ.U32.AND P0, PT, R0, 0x1, PT;                      /* 0xb3201c00009c001d */
    /*02f8*/                   IADD.X R0, R9, -0x1;                                       /* 0xc88043ffff9c2401 */
                                                                                          /* 0x08b4a480a010b010 */
    /*0308*/                   SEL R10, R11, R8, !P0;                                     /* 0xe5002000041c2c2a */
    /*0310*/               @P0 IADD R8.CC, R8, 0x1;                                       /* 0xc084000000802021 */
    /*0318*/                   SEL R11, R0, R9, !P0;                                      /* 0xe5002000049c002e */
    /*0320*/               @P0 IADD.X R9, R9, RZ;                                         /* 0xe08040007f802426 */
    /*0328*/                   DFMA R14, R10, R6, -R4;                                    /* 0xdb901000031c283a */
    /*0330*/                   DFMA R4, R8, R6, -R4;                                      /* 0xdb901000031c2012 */
    /*0338*/                   DSETP.GT.AND P0, PT, |R4|, |R14|, PT;                      /* 0xdc209c00071c121e */
                                                                                          /* 0x08b4acb4a010b810 */
    /*0348*/                   SEL R8, R10, R8, P0;                                       /* 0xe5000000041c2822 */
    /*0350*/                   SEL.S R9, R11, R9, P0;                                     /* 0xe500000004dc2c26 */
    /*0358*/                   MOV R8, RZ;                                                /* 0xe4c03c007f9c0022 */
    /*0360*/                   MUFU.RCP64H R9, R7;                                        /* 0x84000000031c1c26 */
    /*0368*/                   DSETP.GT.AND P0, PT, |R8|, RZ, PT;                         /* 0xdc201c007f9c221e */
    /*0370*/               @P0 BRA.U 0x398;                                               /* 0x120000001000023c */
    /*0378*/              @!P0 DSETP.NEU.AND P1, PT, |R6|, +INF , PT;                     /* 0xb4681fff80201a3d */
                                                                                          /* 0x0800b8a010ac0010 */
    /*0388*/              @!P0 SEL R9, R7, R9, P1;                                        /* 0xe500040004a01c26 */
    /*0390*/              @!P0 SEL R8, R6, RZ, P1;                                        /* 0xe50004007fa01822 */
    /*0398*/                   DMUL.S R8, R8, R4;                                         /* 0xe4000000025c2022 */
    /*03a0*/                   MOV R4, R8;                                                /* 0xe4c03c00041c0012 */
    /*03a8*/                   MOV R5, R9;                                                /* 0xe4c03c00049c0016 */
    /*03b0*/                   RET;                                                       /* 0x19000000001c003c */
    /*03b8*/                   BRA 0x3b8;                                                 /* 0x12007ffffc1c003c */

The MUFU.RCP64H instruction provides the initial approximation of the reciprocal. It operates on the high 32 bits of the denominator (y) and provides the high 32 bits of the double precision approximation, and is therefore counted as a Floating Point Operation (Single Precision Special) by the profiler. There is another single precision FFMA instruction further down, apparently used as a high-throughput way of testing a condition where full precision isn't required.
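
To get a feel for how much information the high 32 bits carry, the sketch below rebuilds a double from its upper word alone and reports the relative error. The upper word of an IEEE 754 double holds the sign, the 11-bit exponent and the top 20 mantissa bits. The kernel and wrapper names are hypothetical, and the sketch says nothing about the actual accuracy of MUFU.RCP64H, which is not documented here.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Rebuild y from its upper 32 bits only and report the relative loss.
__global__ void high_word_sketch(double y, double *truncated, double *rel_err) {
    int hi = __double2hiint(y);             // sign, exponent, top 20 mantissa bits
    double y_hi = __hiloint2double(hi, 0);  // lower 32 mantissa bits zeroed
    *truncated = y_hi;
    *rel_err = fabs(y - y_hi) / fabs(y);
}

// Hypothetical host wrapper: relative error for y, or -1.0 on a CUDA error.
double high_word_rel_err_sketch(double y) {
    double *d_out = NULL, h_out[2] = {0.0, 0.0};
    if (cudaMalloc((void **)&d_out, 2 * sizeof(double)) != cudaSuccess) return -1.0;
    high_word_sketch<<<1, 1>>>(y, d_out, d_out + 1);
    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, 2 * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[1] : -1.0;
}

int main() {
    printf("relative error from dropping the low word: %g\n",
           high_word_rel_err_sketch(3.141592653589793));
    return 0;
}
```

For a normal, non-zero input the reported error stays below 2^-20, so anything derived from the high word alone can only be a starting point, which is why the DFMA refinement steps follow in the listing.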
