CUDA对数：nvprof以双精度检测单精度运算

Question

I'm computing "log(x)" in double precision in CUDA, but when I profile, it detects single precision operations using metric "flop_count_sp_special". 我正在CUDA中以双精度计算“ log（x）”，但是当我分析时，它使用度量“ flop_count_sp_special”检测到单精度运算。

I'm compiling with "-arch=sm_30" to ensure compute capability 3.0 and double precision arithmetic, but I can't seem to find a way to ensure special functions are computed in double precision. 我正在使用“ -arch = sm_30”进行编译以确保具有3.0的计算能力和双精度算术，但是我似乎找不到一种方法来确保以双精度计算特殊函数。 Is this possible? 这可能吗？

Answer 1

It appears that the CUDA calculation of double-precision floating-point log does involve (a small number of) single precision floating point calculations. 似乎双精度浮点log的CUDA计算确实涉及（少量）单精度浮点计算。 I think this need not be cause for concern, in and of itself. 我认为这本身不必引起关注。

The GPU hardware does not support a native double-precision log calculation, therefore the compiler, upon encountering this, replaces the instruction with a sequence of operations (eg add, multiply, etc.) designed to perform the calculation. GPU硬件不支持本机双精度log计算，因此，编译器在遇到这种情况时，会使用旨在执行计算的一系列操作（例如加，乘等）替换指令。 We can confirm this with a simple example: 我们可以用一个简单的例子来确认这一点：

$ cat t288.cu
#include <math.h>
#include <stdio.h>
__global__ void kernel(double *y)
{
  double x = *y;
  *y = log(x);
}

int main(){

  double *d_y, h_y = 10.0;
  cudaMalloc(&d_y, sizeof(double));
  cudaMemcpy(d_y, &h_y, sizeof(double), cudaMemcpyHostToDevice);
  kernel<<<1,1>>>(d_y);
  cudaDeviceSynchronize();
  cudaMemcpy(&h_y, d_y, sizeof(double), cudaMemcpyDeviceToHost);
  printf("val = %f\n", h_y);
}

$ nvcc -arch=sm_35 -o t288 t288.cu
$ cuda-memcheck ./t288
========= CUDA-MEMCHECK
val = 2.302585
========= ERROR SUMMARY: 0 errors
$ CUDA_VISIBLE_DEVICES="1" nvprof --metrics flop_count_dp,flop_count_sp,flop_count_sp_special ./t288
==14909== NVPROF is profiling process 14909, command: ./t288
==14909== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "kernel(double*)" (done)
val = 2.302585rnal events
==14909== Profiling application: ./t288
==14909== Profiling result:
==14909== Metric result:
Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "Tesla K40m (0)"
    Kernel: kernel(double*)
          1                             flop_count_dp           Floating Point Operations(Double Precision)          44          44          44
          1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
          1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           1           1           1
$ cuobjdump -sass ./t288

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_35

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_35
                Function : _Z6kernelPd
        .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                  /* 0x088010fc10a010ac */
        /*0008*/                   MOV R1, c[0x0][0x44];                          /* 0x64c03c00089c0006 */
        /*0010*/                   MOV R2, c[0x0][0x140];                         /* 0x64c03c00281c000a */
        /*0018*/                   MOV R3, c[0x0][0x144];                         /* 0x64c03c00289c000e */
        /*0020*/                   LD.E.64 R12, [R2];                             /* 0xc5800000001c0830 */
        /*0028*/                   MOV32I R0, 0xfffffc01;                         /* 0x747ffffe009fc002 */
        /*0030*/                   ISETP.GT.AND P0, PT, R13, c[0x2][0x0], PT;     /* 0x5b481c40001c341e */
        /*0038*/                   MOV R4, R12;                                   /* 0xe4c03c00061c0012 */
                                                                                  /* 0x08ac1080a0a0a4ac */
        /*0048*/                   MOV R5, R13;                                   /* 0xe4c03c00069c0016 */
        /*0050*/              @!P0 DMUL R4, R4, 1.80143985094819840000e+16;       /* 0xc400021a80201011 */
        /*0058*/              @!P0 MOV R13, R5;                                   /* 0xe4c03c0002a00036 */
        /*0060*/                   IADD32I R6, R13, -0x1;                         /* 0x407fffffff9c3419 */
        /*0068*/                   ISETP.LT.U32.AND P1, PT, R6, c[0x2][0x4], PT;  /* 0x5b101c40009c183e */
        /*0070*/              @!P0 MOV R12, R4;                                   /* 0xe4c03c0002200032 */
        /*0078*/              @!P0 MOV32I R0, 0xfffffbcb;                         /* 0x747ffffde5a3c002 */
                                                                                  /* 0x08b88010a4a010ac */
        /*0088*/               @P1 BRA 0xc0;                                      /* 0x120000001804003c */
        /*0090*/                   MOV32I R8, 0x0;                                /* 0x74000000001fc022 */
        /*0098*/                   MOV32I R9, 0x7ff00000;                         /* 0x743ff800001fc026 */
        /*00a0*/                   DFMA R8, R4, +INF , R8;                        /* 0xb38023ff801c1021 */
        /*00a8*/                   FCMP.NEU R12, R8, RZ, R5;                      /* 0xdd6814007f9c2032 */
        /*00b0*/                   FCMP.NEU R13, R9, -QNAN , R5;                  /* 0xbd6817ff801c2435 */
        /*00b8*/                   BRA 0x240;                                     /* 0x12000000c01c003c */
                                                                                  /* 0x08a010b010a0a010 */
        /*00c8*/                   LOP32I.AND R4, R13, 0x800fffff;                /* 0x204007ffff9c3410 */
        /*00d0*/                   IMAD.U32.U32.HI R0, R13, 0x1000, R0;           /* 0xa2000008001c3401 */
        /*00d8*/                   LOP32I.OR R5, R4, 0x3ff00000;                  /* 0x211ff800001c1014 */
        /*00e0*/                   ISETP.LT.AND P0, PT, R5, c[0x2][0x8], PT;      /* 0x5b181c40011c141e */
        /*00e8*/                   MOV R4, R12;                                   /* 0xe4c03c00061c0012 */
        /*00f0*/              @!P0 IADD32I R7, R5, -0x100000;                     /* 0x407ff8000020141d */
        /*00f8*/              @!P0 IADD32I R0, R0, 0x1;                           /* 0x4000000000a00001 */
                                                                                  /* 0x08a010a01080a010 */
        /*0108*/              @!P0 MOV R5, R7;                                    /* 0xe4c03c0003a00016 */
        /*0110*/                   LOP32I.XOR R12, R0, 0x80000000;                /* 0x22400000001c0030 */
        /*0118*/                   DADD R6, R4, 1;                                /* 0xc38001ff801c1019 */
        /*0120*/                   MOV R8, RZ;                                    /* 0xe4c03c007f9c0022 */
        /*0128*/                   DADD R4, R4, -1;                               /* 0xcb8001ff801c1011 */
        /*0130*/                   MUFU.RCP64H R9, R7;                            /* 0x84000000031c1c26 */
        /*0138*/                   MOV32I R14, 0x8b7a8b04;                        /* 0x7445bd45821fc03a */
                                                                                  /* 0x08a01080a4a4a4a4 */
        /*0148*/                   DFMA R6, -R6, R8, c[0x2][0x10];                /* 0x9b882040021c181a */
        /*0150*/                   DFMA R6, R6, R6, R6;                           /* 0xdb801800031c181a */
        /*0158*/                   DFMA R6, R8, R6, R8;                           /* 0xdb802000031c201a */
        /*0160*/                   DMUL R8, R6, R4;                               /* 0xe4000000021c1822 */
        /*0168*/                   DFMA R8, R6, R4, R8;                           /* 0xdb802000021c1822 */
        /*0170*/                   MOV32I R15, 0x3ed0ee25;                        /* 0x741f6877129fc03e */
        /*0178*/                   MOV32I R13, 0x43300000;                        /* 0x74219800001fc036 */
                                                                                  /* 0x08a080a080a4a4a4 */
        /*0188*/                   DMUL R10, R8, R8;                              /* 0xe4000000041c202a */
        /*0190*/                   DFMA R14, R10, c[0x2][0x18], R14;              /* 0x5b803840031c283a */
        /*0198*/                   DFMA R14, R10, R14, c[0x2][0x20];              /* 0x9b803840041c283a */
        /*01a0*/                   DFMA R14, R10, R14, c[0x2][0x28];              /* 0x9b803840051c283a */
        /*01a8*/                   DADD R16, R4, -R8;                             /* 0xe3810000041c1042 */
        /*01b0*/                   DFMA R14, R10, R14, c[0x2][0x30];              /* 0x9b803840061c283a */
        /*01b8*/                   DADD R18, R16, R16;                            /* 0xe3800000081c404a */
                                                                                  /* 0x088880948880a080 */
        /*01c8*/                   DFMA R14, R10, R14, c[0x2][0x38];              /* 0x9b803840071c283a */
        /*01d0*/                   DADD R12, R12, c[0x2][0x50];                   /* 0x638000400a1c3032 */
        /*01d8*/                   DFMA R16, R10, R14, c[0x2][0x40];              /* 0x9b803840081c2842 */
        /*01e0*/                   DFMA R14, R12, c[0x2][0x58], R8;               /* 0x5b8020400b1c303a */
        /*01e8*/                   DFMA R4, -R4, R8, R18;                         /* 0xdb884800041c1012 */
        /*01f0*/                   DFMA R16, R10, R16, c[0x2][0x48];              /* 0x9b804040091c2842 */
        /*01f8*/                   DFMA R18, -R12, c[0x2][0x58], R14;             /* 0x5b8838400b1c304a */
                                                                                  /* 0x08aca4a4a4a08094 */
        /*0208*/                   DMUL R4, R6, R4;                               /* 0xe4000000021c1812 */
        /*0210*/                   DMUL R10, R10, R16;                            /* 0xe4000000081c282a */
        /*0218*/                   DADD R18, -R8, R18;                            /* 0xe3880000091c204a */
        /*0220*/                   DFMA R4, R8, R10, R4;                          /* 0xdb801000051c2012 */
        /*0228*/                   DADD R18, R4, -R18;                            /* 0xe3810000091c104a */
        /*0230*/                   DFMA R12, R12, c[0x2][0x60], R18;              /* 0x5b8048400c1c3032 */
        /*0238*/                   DADD R12, R14, R12;                            /* 0xe3800000061c3832 */
                                                                                  /* 0x080000000000b810 */
        /*0248*/                   ST.E.64 [R2], R12;                             /* 0xe5800000001c0830 */
        /*0250*/                   EXIT;                                          /* 0x18000000001c003c */
        /*0258*/                   BRA 0x258;                                     /* 0x12007ffffc1c003c */
        /*0260*/                   NOP;                                           /* 0x85800000001c3c02 */
        /*0268*/                   NOP;                                           /* 0x85800000001c3c02 */
        /*0270*/                   NOP;                                           /* 0x85800000001c3c02 */
        /*0278*/                   NOP;                                           /* 0x85800000001c3c02 */
                ......................



Fatbin ptx code:
================
arch = sm_35
code version = [6,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

First of all we see that the profiler reports this double precision log op using mostly double-precision calculation but also one single-precision "special" op. 首先，我们看到事件探查器报告此双精度log运算主要使用双精度计算，但也使用一个单精度“特殊”运算。

Looking at the SASS dump, we observe this single-precision instruction: 查看SASS转储，我们观察到以下单精度指令：

        /*0130*/                   MUFU.RCP64H R9, R7;                            /* 0x84000000031c1c26 */

(we can confirm these are single-precision by referring to the documentation ) （我们可以通过参考文档确认这些是单精度的）

Most of the algorithm appears to use double-precision floating point. 大多数算法似乎使用双精度浮点。 This particular step may be involved in some estimation that only requires single-precision. 此特定步骤可能涉及仅需要单精度的某些估计。

So it would appear to be "expected" that use of log may report some (nonzero) flop_count_sp_special metric. 因此，似乎“预期”使用log可能会报告某些（非零） flop_count_sp_special指标。

(Yes, the FCMP instructions are also single-precision, but they appear to be in a code path not taken, for this case.) （是的，FCMP指令也是单精度的，但在这种情况下，它们似乎位于未采用的代码路径中。）

CUDA对数：nvprof以双精度检测单精度运算

问题描述

1 个解决方案

解决方案1
4 已采纳 2018-08-31 18:14:37

CUDA对数：nvprof以双精度检测单精度运算

问题描述

1 个解决方案

解决方案1 4 已采纳 2018-08-31 18:14:37

解决方案1
4 已采纳 2018-08-31 18:14:37