CUDA warp shuffle for double-precision data

Question

My CUDA program needs to perform a reduction on double-precision data. I am using the approach from Julien Demouth's slides "Shuffle: Tips and Tricks".

The shuffle function is below:

/*for shuffle of double-precision point */
__device__ __inline__ double shfl(double x, int lane)
{
    int warpSize = 32;
    // Split the double number into 2 32b registers.
    int lo, hi;
    asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    // Shuffle the two 32b registers.
    lo = __shfl_xor(lo,lane,warpSize);
    hi = __shfl_xor(hi,lane,warpSize);
    // Recreate the 64b number.
    asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
    return x;
}

At present, I get the errors below when compiling the program:

ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error   : Arguments mismatch for instruction 'mov'
ptxas fatal   : Ptx assembly aborted due to errors
make: *** [csr_double] error 255

Could someone give some advice?

Answer 1

There is a syntax error in the inline assembly instruction that unpacks the double argument into two 32-bit registers. This:

asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));

should be:

asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));

Using a "d" operand (i.e. a 64-bit floating-point register) as the source of a 32-bit mov is illegal. A mov.b32 makes no sense here in any case: the code must unpack 64 bits into two 32-bit registers, which requires mov.b64.
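For reference, here is a minimal sketch of the whole helper with that one-line fix applied. The name shfl_xor_double and the parameter name laneMask are hypothetical, chosen for illustration. The sketch also assumes CUDA 9.0 or later and a fully active warp, so it calls __shfl_xor_sync with a full mask; on the older toolkit used in the question, __shfl_xor(lo, laneMask) would take its place. The redundant local warpSize from the question is dropped, since warpSize is already a built-in variable in device code.

```cuda
// Sketch of the corrected helper (hypothetical name, not from the slides).
__device__ __forceinline__ double shfl_xor_double(double x, int laneMask)
{
    int lo, hi;
    // Unpack the 64-bit value into two 32-bit registers.
    // Note mov.b64 here, not mov.b32: the source operand is 64 bits wide.
    asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "d"(x));
    // Exchange each half with the lane whose ID is (own lane ID XOR laneMask).
    // Assumption: all 32 lanes of the warp are active, hence the full mask.
    lo = __shfl_xor_sync(0xffffffffu, lo, laneMask);
    hi = __shfl_xor_sync(0xffffffffu, hi, laneMask);
    // Repack the two 32-bit halves into a double.
    asm volatile("mov.b64 %0, {%1,%2};" : "=d"(x) : "r"(lo), "r"(hi));
    return x;
}
```

Because the second argument is XORed with the caller's lane ID, it acts as a lane mask rather than a lane index, which is why the sketch renames it.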

Answer 2

As of CUDA 9.0, __shfl, __shfl_up, __shfl_down and __shfl_xor have been deprecated.

The newly introduced functions __shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync have the following prototypes:

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

where T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.

You no longer need to write your own shuffle function for double-precision values.
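As an illustration, a warp-level sum of doubles can call __shfl_xor_sync directly. The sketch below is a minimal example under stated assumptions, not a tuned reduction: the names warpReduceSum and sumOneWarp are hypothetical, the kernel is launched with exactly one full warp of 32 threads (so the full mask 0xffffffff is valid), and CUDA error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Butterfly (XOR) reduction: after the loop, every lane holds the warp-wide sum.
// Assumption: all 32 lanes are active, so the full mask is used.
__device__ __forceinline__ double warpReduceSum(double val)
{
    for (int laneMask = warpSize / 2; laneMask > 0; laneMask >>= 1)
        val += __shfl_xor_sync(0xffffffffu, val, laneMask);
    return val;
}

// Hypothetical kernel: sums 32 doubles using a single warp.
__global__ void sumOneWarp(const double *in, double *out)
{
    double v = warpReduceSum(in[threadIdx.x]);
    if (threadIdx.x == 0)
        *out = v;
}

int main()
{
    double h_in[32], h_out = 0.0;
    for (int i = 0; i < 32; ++i)
        h_in[i] = 0.5 * i;              // 0.0, 0.5, ..., 15.5

    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(double));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sumOneWarp<<<1, 32>>>(d_in, d_out); // exactly one full warp

    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);   // the inputs add up to 248

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If only some lanes of a warp take part, the mask has to name exactly the participating lanes, so the full mask used here would need to change.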

Answer 1: 4, accepted, 2014-06-07 08:52:23

Answer 2: 3, 2018-03-08 07:37:29
