CUDA's warp shuffle for double-precision data
My CUDA program needs to perform a reduction on double-precision data. I am using the shuffle function from Julien Demouth's slides "Shuffle: Tips and Tricks".
The shuffle function is below:
/*for shuffle of double-precision point */
__device__ __inline__ double shfl(double x, int lane)
{
int warpSize = 32;
// Split the double number into 2 32b registers.
int lo, hi;
asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
// Shuffle the two 32b registers.
lo = __shfl_xor(lo,lane,warpSize);
hi = __shfl_xor(hi,lane,warpSize);
// Recreate the 64b number.
asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
return x;
}
At present, I get the errors below when compiling the program.
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error : Arguments mismatch for instruction 'mov'
ptxas fatal : Ptx assembly aborted due to errors
make: *** [csr_double] error 255
Could someone give some advice?
There is a syntax error in the inline assembly instruction that unpacks the double argument into two 32-bit registers. This:
asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
should be:
asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
Using a "d" operand (i.e. a 64-bit floating-point register) as the source of a 32-bit mov is illegal, and mov.b32 makes no sense here in any case: the code must unpack 64 bits into two 32-bit registers, which requires mov.b64.
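For reference, here is a sketch of the whole function with that one-line fix applied. It keeps the question's code otherwise unchanged, so it assumes a toolkit and target architecture on which the legacy __shfl_xor intrinsic is still available. Note that the second argument is used as an XOR lane mask, not as a source lane index.

```cuda
/* Shuffle (XOR variant) for a double-precision value. */
__device__ __inline__ double shfl(double x, int lane)
{
    int warpSize = 32;  // kept from the question; shadows the built-in warpSize
    // Split the double into two 32-bit registers (lo = low word, hi = high word).
    int lo, hi;
    asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "d"(x));
    // Shuffle the two 32-bit halves independently.
    // Assumption: legacy __shfl_xor is available on the target toolchain.
    lo = __shfl_xor(lo, lane, warpSize);
    hi = __shfl_xor(hi, lane, warpSize);
    // Reassemble the 64-bit value from the shuffled halves.
    asm volatile("mov.b64 %0, {%1,%2};" : "=d"(x) : "r"(lo), "r"(hi));
    return x;
}
```

Both asm statements now use mov.b64, so the operand widths on each side of the move match: one 64-bit register against a pair of 32-bit registers.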
As of CUDA 9.0, __shfl, __shfl_up, __shfl_down and __shfl_xor have been deprecated.
The newly introduced functions __shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync have the following prototypes:
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
where T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.
You no longer need to write your own shuffle routine for double-precision values.
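As an illustration, the reduction from the question could be written directly on doubles with __shfl_xor_sync. This is a minimal sketch, not the original poster's program: the names warpReduceSum and sumOneWarp are hypothetical, the full mask 0xffffffff assumes all 32 lanes of the warp take part, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Butterfly (XOR) reduction: after the loop every lane holds the warp-wide sum.
__device__ __inline__ double warpReduceSum(double val)
{
    const unsigned fullMask = 0xffffffffu;  // assumption: all 32 lanes are active
    for (int laneMask = warpSize / 2; laneMask > 0; laneMask /= 2)
        val += __shfl_xor_sync(fullMask, val, laneMask);
    return val;
}

// Hypothetical test kernel: a single warp sums 32 doubles.
__global__ void sumOneWarp(const double *in, double *out)
{
    double total = warpReduceSum(in[threadIdx.x]);
    if (threadIdx.x == 0)
        *out = total;
}

int main()
{
    double hostIn[32], hostOut = 0.0;
    for (int i = 0; i < 32; ++i)
        hostIn[i] = i + 1;  // 1 + 2 + ... + 32 = 528

    double *devIn, *devOut;
    cudaMalloc(&devIn, sizeof(hostIn));
    cudaMalloc(&devOut, sizeof(double));
    cudaMemcpy(devIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice);

    sumOneWarp<<<1, 32>>>(devIn, devOut);  // one block of exactly one warp
    cudaMemcpy(&hostOut, devOut, sizeof(double), cudaMemcpyDeviceToHost);

    printf("warp sum = %f\n", hostOut);
    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}
```

The XOR pattern leaves the full sum in every lane. If only lane 0 needs the result, __shfl_down_sync with the same loop structure would work as well.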