Nvidia CUDA - passing struct by pointer
I have a problem with passing a pointer to a struct to a device function. I want to create a struct in local memory (I know it's slow, it's just an example) and pass it to another function by pointer. The problem is that when I debug it with memcheck on, I get this error:
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 7, warp 0, lane 0
0x0000000000977608 in foo (st=0x3fffc38) at test.cu:15
15 st->m_tx = 99;
If I debug it without memcheck on, it works fine and gives the expected results. My OS is RedHat 6.3 64-bit with kernel 2.6.32-220. I use a GTX680 and CUDA 5.0, and compile the program with sm=30. The code I used for testing is below:
typedef struct __align__(8) {
    int m_x0;
    int m_tx;
} myStruct;
__device__ void foo(myStruct *st) {
    st->m_tx = 99;
    st->m_x0 = 123;
}
__global__ void myKernel(){
    myStruct m_struct ;
    m_struct.m_tx = 45;
    m_struct.m_x0 = 90;
    foo(&m_struct);
}
int main(void) {
    myKernel <<<1,1 >>>();
    cudaThreadSynchronize();
    return 0;
}
Any suggestions? Thanks for any help.
Your example code is completely optimised away by the compiler because none of the code contributes to a global memory write. This is easily proved by compiling the kernel to a cubin file and disassembling the result with cuobjdump:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct.cu
ptxas info : Compiling entry function '_Z8myKernelv' for 'sm_20'
ptxas info : Function properties for _Z8myKernelv
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
$ cuobjdump -sass struct.cubin
code for sm_20
Function : _Z8myKernelv
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
.............................
i.e. the kernel is completely empty. The debugger can't debug the code you want to investigate because it does not exist in what the compiler/assembler emitted. If we take a few liberties with your code:
typedef struct __align__(8) {
    int m_x0;
    int m_tx;
} myStruct;
__device__ __noinline__ void foo(myStruct *st) {
    st->m_tx = 99;
    st->m_x0 = 123;
}
__global__ void myKernel(int dowrite, int *output){
    myStruct m_struct ;
    m_struct.m_tx = 45;
    m_struct.m_x0 = 90;
    if (dowrite) {
        foo(&m_struct);
        output[threadIdx.x] = m_struct.m_tx + m_struct.m_x0;
    }
}
int main(void) {
    int * output;
    cudaMalloc((void **)(&output), sizeof(int));
    myKernel <<<1,1 >>>(1, output);
    cudaThreadSynchronize();
    return 0;
}
and repeat the same compilation and disassembly steps, things look somewhat different:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct_dumb.cu
ptxas info : Compiling entry function '_Z8myKerneliPi' for 'sm_20'
ptxas info : Function properties for _Z8myKerneliPi
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z3fooP8myStruct
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 40 bytes cmem[0]
$ /usr/local/cuda/bin/cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKerneliPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20105d034800c000*/ IADD R1, R1, -0x8;
/*0010*/ /*0x68009de218000001*/ MOV32I R2, 0x5a;
/*0018*/ /*0xb400dde218000000*/ MOV32I R3, 0x2d;
/*0020*/ /*0x83f1dc23190e4000*/ ISETP.EQ.AND P0, pt, RZ, c [0x0] [0x20], pt;
/*0028*/ /*0x00101c034800c000*/ IADD R0, R1, 0x0;
/*0030*/ /*0x00109ca5c8000000*/ STL.64 [R1], R2;
/*0038*/ /*0x000001e780000000*/ @P0 EXIT;
/*0040*/ /*0x10011c0348004000*/ IADD R4, R0, c [0x0] [0x4];
/*0048*/ /*0xc001000750000000*/ CAL 0x80;
/*0050*/ /*0x00009ca5c0000000*/ LDL.64 R2, [R0];
/*0058*/ /*0x84011c042c000000*/ S2R R4, SR_Tid_X;
/*0060*/ /*0x90411c4340004000*/ ISCADD R4, R4, c [0x0] [0x24], 0x2;
/*0068*/ /*0x0c201c0348000000*/ IADD R0, R2, R3;
/*0070*/ /*0x00401c8590000000*/ ST [R4], R0;
/*0078*/ /*0x00001de780000000*/ EXIT;
/*0080*/ /*0x8c00dde218000001*/ MOV32I R3, 0x63;
/*0088*/ /*0xec009de218000001*/ MOV32I R2, 0x7b;
/*0090*/ /*0x1040dc8590000000*/ ST [R4+0x4], R3;
/*0098*/ /*0x00409c8590000000*/ ST [R4], R2;
/*00a0*/ /*0x00001de790000000*/ RET;
...............................
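we get actual code in the assembler output: the kernel now reserves an 8-byte stack frame for the struct, stores it to local memory (STL.64), and calls foo (CAL 0x80), which writes 99 and 123 through the pointer. You might have more luck in the debugger with that.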
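To confirm on the host that the write through the pointer really happens, the modified kernel can be extended to copy the result back. This is only a sketch: the CHECK macro and the names d_output and h_output are hypothetical additions, not part of the code above, and cudaDeviceSynchronize is used in place of the older cudaThreadSynchronize. Since foo stores 99 and 123, the sum read back should be 222.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

typedef struct __align__(8) {
    int m_x0;
    int m_tx;
} myStruct;

// __noinline__ keeps foo as a real call, so the struct must live in local memory.
__device__ __noinline__ void foo(myStruct *st) {
    st->m_tx = 99;
    st->m_x0 = 123;
}

__global__ void myKernel(int dowrite, int *output) {
    myStruct m_struct;
    m_struct.m_tx = 45;
    m_struct.m_x0 = 90;
    if (dowrite) {
        foo(&m_struct);
        // The global memory write is what stops the compiler from
        // discarding the whole kernel body.
        output[threadIdx.x] = m_struct.m_tx + m_struct.m_x0;
    }
}

int main(void) {
    int *d_output = NULL;  // device buffer for one int
    int h_output = 0;      // host copy of the result

    CHECK(cudaMalloc((void **)&d_output, sizeof(int)));
    myKernel<<<1, 1>>>(1, d_output);
    CHECK(cudaGetLastError());        // launch configuration errors
    CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran
    CHECK(cudaMemcpy(&h_output, d_output, sizeof(int), cudaMemcpyDeviceToHost));

    std::printf("result = %d\n", h_output);  // 99 + 123 = 222
    CHECK(cudaFree(d_output));
    return h_output == 222 ? 0 : 1;
}
```

Checking the return value of cudaDeviceSynchronize matters here because an illegal address inside the kernel is only reported to the host at the next synchronising call.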
I am from the CUDA developer tools team. When compiled for device-side debugging (i.e. with -G), the original code will not be optimized out. The issue looks like a memcheck bug. Thank you for finding this. We will look into it.
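For anyone who wants to reproduce this, a minimal sketch of the original program with host-side error reporting added is shown here. The build line in the comment and the file name test.cu are assumptions based on the question (sm=30, CUDA 5.0), and the error reporting in main is an addition that the original code does not have.

```cuda
// Assumed build line, with device-side debug info so the kernel body is kept:
//   nvcc -g -G -arch=sm_30 test.cu -o test
// The binary can then be run directly, under cuda-memcheck, or inside
// cuda-gdb (optionally after `set cuda memcheck on`) to compare behaviour.
#include <cstdio>
#include <cuda_runtime.h>

typedef struct __align__(8) {
    int m_x0;
    int m_tx;
} myStruct;

__device__ void foo(myStruct *st) {
    st->m_tx = 99;   // the store reported as faulting in the question
    st->m_x0 = 123;
}

__global__ void myKernel() {
    myStruct m_struct;   // lives in per-thread local memory when built with -G
    m_struct.m_tx = 45;
    m_struct.m_x0 = 90;
    foo(&m_struct);
}

int main(void) {
    myKernel<<<1, 1>>>();
    // Launch errors and execution errors are reported separately.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    std::printf("launch: %s\n", cudaGetErrorString(launchErr));
    std::printf("sync:   %s\n", cudaGetErrorString(syncErr));
    return (launchErr == cudaSuccess && syncErr == cudaSuccess) ? 0 : 1;
}
```

If the run only fails when memcheck is enabled, that is consistent with the problem being in the tool rather than in the kernel.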