Getting “multiple definition” errors with simple device function in CUDA C
I have a simple program made up of two CUDA files, main.cu and kernel.cu. Their goal is to compute the sum of two vectors.
// main.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include "kernel.cu"

int main(){
    /* Error code to check return values for CUDA calls */
    cudaError_t err = cudaSuccess;
    srand(time(NULL));
    const int count = 100;
    int A[count], B[count];
    int *h_A, *h_B;
    h_A = A; h_B = B;
    int i;
    for(i=0;i<count;i++){
        *(h_A+i) = rand() % count; /* Or: h_A[i] = rand() % count; */
        *(h_B+i) = rand() % count; /* Or: h_B[i] = rand() % count; */
    }
    /* Display vectors A and B. */
    printf("\nFirst five values of A = ");
    for(i=0;i<5;i++){printf("%d ", A[i]);}
    printf("\nFirst five values of B = ");
    for(i=0;i<5;i++){printf("%d ", B[i]);}

    int *d_A, *d_B;
    err = cudaMalloc((void**)&d_A, count*sizeof(int));
    if (err != cudaSuccess){fprintf(stderr, "Failed to allocate device vector A (error code %s)! \n", cudaGetErrorString(err));exit(EXIT_FAILURE);}
    err = cudaMalloc((void**)&d_B, count*sizeof(int));
    if (err != cudaSuccess){fprintf(stderr, "Failed to allocate device vector B (error code %s)! \n", cudaGetErrorString(err));exit(EXIT_FAILURE);}
    err = cudaMemcpy(d_A, A, count*sizeof(int), cudaMemcpyHostToDevice);
    if (err != cudaSuccess){fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));exit(EXIT_FAILURE);}
    err = cudaMemcpy(d_B, B, count*sizeof(int), cudaMemcpyHostToDevice);
    if (err != cudaSuccess){fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));exit(EXIT_FAILURE);}

    int numThreads = 256;
    int numBlocks = count/numThreads + 1;
    AddInts<<<numBlocks,numThreads>>>(d_A,d_B);
    /* Check the launch result before err is reused below */
    err = cudaGetLastError();
    if (err != cudaSuccess){fprintf(stderr, "Failed to launch AddInts kernel (error code %s)!\n", cudaGetErrorString(err));exit(EXIT_FAILURE);}

    err = cudaMemcpy(A, d_A, count*sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess){fprintf(stderr, "Failed to copy vector A from device to host (error code %s)!\n", cudaGetErrorString(err));exit(EXIT_FAILURE);}
    err = cudaFree(d_A);
    if (err != cudaSuccess){fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));exit(EXIT_FAILURE);}
    err = cudaFree(d_B);
    if (err != cudaSuccess){fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));exit(EXIT_FAILURE);}

    printf("\nFirst five values of A = ");
    for(i=0;i<5;i++){printf("%d ", A[i]);}
    printf("\n");
    return 0;
}
This is the kernel.cu file:
// kernel.cu
__device__ int get_global_index(){
    return (blockIdx.x * blockDim.x) + threadIdx.x;
}

__global__ void AddInts(int *a, int *b){
    int ID = get_global_index();
    *(a+ID) += *(b+ID);
}
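I am 100% sure that the main.cu code is correct. I also know that I could put the kernel directly in the main file, but that is not the point of my test. I also know that I could get rid of the __device__ function and put its body directly inside the __global__ function, but that is not my intention either.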
When I compile the test by typing nvcc main.cu kernel.cu in the terminal, I get the following error message:
/tmp/tmpxft_0000248b_00000000-30_kernel.o: In function `get_global_index()':
tmpxft_0000248b_00000000-8_kernel.cudafe1.cpp:(.text+0x15): multiple definition of ` get_global_index()'
/tmp/tmpxft_0000248b_00000000-21_main.o:tmpxft_0000248b_00000000-3_main.cudafe1.cpp:(.text+0x15): first defined here
/tmp/tmpxft_0000248b_00000000-30_kernel.o: In function `__device_stub__Z7AddIntsPiS_(int*, int*)':
tmpxft_0000248b_00000000-8_kernel.cudafe1.cpp:(.text+0x7c): multiple definition of `__device_stub__Z7AddIntsPiS_(int*, int*)'
/tmp/tmpxft_0000248b_00000000-21_main.o:tmpxft_0000248b_00000000-3_main.cudafe1.cpp:(.text+0x68e): first defined here
/tmp/tmpxft_0000248b_00000000-30_kernel.o: In function `AddInts(int*, int*)':
tmpxft_0000248b_00000000-8_kernel.cudafe1.cpp:(.text+0xe5): multiple definition of `AddInts(int*, int*)'
/tmp/tmpxft_0000248b_00000000-21_main.o:tmpxft_0000248b_00000000-3_main.cudafe1.cpp:(.text+0x6f7): first defined here
collect2: error: ld returned 1 exit status
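I believe the error is caused by the definition of the device function get_global_index(), but I don't understand what is wrong with it. Does anyone know where the mistake is?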
Two options:
1. Compile only main.cu (nvcc main.cu); since it already includes kernel.cu, the kernel code is already available to it.
2. Do not include kernel.cu in main.cu.
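When you include kernel.cu in main.cu (and also pass both files to the compiler), the compiler ends up compiling that code (kernel.cu) twice: once when it compiles main.cu and once when it compiles kernel.cu. Both object files then define get_global_index() and AddInts, so the linker reports multiple definitions. If you choose the second option, you need to provide a prototype (forward declaration) of the AddInts kernel in main.cu, perhaps simply by including a header file that contains that prototype. In the more general case, if you spread your code across more files, you may need to add -rdc=true to the compile command line, for example if you have files with __global__ functions that reference __device__ functions defined in other files.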
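A minimal sketch of the second option, written as a single listing with comments marking which part would live in which file. The header name kernel.h, the host helper add_on_device, and the extra n parameter with its bounds check are assumptions made for illustration and are not part of the original post. The bounds check is added because the question launches 256 threads for only 100 elements.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// ---- kernel.h (hypothetical header) ----
// Declaration only, so any number of .cu files can include it safely.
#ifndef KERNEL_H
#define KERNEL_H
__global__ void AddInts(int *a, const int *b, int n);
#endif

// ---- kernel.cu: would begin with #include "kernel.h" ----
__device__ int get_global_index() {
    return (blockIdx.x * blockDim.x) + threadIdx.x;
}

__global__ void AddInts(int *a, const int *b, int n) {
    int id = get_global_index();
    if (id < n) {  // bounds check: an assumption added for this sketch
        a[id] += b[id];
    }
}

// ---- main.cu: would begin with #include "kernel.h", not "kernel.cu" ----
// Hypothetical host helper: computes a[i] += b[i] for i in [0, n) on the GPU.
cudaError_t add_on_device(int *a, const int *b, int n) {
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(int);
    int *d_a = NULL, *d_b = NULL;

    cudaError_t err = cudaMalloc((void **)&d_a, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&d_b, bytes);
    if (err != cudaSuccess) { cudaFree(d_a); return err; }

    err = cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int numThreads = 256;
        int numBlocks = (n + numThreads - 1) / numThreads;  // round up
        AddInts<<<numBlocks, numThreads>>>(d_a, d_b, n);
        err = cudaGetLastError();  // reports launch failures
    }
    // This copy waits for the kernel to finish before reading d_a.
    if (err == cudaSuccess) err = cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    return err;
}
```

Split into those three files, this should build with nvcc main.cu kernel.cu, because each definition is then compiled exactly once. -rdc=true should not be needed in this layout, since get_global_index() and the kernel that calls it sit in the same file. It would become relevant if get_global_index() were moved into a separate .cu file of its own.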