
cuda & cublas: call a global function after using cublas

I wrote a program that involves some matrix-vector multiplications and least-squares solves, all using the cublas & cula APIs. The program iterates many times, and in each step I must set one particular row of a matrix entirely to zero.

I tried copying the entire matrix (50*1000 or larger) to the CPU, setting one row to zero, and then copying the matrix back, but that is too time-consuming because the program iterates 10 times or more. So I decided to write a kernel function.
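For reference, the host round-trip described above looks roughly like this (a minimal sketch; lda, ncols and row are placeholder names, not from the original code):

cudaMemcpy(A_host, A_gpu, sizeof(float)*lda*ncols, cudaMemcpyDeviceToHost);
for (int c = 0; c < ncols; ++c)
    A_host[row + c*lda] = 0.0f;   // zero one row of the col-major matrix on the CPU
cudaMemcpy(A_gpu, A_host, sizeof(float)*lda*ncols, cudaMemcpyHostToDevice);

Two full-matrix copies over PCIe on every iteration are what make this approach so slow.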

The global function looks like this:

__global__ void Setzero(float* A, int index)  /* A is the matrix, in col-major order; index is the row I want to set to zero */
{
    int ind = blockDim.x*blockIdx.x + threadIdx.x;
    if ( ((ind % N) == index) && (ind < 50000) )  // notice the matrix is in col-major order; matrix size is 50000
    {
        A[ind] = 0.0;
        ind += blockDim.x*blockIdx.x;
    }
    else ;
        __syncthreads();
}

The problem is that when I do this (using cublas before calling the function):

cudaMalloc((void**)&A_Gpu_trans,sizeof(float)*50000);
cudaMemcpy(A_Gpu_trans,A_trans,sizeof(float)*M*N,cudaMemcpyHostToDevice);
cublasSgemv_v2(handle,CUBLAS_OP_N,1000,50,&al,A_Gpu_trans,1000,err_gpu,1,&beta,product,1);
dim3 dimBlock(16,1);
dim3 dimGrid((50000-1)/16+1,1);
Setzero<<<dimGrid,dimBlock>>>(A_Gpu_trans,Index);

it returns this error:

a __host__ function("Setzero") redeclared with __global__.

and another error:

MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "D:\Program Files\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "C:\Users\Administrator\documents\visual studio 2010\Projects\OOmp\OOmp\kernel.cu"" exited with code 2.

What is strange is that when I only use the cublas & cula APIs, I get the right answer.

Also, your function is both wrong and wildly inefficient...

You can't have a __syncthreads() call inside a conditional like this; it can lead to a hang. It also appears to be entirely unnecessary here.
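If a barrier really were needed, a safe pattern (a minimal sketch, assuming N is the same compile-time constant used in your kernel) keeps __syncthreads() outside any divergent branch, so that every thread in the block reaches it:

__global__ void SetzeroSafe(float* A, int index)
{
    int ind = blockDim.x*blockIdx.x + threadIdx.x;
    if ( ((ind % N) == index) && (ind < 50000) )
        A[ind] = 0.0f;      // only threads on the selected row write
    __syncthreads();        // executed unconditionally by every thread in the block
}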

More to the point, you are launching one thread for every matrix entry, and only 1/N of them actually do anything.

A better approach is to launch only as many threads as there are entries to be set to zero. Something like this:

__global__ void Setzero(float* A, int index) 
{
  int ind = blockDim.x*blockIdx.x + threadIdx.x;  // one thread per column of A
  if (ind < M)                                    // M = number of columns
    A[index + N*ind] = 0.0;                       // element (index, ind) of the col-major matrix; N is the leading dimension
}

and you launch M threads (or rather, ceil(M/256) thread blocks of 256 threads each, or whatever block size you want).

For example:

int block_size = 256; // usually a good choice
int num_blocks = (M + block_size - 1) / block_size;
Setzero<<<num_blocks, block_size>>>(A, index);

Although you have not shown it in your question, you clearly have another host function called Setzero somewhere in your code. The simple solution is to rename the kernel to something else.

The underlying reason the CUDA toolchain emits the error is that the Setzero<<< >>> kernel-invocation syntax of the runtime API causes the CUDA front end to create a host function with the same name as the kernel and a matching argument list, and to substitute a call to that function for the kernel launch. This host function contains the necessary API calls to launch the kernel. By having another host function with the same name as the kernel, you defeat this process and cause the compilation error you see.
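To make the clash concrete, a hypothetical sketch (none of this code is from the question) would be:

void Setzero(float* A, int index);                           // an ordinary host function named Setzero
__global__ void Setzero(float* A, int index) { /* ... */ }   // error: __host__ function "Setzero" redeclared with __global__

// The fix is simply to give the kernel a different name (SetzeroKernel is a hypothetical choice):
__global__ void SetzeroKernel(float* A, int index) { /* ... */ }
// ... later:
// SetzeroKernel<<<num_blocks, block_size>>>(A, index);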
