Accessing a vector of pointers to other vectors on the GPU

Question

This is a follow-up to an earlier question of mine. At the moment, in a CPU version of some code, I have many things that look like the following:

for(int i =0;i<N;i++){

    dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');

}

where A[i] will be a 2D matrix of some size.

I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), so for example:

 for(int i =0;i<N;i++){
        status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}

but I would like to store my B's on the GPU in advance, at the start of the program, as they don't change. So I need a vector that contains pointers to the set of vectors that make up my B's.

I currently have the following code, which compiles:

double **GlobalFVecs_d;   // device array of device pointers
double **GlobalFPVecs_d;  // host array holding the same device pointers

extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){

    cudaError_t err;

    // Host-side table that will hold one device pointer per pulsar
    GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
    // Device-side table of the same size
    err = cudaMalloc( (void **)&GlobalFVecs_d, numpulsars*sizeof(double*) );
    checkCudaError(err);

    for(int i = 0; i < numpulsars; i++){
        err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
        checkCudaError(err);
        err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
        checkCudaError(err);
    }

    // Copy the table of device pointers from the host to the device
    err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
    checkCudaError(err);

}

but if I now try to access it with:

 dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
 dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
 dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
 dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;

 for(int i =0; i < numpulsars; i++){
    CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
 }

it seg faults here. Is this not how to get at the data?

The kernel function that I'm calling is just:

__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {

    // Each thread adds one element of this pulsar's numcoeff x numcoeff block
    // (PPFNF_d) into the matching diagonal block of the full matrix (FNF_d)

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    int subrow=row-thispulsar*numcoeff;
    int subcol=col-thispulsar*numcoeff;

    __syncthreads();
    if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
    if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;

    FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];

}

What am I not doing right? Note that eventually I would also like to do as in the first example, calling CULA functions on each GlobalFVecs_d[i], but for now not even this works.

Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large contiguous vector I could do that too, but I don't know if it supports that.

Cheers, Lindley

Answer 1

Change this:

CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);

to this:

CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);

and I believe it will work.

Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resulting pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
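To make the distinction concrete, here is a minimal sketch of reading one entry of a device-resident pointer table from host code without dereferencing it there. The helper name fetchDevicePointer and its parameters are hypothetical and not part of the original code; it only assumes the table was filled the way GlobalFVecs_d is above. Pointer arithmetic on a device pointer is fine on the host, and the actual read is done by cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): returns entry i of a
// device-resident table of device pointers, or nullptr on failure.
double *fetchDevicePointer(double **table_d, int i) {
    double *entry = nullptr;
    // table_d + i is plain pointer arithmetic and is legal in host code.
    // The memory read itself is performed by cudaMemcpy, not by the host
    // dereferencing table_d.
    cudaError_t err = cudaMemcpy(&entry, table_d + i, sizeof(double *),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return entry;
}
```

The returned pointer could then be passed to a kernel launch or a library call. This costs one small device-to-host copy per lookup, so a host-side copy of the table is usually the cheaper option when one is available.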

However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is something that we are allowed to dereference in host code, to retrieve the device pointer to pass to the kernel.
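The device-resident table is still useful when the indexing happens inside a kernel, because dereferencing it there is device code. The following is only a sketch under that assumption: the kernel scaleAll, the function runScaleAllDemo, and the names mirror and table_d are all hypothetical stand-ins (mirror plays the role of GlobalFPVecs_d and table_d the role of GlobalFVecs_d).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element of every vector by factor.
// vecs_d is a device array of device pointers; indexing it here is legal
// because this code runs on the device.
__global__ void scaleAll(double **vecs_d, int numvecs, int len, double factor) {
    int v = blockIdx.y;                             // which vector
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // element within it
    if (v < numvecs && k < len) vecs_d[v][k] *= factor;
}

// Hypothetical demo: builds both tables, launches the kernel once for all
// vectors, and returns true if every element was doubled.
bool runScaleAllDemo() {
    const int numvecs = 2, len = 4;
    double host[len] = {1.0, 2.0, 3.0, 4.0};
    double *mirror[numvecs] = {nullptr, nullptr};  // host array of device pointers
    double **table_d = nullptr;                    // device array of device pointers
    bool ok = true;

    for (int v = 0; v < numvecs; ++v) {
        ok = ok && cudaMalloc((void **)&mirror[v], sizeof(host)) == cudaSuccess;
        ok = ok && cudaMemcpy(mirror[v], host, sizeof(host),
                              cudaMemcpyHostToDevice) == cudaSuccess;
    }
    ok = ok && cudaMalloc((void **)&table_d, sizeof(mirror)) == cudaSuccess;
    ok = ok && cudaMemcpy(table_d, mirror, sizeof(mirror),
                          cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 block(32);
        dim3 grid((len + block.x - 1) / block.x, numvecs);
        // Pass the table itself; no host-side indexing of a device pointer.
        scaleAll<<<grid, block>>>(table_d, numvecs, len, 2.0);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }

    for (int v = 0; v < numvecs; ++v) {
        double out[len];
        // mirror[v] is read from host memory, so this dereference is fine.
        ok = ok && cudaMemcpy(out, mirror[v], sizeof(out),
                              cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int k = 0; k < len; ++k) ok = ok && out[k] == 2.0 * host[k];
        cudaFree(mirror[v]);
    }
    cudaFree(table_d);
    return ok;
}
```

In this layout the host array is what host code indexes, and the device array is what kernels index; keeping both avoids extra copies in either direction.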

It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.
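For comparison, a failure inside a kernel or in its launch configuration is reported through the CUDA runtime rather than as a host seg fault. A minimal sketch of checking for both kinds of error follows; the kernel writeOne and the helper launchAndCheck are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical trivial kernel used only to have something to launch.
__global__ void writeOne(double *out_d) { out_d[0] = 1.0; }

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t launchAndCheck(double *out_d) {
    writeOne<<<1, 1>>>(out_d);
    // Errors in the launch itself (e.g. an invalid configuration)
    cudaError_t err = cudaGetLastError();
    // Errors raised while the kernel runs (e.g. an illegal memory access)
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

A check like this after each launch would report an in-kernel fault as an error code, whereas the crash described above happens on the host before the kernel is ever launched.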

Answer 1 (accepted), 2013-05-31 16:30:25
