
How to use 3D matrices with CULA on a GPU?

In the CPU version of some code, I have many calls that look like the following:

for (int i = 0; i < N; i++) {
    dgemm(A[i], B[i], C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N', 'T');
}

Here A[i] is a 2D matrix of some size.

I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), for example:

for (int i = 0; i < N; i++) {
    status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}

However, I would like to store my B's on the GPU in advance at the start of the program, since they don't change, but I have no idea how to go about doing that, or how to store my arrays in general so that this is possible.
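For concreteness, here is a minimal sketch of the kind of setup this describes (my own illustration, not code from the original post; the names GlobalB_d, uploadBMatrices, numMats, rows and cols are hypothetical). Each B[i] is copied to the device once at startup, and the resulting device pointers are kept in a host-side array so they can be handed to culaDeviceDgemm later:

#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical sketch: one device allocation per matrix, filled once at startup.
double **GlobalB_d;   // host-side array holding one device pointer per matrix

void uploadBMatrices(double **B, int numMats, int rows, int cols)
{
    GlobalB_d = (double **)malloc(numMats * sizeof(double *));
    for (int i = 0; i < numMats; i++) {
        cudaMalloc((void **)&GlobalB_d[i], rows * cols * sizeof(double));
        cudaMemcpy(GlobalB_d[i], B[i], rows * cols * sizeof(double),
                   cudaMemcpyHostToDevice);
    }
}

// Later, GlobalB_d[i] is an ordinary device pointer, e.g.
// status = culaDeviceDgemm(..., GlobalB_d[i], ...);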

I've seen various things online about using 3D matrices with CUDA, but they don't seem very applicable to then making calls to the CULA functions.

From the example in the answer below I have this:

extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){


  cudaError_t err;
 err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
 checkCudaError(err);

    for(int i =0; i < numpulsars;i++){
         err = cudaMalloc( (void **) &(GlobalFVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
         checkCudaError(err);    
       //  err = cudaMemcpy( GlobalFVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
        // checkCudaError(err); 
        }

}

where I have declared double **GlobalFVecs_d as a global. But I get a seg fault when it hits the line:

 err = cudaMalloc( (void **) &(GlobalFVecs_d[i]), numcoeff*numcoeff*sizeof(double) );

Yet it seems to be exactly what is in the other example?

I realised it wasn't the same, so I now have code that compiles, with:

double **GlobalFVecs_d;
double **GlobalFPVecs_d;

extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){

    cudaError_t err;
    GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
    err = cudaMalloc( (void***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
    checkCudaError(err);

    for(int i = 0; i < numpulsars; i++){
        err = cudaMalloc( (void**)&(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
        checkCudaError(err);
        err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
        checkCudaError(err);
    }

    err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
    checkCudaError(err);
}

However, if I now try to access it with:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid; // ((G + dimBlock.x - 1) / dimBlock.x, (N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x = (numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;

for(int i = 0; i < numpulsars; i++){
    CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}

It seg faults here instead. Is this not how to get at the data?
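A note on the likely cause, stated as an assumption rather than a confirmed diagnosis: after the final cudaMemcpy above, GlobalFVecs_d points to device memory, so indexing GlobalFVecs_d[i] on the host dereferences a device pointer. The host-side copy GlobalFPVecs_d[i] holds the same device pointers and is safe to read on the host:

for(int i = 0; i < numpulsars; i++){
    // GlobalFPVecs_d is in host memory, so reading GlobalFPVecs_d[i] here is
    // valid; the value it holds is a device pointer the kernel can use
    CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
}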

  1. Allocate memory for B with cudaMalloc()
  2. Copy it from host to device with cudaMemcpy()
  3. Pass the device pointer in the kernel argument list

Finally, you use it from the kernel with the argument you have passed. Example:

// Kernel definition, see also section 4.2.3 of the NVIDIA CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

#include <stdio.h>
#define SIZE 10
int main()
{
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);                    // <-- important
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);  // <-- important
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);              // <-- important
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}

The lines marked // <-- important are the important part for you. Example taken from here. You may want to look at this question.

EDIT #1: First of all, to declare a kernel function you need to place the keyword __global__ before the return type, e.g.

__global__ void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff)

Moreover, I would use just one pointer to the first element of the matrix you have.

double *devPtr

Allocate it with

cudaMalloc((void**)&devPtr, size)

and then copy

cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice)

Note that to calculate the size of your structure, you need the dimensions (say X and Y) and the size of the underlying element type (say double):

size_t size = X*Y*sizeof(double);

sizeof(double*) would instead give the size of a pointer to a double, which is incorrect (on 32-bit machines the size of a pointer is 4 bytes, but the size of a double is 8 bytes).
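To make the single-pointer suggestion concrete, here is a minimal sketch (my own illustration, not code from the answer; the helper name uploadFlattened and the parameters N, X, Y are hypothetical). All N matrices of size X by Y are stored back to back in one contiguous device allocation, and matrix i starts at offset i*X*Y:

#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical sketch: one device allocation holding all N matrices in a row.
double *uploadFlattened(double **host, int N, int X, int Y)
{
    double *devPtr;
    size_t matSize = (size_t)X * Y * sizeof(double);

    cudaMalloc((void**)&devPtr, N * matSize);   // one allocation for everything
    for (int i = 0; i < N; i++)                 // copy each matrix into its slot
        cudaMemcpy(devPtr + (size_t)i * X * Y, host[i], matSize,
                   cudaMemcpyHostToDevice);
    return devPtr;
}

// Matrix i then begins at devPtr + i*X*Y, which can be passed to a kernel or
// to culaDeviceDgemm as an ordinary device pointer; for row-major storage,
// element (r, c) of matrix i is devPtr[i*X*Y + r*Y + c].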
