Pass 2D thrust::device_vector Complex Matrix to CUDA kernel function

I'm new to CUDA and I'm trying to move my existing project to the GPU using CUDA. My code is based on complex matrices and complex buffers.

As a first step, I tried to move this nested for-loop code to CUDA (the rest will be similar):

    typedef thrust::complex<double> smp_t;

    uint8_t *binbuffer = (uint8_t*) malloc(8 * bufsize * sizeof(uint8_t));
    smp_t *sgbuf = (smp_t*) malloc(8 * bufsize * sizeof(smp_t));
    smp_t *cnbuf = (smp_t*) malloc(8 * bufsize * sizeof(smp_t));

    // Create matrix.
    thrust::complex<double> i_unit(0.0, 1.0);
    thrust::host_vector<thrust::host_vector<smp_t>> tw(decfactor);

    // Fill the matrix.
    for (size_t row = 0; row < 8; row++) {
        for (size_t col = 0; col < 8; col++) {
            std::complex<double> tmp =
                exp(-i_unit * 2.0*M_PI * ((double) col*row) / (double)8);
            tw[row].push_back(tmp);
        }
    }

    /* The code to move to GPU processing */
    for (unsigned int i = 0; i < bufsize; i++) {
        for (size_t ch = 0; ch < 8; ch++)
            for (size_t k = 0; k < 8; k++)
                cnbuf[ch*bufsize + i] += sgbuf[k*bufsize + i] * tw[ch].at(k);
    }

This is the code from the .cu file that will replace the current nested for loop:

    __global__ void kernel_func(cuDoubleComplex *cnbuf, cuDoubleComplex *sgbuf, smp_t *tw, size_t block_size) {
        unsigned int ch = threadIdx.x;
        unsigned int k = blockIdx.x;

        for (int x = 0; x < block_size; ++x) {
            unsigned int sig_index = k*block_size + x;
            unsigned int tw_index = ch*k;
            unsigned int cn_index = ch*block_size + x;

            cuDoubleComplex temp = cuCmul(sgbuf[sig_index], make_cuDoubleComplex(tw[tw_index].real(), tw[tw_index].imag()));
            cnbuf[cn_index] = cuCadd(temp, cnbuf[cn_index]);
        }
    }

    void kernel_wrap(
                smp_t *cnbuf,
                smp_t *sgbuf,
                thrust::host_vector<thrust::host_vector<smp_t>> tw,
                size_t buffer_size) {
        smp_t *d_sgbuf;
        smp_t *d_cnbuf;
        thrust::device_vector<smp_t> d_tw(8*8);
        thrust::copy(&tw[0][0], &tw[7][7], d_tw.begin());

        cudaMalloc((void **)&d_sgbuf, buffer_size);
        cudaMalloc((void **)&d_cnbuf, buffer_size);

        cudaMemcpy(d_sgbuf, sgbuf, buffer_size, cudaMemcpyDeviceToHost);
        cudaMemcpy(d_cnbuf, cnbuf, buffer_size, cudaMemcpyDeviceToHost);

        thrust::raw_pointer_cast(d_tw.data());

        kernel_func<<<8, 8>>>(
                reinterpret_cast<cuDoubleComplex*>(d_cnbuf),
                reinterpret_cast<cuDoubleComplex*>(d_sgbuf),
                thrust::raw_pointer_cast(d_tw.data()),
                buffer_size
        );

        cudaError_t varCudaError1 = cudaGetLastError();
        if (varCudaError1 != cudaSuccess) {
                std::cout << "Failed to launch subDelimiterExamine kernel (error code: " << cudaGetErrorString(varCudaError1) << ")!" << std::endl;
                exit(EXIT_FAILURE);
        }

        cudaMemcpy(sgbuf, d_sgbuf, buffer_size, cudaMemcpyHostToDevice);
        cudaMemcpy(cnbuf, d_cnbuf, buffer_size, cudaMemcpyHostToDevice);
    }

When I run the code, I get the error:

Failed to launch subDelimiterExamine kernel (error code: invalid argument)!

I think the argument causing the trouble is 'd_tw'. So, my questions are:

  1. What am I doing wrong in the conversion of thrust::host_vector<thrust::host_vector<smp_t>> to thrust::device_vector<smp_t> (from a 2D matrix to one flattened array)?
  2. Is there a better way to work with 2D complex numbers in CUDA?
  3. The documentation about complex arrays in CUDA is very poor; where can I read about working with CUDA complex matrices?

Thanks!

There were various problems. I will list a few, and probably miss some, so please refer to the example code below for additional differences.

  1. The most immediate problem is here:

     thrust::copy(&tw[0][0], &tw[7][7], d_tw.begin());

This is what is giving rise to the invalid argument error you are seeing. Under the hood, thrust is going to try to use a cudaMemcpyAsync operation for this, because it is inherently a copy from host to device. We will fix this by replacing it with an ordinary cudaMemcpy operation, but to understand how to construct that, it's necessary to understand item 2.

  2. You seem to think that a vector of vectors implies contiguous storage. It does not, and that statement is not specific to thrust. Since a thrust::host_vector of host_vectors (or even a std::vector of vectors) does not imply contiguous storage, we can't easily construct a single operation, such as cudaMemcpy or thrust::copy, to copy this data. Therefore it is necessary to explicitly flatten it, as shown in the sketch after this list.

  3. Your directions of copy on the cudaMemcpy operations are universally backward. Where you should have had cudaMemcpyHostToDevice, you had cudaMemcpyDeviceToHost, and vice versa.

  4. The CUDA cuComplex.h header file predates thrust, and was provided as a quick C-style method to work with complex numbers. There is no documentation for it; you have to read the file itself and work out how to use it, as you seem to have already done. However, since you are using thrust::complex<> anyway, it's far simpler just to use that coding paradigm, and write your device code to look almost exactly like your host code.

  5. You had various transfer sizes wrong. cudaMemcpy takes a size in bytes to transfer.
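
In code, the explicit flattening and the corrected copy look roughly like this (a minimal sketch assuming the fixed 8x8 matrix size; the complete example below applies the same fix, and its kernel_func also shows the thrust::complex device-code style from item 4):

    // Flatten the host_vector-of-host_vectors into one contiguous buffer.
    thrust::host_vector<smp_t> htw(8 * 8);
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            htw[i*8 + j] = tw[i][j];

    // One ordinary cudaMemcpy: host-to-device direction, size in bytes.
    thrust::device_vector<smp_t> d_tw(8 * 8);
    cudaMemcpy(thrust::raw_pointer_cast(d_tw.data()), &htw[0],
               8 * 8 * sizeof(smp_t), cudaMemcpyHostToDevice);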

What follows is an example, cobbled together from the pieces you have shown, with a variety of "fixes". I'm not claiming it is in any way perfect or correct, but it avoids the issues I have outlined above. Furthermore, depending on whether you compile with or without a -DUSE_KERNEL define, it will either run your "original" host code and display the output, or run the kernel code and display the output. According to my testing, the outputs match.

$ cat t1751.cu
#include <thrust/complex.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>
#include <cstdint>
#include <cuComplex.h>

typedef thrust::complex<double> smp_t;
__global__ void kernel_func_old(cuDoubleComplex *cnbuf, cuDoubleComplex *sgbuf, smp_t *tw, size_t block_size) {
    unsigned int ch = threadIdx.x;
    unsigned int k = blockIdx.x;

     for (int x = 0; x < block_size; ++x) {
            unsigned int sig_index = k*block_size+x;
            unsigned int tw_index = ch*k;
            unsigned int cn_index = ch*block_size+x;


            cuDoubleComplex temp = cuCmul(sgbuf[sig_index], make_cuDoubleComplex(tw[tw_index].real(), tw[tw_index].imag()));
            cnbuf[cn_index] = cuCadd(temp, cnbuf[cn_index]);
     }
}
__global__ void kernel_func(smp_t *cnbuf, smp_t *sgbuf, smp_t *tw, size_t block_size) {
    unsigned row = blockIdx.x;
    unsigned col = threadIdx.x;
    unsigned idx = row*block_size+col;
    for (int k = 0; k < 8; k++)
      cnbuf[idx] += sgbuf[k*block_size+col] * tw[row*block_size+k];
}

void kernel_wrap(
            smp_t *cnbuf,
            smp_t *sgbuf,
            thrust::host_vector<thrust::host_vector<smp_t>>tw,
            size_t buffer_size) {
    smp_t *d_sgbuf;
    smp_t *d_cnbuf;
    thrust::device_vector<smp_t> d_tw(8*8);
//    thrust::copy(&tw[0][0], &tw[7][7], d_tw.begin());
    thrust::host_vector<smp_t> htw(buffer_size*buffer_size);
    for (int i = 0; i < buffer_size; i++)
      for (int j = 0; j < buffer_size; j++)
        htw[i*buffer_size + j] = tw[i][j];

    cudaMemcpy(thrust::raw_pointer_cast(d_tw.data()), &htw[0], 8*8*sizeof(smp_t), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_sgbuf, buffer_size*buffer_size*sizeof(smp_t));
    cudaMalloc((void **)&d_cnbuf, buffer_size*buffer_size*sizeof(smp_t));

    cudaMemcpy(d_sgbuf, sgbuf, buffer_size*buffer_size*sizeof(smp_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cnbuf, cnbuf, buffer_size*buffer_size*sizeof(smp_t), cudaMemcpyHostToDevice);

    thrust::raw_pointer_cast(d_tw.data());

    kernel_func<<<8, 8>>>(d_cnbuf,d_sgbuf,thrust::raw_pointer_cast(d_tw.data()),buffer_size);

    cudaError_t varCudaError1 = cudaGetLastError();
    if (varCudaError1 != cudaSuccess)
    {
            std::cout << "Failed to launch subDelimiterExamine kernel (error code: " << cudaGetErrorString(varCudaError1) << ")!" << std::endl;
            exit(EXIT_FAILURE);
    }

//    cudaMemcpy(sgbuf, d_sgbuf, buffer_size*buffer_size*sizeof(smp_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(cnbuf, d_cnbuf, buffer_size*buffer_size*sizeof(smp_t), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 8; i++)
      for (int j = 0; j < 8; j++)
        std::cout << cnbuf[i*8+j].real() << "," << cnbuf[i*8+j].imag() << std::endl;
}

int main(){
  const int bufsize = 8;
  const int decfactor = 8;

  uint8_t *binbuffer = (uint8_t*) malloc(8 * bufsize * sizeof(uint8_t));
  smp_t *sgbuf = (smp_t*) malloc(8 * bufsize * sizeof(smp_t));
  smp_t *cnbuf = (smp_t*) malloc(8 * bufsize * sizeof(smp_t));
  memset(cnbuf, 0, 8*bufsize*sizeof(smp_t));
 // Create matrix.
 thrust::complex<double> i_unit(0.0, 1.0);
#ifndef USE_KERNEL
 std::vector<std::vector<smp_t> > tw(decfactor);
#else
 thrust::host_vector<thrust::host_vector<smp_t>> tw(decfactor);
#endif

  // Fill the Matrix
  for (size_t row = 0; row < 8; row++) {
       for (size_t col = 0; col < 8; col++) {
              std::complex<double> tmp = exp(-i_unit * 2.0*M_PI * ((double) col*row) / (double)8);
              tw[row].push_back(tmp);
      }
  }
  thrust::complex<double> test(1.0, 1.0);
  for (int i = 0; i < 8*8; i++) sgbuf[i]  = test;
#ifndef USE_KERNEL
/* The Code To Move to the GPU processing */
for (unsigned int i = 0; i < bufsize; i++) {
        for (size_t ch = 0; ch < 8; ch++)
                for (size_t k = 0; k < 8; k++)
                        cnbuf[ch*bufsize + i] += sgbuf[k*bufsize+i] * tw[ch].at(k);
}
    for (int i = 0; i < 8; i++)
      for (int j = 0; j < 8; j++)
        std::cout << cnbuf[i*8+j].real() << "," << cnbuf[i*8+j].imag() << std::endl;
#else

  kernel_wrap(cnbuf,sgbuf,tw,bufsize);
#endif

}
$ nvcc -o t1751 t1751.cu -std=c++11
$ ./t1751 >out_host.txt
$ nvcc -o t1751 t1751.cu -std=c++11 -DUSE_KERNEL
$ ./t1751 >out_device.txt
$ diff out_host.txt out_device.txt
$

Remember, this is mostly your code. I am not claiming it is correct, or defect-free, or suitable for any particular purpose. Use it at your own risk.
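
One further simplification, offered as a sketch rather than as part of the tested example above: since thrust is already in use, the manual cudaMalloc/cudaMemcpy bookkeeping in kernel_wrap can be avoided entirely by keeping every device allocation in a thrust::device_vector. The wrapper name and the pre-flattened htw argument here are hypothetical; the kernel is the thrust-style kernel_func from the example above.

    void kernel_wrap_thrust(smp_t *cnbuf, smp_t *sgbuf,
                            const thrust::host_vector<smp_t> &htw, // pre-flattened 8x8 matrix
                            size_t block_size) {
        // Constructing a device_vector from host data performs the
        // host-to-device copies; sizes and directions are handled for you.
        thrust::device_vector<smp_t> d_sgbuf(sgbuf, sgbuf + block_size * block_size);
        thrust::device_vector<smp_t> d_cnbuf(cnbuf, cnbuf + block_size * block_size);
        thrust::device_vector<smp_t> d_tw = htw;

        kernel_func<<<8, 8>>>(thrust::raw_pointer_cast(d_cnbuf.data()),
                              thrust::raw_pointer_cast(d_sgbuf.data()),
                              thrust::raw_pointer_cast(d_tw.data()),
                              block_size);

        // thrust::copy issues the device-to-host transfer of the result.
        thrust::copy(d_cnbuf.begin(), d_cnbuf.end(), cnbuf);
    }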
