
Can I allocate memory on a CUDA device for objects containing arrays of floats?

I am working on solving identical ordinary differential equations with different initial conditions in parallel. I have solved this problem with OpenMP, and now I want to implement similar code on the GPU. Specifically, I want to allocate memory on the device for floats in the class constructor and then deallocate it in the destructor. It doesn't work for me: my executable gets "terminated by signal SIGSEGV (Address boundary error)". Is it possible to use classes, constructors and destructors in CUDA?

By the way, I am a newbie in CUDA and not very experienced in C++ either.

I attach the code in case I have described my problem poorly.

#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
#include <random>
#include <string>
#include <chrono>
#include <ctime>

using namespace std;

template<class ode_sys>
class solver: public ode_sys 
{
    public:
    int *nn;
    float *t,*tt,*dt,*x,*xx,*m0,*m1,*m2,*m3;

    using ode_sys::rhs_sys;

    __host__ solver(int n): ode_sys(n)
    { // here I try to allocate memory; it works with malloc() but not with cudaMalloc()
        size_t size=sizeof(float)*n;
        cudaMalloc((void**)&nn,sizeof(int));
        *nn=n;
        cudaMalloc((void**)&t,sizeof(float));
        cudaMalloc((void**)&tt,sizeof(float));
        cudaMalloc((void**)&dt,sizeof(float));
        cudaMalloc((void**)&x,size);
        cudaMalloc((void**)&xx,size);
        cudaMalloc((void**)&m0,size);
        cudaMalloc((void**)&m1,size);
        cudaMalloc((void**)&m2,size);
        cudaMalloc((void**)&m3,size);
    }

    __host__ ~solver()
    {
        cudaFree(nn);
        cudaFree(t);
        cudaFree(tt);
        cudaFree(dt);
        cudaFree(x);
        cudaFree(xx);
        cudaFree(m0);
        cudaFree(m1);
        cudaFree(m2);
        cudaFree(m3);
    }

    __host__ __device__ void rk4()
    {//this part is not important now. 
    }
};

class ode 
{
    private:
    int *nn;

    public:
    float *eps,*d;

    __host__ ode(int n)
    {
        cudaMalloc((void**)&nn,sizeof(int));
        *nn=n;
        cudaMalloc((void**)&eps,sizeof(float));
        size_t size=sizeof(float)*n;
        cudaMalloc((void**)&d,size);
    }

    __host__ ~ode()
    {
        cudaFree(nn);
        cudaFree(eps);
        cudaFree(d);
    }

    __host__ __device__ float f(float x_,float y_,float z_,float d_)
    {
        return d_+*eps*(sinf(x_)+sinf(z_)-2*sinf(y_));
    }

    __host__ __device__ void rhs_sys(float *t,float *dt,float *x,float *dx)
    {
    }
};

//const float pi=3.14159265358979f;

__global__ void solver_kernel(int m,int n,solver<ode> *sys_d)
{
    int index = threadIdx.x;
    int stride = blockDim.x;

    //actually ode numerical evaluation should be here
    for (int l=index;l<m;l+=stride)
    {//this is just to check that i can run kernel
        printf("%d Hello \n", l);
    }
}

int main ()
{
    auto start = std::chrono::system_clock::now();
    std::time_t start_time = std::chrono::system_clock::to_time_t(start);
    cout << "started computation at " << std::ctime(&start_time);

    int m=128,n=4,l;    // I want to run 128 threads; n is the dimension of the ODE

    size_t size=sizeof(solver<ode>(n));
    solver<ode> *sys_d;   //an array of objects
    cudaMalloc(&sys_d,size*m);    //nvprof shows that this array is allocated

    for (l=0;l<m;l++)
    {
        new (sys_d+l) solver<ode>(n);   // it doesn't work as intended
    }

    solver_kernel<<<1,m>>>(m,n,sys_d);   

    for (l=0;l<m;l++)
    {
        (sys_d+l)->~solver<ode>();    // it doesn't work as intended
    }
    cudaFree(sys_d);    //it works

    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::time_t end_time = std::chrono::system_clock::to_time_t(end);
    std::cout << "finished computation at " << std::ctime(&end_time) << "elapsed time: " << elapsed_seconds.count() << "s\n";

    return 0;
}

//end of file

Distinguish host-side and device-side memory

As the other answer also states:

  • GPU (global) memory you allocate with cudaMalloc() is not accessible by code running on the CPU; and
  • System memory (aka host memory) you allocate in plain C++ (with std::vector , with std::make_unique , with new , etc.) is not accessible by code running on the GPU.

So, you need to allocate both host-side and device-side memory. For a simple example of working with both, see the CUDA vectorAdd sample program.
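A minimal sketch of that pattern (untested here; error checking omitted for brevity): a host buffer, a device buffer, and explicit copies between the two.

```cuda
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 4;
    std::vector<float> x_host(n, 1.0f);    // host-side storage, usable by CPU code

    float *x_device = nullptr;
    cudaMalloc(&x_device, n * sizeof(float));               // device-side storage
    cudaMemcpy(x_device, x_host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                     // host -> device

    // ... launch kernels that read/write x_device ...

    cudaMemcpy(x_host.data(), x_device, n * sizeof(float),
               cudaMemcpyDeviceToHost);                     // device -> host
    cudaFree(x_device);
}
```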

(Actually, you can also make a special kind of allocation which is accessible from both the device and the host; this is Unified Memory. But let's ignore that for now, since we're dealing with the basics.)
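For reference, a managed-memory sketch would look like this (untested; assumes a GPU with Unified Memory support):

```cuda
#include <cuda_runtime.h>

int main()
{
    const int n = 4;
    float *d = nullptr;
    cudaMallocManaged(&d, n * sizeof(float));   // accessible from host AND device
    for (int i = 0; i < n; ++i) d[i] = 0.0f;    // legal on the host, unlike cudaMalloc'ed memory
    // my_kernel<<<blocks, threads>>>(d, n);    // and also legal on the device
    cudaDeviceSynchronize();                    // wait before the host touches d again
    cudaFree(d);
}
```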

Don't live in the kingdom of nouns

Specifically, I want to allocate memory on device for floats in class constructor and then deallocate it in destructor.

I'm not sure you really want to do that. You seem to be taking a more Java-esque approach, in which everything you do is noun-centric, i.e. classes are used for everything: you don't solve equations, you have an "equation solver"; you don't "do X", you have an "XDoer" class, etc. Why not just have a (templated) function which solves an ODE system, returning the solution? Are you using your "solver" in any other way?

(This point is inspired by Steve Yegge's blog post, Execution in the Kingdom of Nouns.)
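For instance, the RK4 stepper could be a free (templated) function rather than a method on a solver class; a rough, untested sketch, with illustrative names:

```cuda
// rhs(t, x, dxdt) evaluates the right-hand side of the ODE system into dxdt.
// k1..k4 and tmp are caller-provided scratch arrays of length n.
template <class RhsFn>
__host__ __device__ void rk4_step(RhsFn rhs, float t, float dt,
                                  const float *x, float *x_out,
                                  float *k1, float *k2, float *k3, float *k4,
                                  float *tmp, int n)
{
    rhs(t, x, k1);                                                 // stage 1
    for (int i = 0; i < n; ++i) tmp[i] = x[i] + 0.5f * dt * k1[i];
    rhs(t + 0.5f * dt, tmp, k2);                                   // stage 2
    for (int i = 0; i < n; ++i) tmp[i] = x[i] + 0.5f * dt * k2[i];
    rhs(t + 0.5f * dt, tmp, k3);                                   // stage 3
    for (int i = 0; i < n; ++i) tmp[i] = x[i] + dt * k3[i];
    rhs(t + dt, tmp, k4);                                          // stage 4
    for (int i = 0; i < n; ++i)
        x_out[i] = x[i] + (dt / 6.0f) * (k1[i] + 2.0f * k2[i] + 2.0f * k3[i] + k4[i]);
}
```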

Try to avoid allocating and de-allocating yourself

In well-written modern C++, we try to avoid direct, manual allocation of memory (that's a link to the C++ Core Guidelines, by the way). Now, it's true that you free your memory with the destructor, so it's not all that bad, but I'd really consider using std::unique_ptr on the host and something equivalent on the device (like cuda::memory::unique_ptr from my Modern-C++ CUDA API wrapper library, cuda-api-wrappers), or a GPU-oriented container class like thrust 's device vector.
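With thrust::device_vector, for example, the whole constructor/destructor pair from the question collapses into member declarations; a sketch (untested):

```cuda
#include <thrust/device_vector.h>

template <class ode_sys>
class solver : public ode_sys
{
public:
    // Each member owns its own device allocation (RAII).
    thrust::device_vector<float> x, xx, m0, m1, m2, m3;

    explicit solver(int n)
        : ode_sys(n), x(n), xx(n), m0(n), m1(n), m2(n), m3(n) {}
    // No destructor needed: each device_vector frees its device memory itself.

    // When a kernel needs a raw pointer, use thrust::raw_pointer_cast(x.data()).
};
```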

Check for errors

You really must check for errors after you call CUDA API functions, and this is doubly necessary after you launch a kernel. When you call C++ standard library code, it throws an exception on error; CUDA's runtime API is C-like and doesn't know about exceptions. It will just fail and set some error variable you need to check.

So, either you write error checks, like in the vectorAdd sample I linked to above, or you get some library to exhibit more standard-library-like behavior. cuda-api-wrappers and thrust will both do that, at different levels of abstraction; and so will other libraries/frameworks.
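If you go the manual route, a common pattern is a small checking macro wrapped around every runtime API call (a sketch; the macro name is illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&x, size));
//   solver_kernel<<<1, m>>>(m, n, sys_d);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution
```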

You need an array on the host side and one on the device side.

Initialize the host array, then copy it to the device array with cudaMemcpy . The destruction has to be done on the host side again.

An alternative would be to initialize the array from the device: you would need to put __device__ in front of your constructor, and then just use malloc .
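A sketch of that alternative (untested; note that in-kernel malloc draws from a limited device heap, which can be enlarged with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)):

```cuda
#include <cstdlib>

class ode
{
    float *d;
public:
    __device__ ode(int n)  { d = static_cast<float*>(malloc(n * sizeof(float))); }
    __device__ ~ode()      { free(d); }   // device-heap memory, freed on the device
};

__global__ void make_and_use(int n)
{
    ode sys(n);   // constructed and destroyed entirely in device code
    // ... use sys here ...
}
```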

You cannot dereference a pointer to device memory in host code:

__host__ ode(int n)
{
    cudaMalloc((void**)&nn,sizeof(int));
    *nn=n; // !!! ERROR
    cudaMalloc((void**)&eps,sizeof(float));
    size_t size=sizeof(float)*n;
    cudaMalloc((void**)&d,size);
}

You will have to copy the values with cudaMemcpy. (Or use the parameters of a __global__ function.)
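Applied to the constructor above, the fix might look like this (a sketch; error checking still omitted):

```cuda
__host__ ode(int n)
{
    cudaMalloc((void**)&nn, sizeof(int));
    cudaMemcpy(nn, &n, sizeof(int), cudaMemcpyHostToDevice); // instead of *nn = n
    cudaMalloc((void**)&eps, sizeof(float));
    cudaMalloc((void**)&d, sizeof(float) * n);
}
```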
