
How to perform basic operations (+ - * /) on GPU and store the result on it

I have the following line of code. gamma is a CPU variable that I will afterwards need to copy to the GPU, and gamma_x and delta are also stored on the CPU. Is there any way I can execute the following line and store its result directly on the GPU? So basically, host gamma, gamma_x and delta on the GPU and get the output of the following line on the GPU. It would speed up my code a lot for the lines after it. I tried with magma_dcopy but so far I couldn't find a way to make it work, because the output of magma_ddot is a CPU double.

gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;

The very short answer is no, you can't do this, or at least not if you use magma_ddot.

However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the CUBLAS function fully supports having the result of the operation stored in GPU memory rather than returned to the host.

In theory you could do something like this:

// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));

for (int i=....) { 
    // ...

    // magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
    // switch CUBLAS to device pointer mode so the dot product result
    // is written to GPU memory instead of being returned to the host
    cublasSetPointerMode(queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
    cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult);
    cudaDeviceSynchronize();
    // restore the default host pointer mode for any other CUBLAS/Magma calls
    cublasSetPointerMode(queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);

    // Now dotresult holds the magma_ddot result in device memory

    // ...

}

Note that this might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally, and how CUBLAS state and asynchronous operations are handled inside Magma is completely undocumented. Having said that, if you are careful, it should be OK.

To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple Thrust call with a lambda expression, depending on your preference; a rough sketch of the single-thread kernel option follows.
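For illustration only, a minimal sketch of the single-thread kernel could look like this. It assumes d_gamma and d_delta are device copies of gamma and delta (those names are not from the question), d_gamma_x is the device array already used above, and dotresult is the device pointer filled by the cublasDdot call:

// Illustrative only: d_gamma and d_delta are assumed device copies of
// gamma and delta; dotresult is the device result written by cublasDdot.
__global__ void compute_gamma(double* d_gamma,
                              const double* d_gamma_x,
                              const double* d_delta,
                              const double* dotresult,
                              int i)
{
    // one thread evaluates the whole expression entirely in device memory
    *d_gamma = -(d_gamma_x[i + 1] + *dotresult) / (*d_delta);
}

// launched with a single thread after the cublasDdot call inside the loop:
// compute_gamma<<<1, 1>>>(d_gamma, d_gamma_x, d_delta, dotresult, i);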
