
Cuda unified memory between gpu and host

I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. In order to keep the process asynchronous, I was hoping to use CUDA's UMA to have a memory buffer and flag in host memory (so both the GPU and the CPU can access it). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU waits for the flag to be set, copies things out of the buffer, and clears the flag. As far as I can see, this doesn't produce any race condition because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.

So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:

#include <stdio.h>

__global__
void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));

    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    //even numbers
    for(int i = 1; i <= n; ++i) {
        //wait for a change to odd from gpu
        while(*h_i == (2*(i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2*i;
    }

    return 0;
}

__global__
void uva_counting_test(int n, int *h_i) {
    //odd numbers
    for(int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while(*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
    }
}

For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it will hang as before. If I press Ctrl+C, it will bring me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it will freeze on the while() loop in the kernel again after each iteration, but I can keep pushing it forward with Ctrl+C and continue.

If there's a better way to accomplish what I'm trying to do, that would also be helpful.

You are describing a producer-consumer model, where the GPU is producing some data and from time to time the CPU will consume that data.

The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU; when it is ready to consume data (i.e. the while loop in your example) it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and does whatever it has to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
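A minimal sketch of that loop is below; the produce_items kernel, the batch size N, and NUM_BATCHES are hypothetical stand-ins for whatever your program actually produces:

#include <stdio.h>

#define N 1024          // items per batch (placeholder)
#define NUM_BATCHES 4   // number of batches to produce (placeholder)

// Hypothetical producer: fills the buffer with batch-tagged values.
__global__
void produce_items(int *buf, int batch) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) buf[idx] = batch * N + idx;
}

int main() {
    int *d_buf, *h_buf;
    cudaMalloc(&d_buf, N * sizeof(int));
    h_buf = (int *)malloc(N * sizeof(int));

    produce_items<<<(N + 255) / 256, 256>>>(d_buf, 0);  // first batch
    for (int batch = 0; batch < NUM_BATCHES; ++batch) {
        // cudaMemcpy on the default stream waits for the kernel to finish
        cudaMemcpy(h_buf, d_buf, N * sizeof(int), cudaMemcpyDeviceToHost);
        // start producing the next batch before consuming this one,
        // so the GPU works while the CPU processes the copy
        if (batch + 1 < NUM_BATCHES)
            produce_items<<<(N + 255) / 256, 256>>>(d_buf, batch + 1);
        printf("batch %d, first item: %d\n", batch, h_buf[0]);  // "consume"
    }

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}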

That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between the buffers, copying one back to the CPU while the GPU fills the other. That assumes the copy-back is faster than the production, but if not then you will saturate the copy bandwidth, which is also good.
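A sketch of that double-buffered variant using two streams, reusing the hypothetical produce_items kernel and placeholder sizes from the previous sketch; the pinned host buffers (cudaMallocHost) are what let the asynchronous copies actually overlap with kernel execution:

int main() {
    int *d_buf[2], *h_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], N * sizeof(int));
        cudaMallocHost(&h_buf[s], N * sizeof(int));  // pinned, needed for async copies
        cudaStreamCreate(&stream[s]);
    }

    for (int batch = 0; batch < NUM_BATCHES; ++batch) {
        int cur = batch % 2;
        // queue production and copy-back for this batch on its own stream
        produce_items<<<(N + 255) / 256, 256, 0, stream[cur]>>>(d_buf[cur], batch);
        cudaMemcpyAsync(h_buf[cur], d_buf[cur], N * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[cur]);
        if (batch > 0) {
            int prev = 1 - cur;
            cudaStreamSynchronize(stream[prev]);  // previous batch is now on the host
            printf("batch %d, first item: %d\n", batch - 1, h_buf[prev][0]);
        }
    }
    int last = (NUM_BATCHES - 1) % 2;
    cudaStreamSynchronize(stream[last]);
    printf("batch %d, first item: %d\n", NUM_BATCHES - 1, h_buf[last][0]);
    return 0;
}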

Neither of those is what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that, since you will need to manage your buffer size carefully and you will need to think carefully about the timings and communication issues. It's certainly possible to do something like that, but before you explore that direction you should read up on memory fences, atomic operations, and volatile.

I'd try adding

__threadfence_system();

after

*h_i = 2*i + 1;

See here for details. Without it, it's entirely possible for the modification to stay in the GPU cache forever. However, you would do better to listen to the other answer: to extend this to multiple threads/blocks you have to deal with other "problems" to get a similar scheme to work reliably.
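Applied to the kernel from the question, the change would look like the sketch below. Marking the pointer volatile, so each access goes to memory rather than a register, is an extra precaution beyond the fence itself:

__global__
void uva_counting_test(int n, volatile int *h_i) {
    //odd numbers
    for(int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while(*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
        __threadfence_system();  // push the write out past the GPU caches
    }
}

The host side has the symmetric problem: with optimisation enabled the compiler may hoist *h_i out of the CPU's spin loop, so reading the flag through a volatile int * on the host as well is a sensible precaution.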

As Tom suggested (+1), it's better to use double buffering. Streams help a lot with such a scheme, as you can find depicted here.
