
Are functions in CUDA thrust library synchronized implicitly?

I ran into some problems when using functions from the thrust library, and I am not sure whether I should add cudaDeviceSynchronize manually before them. For example,

double dt = 0;
kernel_1<<<blocks, threads>>>(it);
dt = *(thrust::max_element(it, it + 10));
printf("%f\n", dt);

Since kernel_1 is non-blocking, the host will execute the next statement. The problem is that I am not sure whether thrust::max_element is blocking. If it is blocking, then it works well; otherwise, will the host skip it and execute the printf statement?

Thanks

Your code is broken in at least 2 ways.

  1. it is presumably a device pointer:

     kernel_1<<<blocks, threads>>>(it); ^^ 

    It is not allowed to use a raw device pointer as an argument to a thrust algorithm:

     dt = *(thrust::max_element(it, it + 10)); ^^ 

    unless you wrap that pointer in a thrust::device_ptr or else pass the thrust::device execution policy explicitly as an argument to the algorithm. Without either of these clues, thrust will dispatch the host code path (which will probably seg fault), as discussed in the thrust quick start guide.

  2. If you fix the above item using either thrust::device_ptr or thrust::device, then thrust::max_element will return an iterator of a type consistent with the iterators passed to it. If you pass a thrust::device_ptr, it will return a thrust::device_ptr. If you use thrust::device with your raw pointer, it will return a raw pointer. In either case, it is illegal to dereference such an iterator in host code:

     dt = *(thrust::max_element(it, it + 10)); ^ 

    again, I would expect such usage to seg fault.
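A minimal corrected sketch along these lines, using a thrust::device_vector so the algorithm receives proper device iterators (the body of kernel_1 is hypothetical; only the thrust usage matters):

```cuda
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <cstdio>

// Hypothetical stand-in for the kernel in the question.
__global__ void kernel_1(double *it)
{
    int i = threadIdx.x;
    if (i < 10) it[i] = (double)i;
}

int main()
{
    thrust::device_vector<double> v(10);
    // Pass the raw pointer to the kernel...
    kernel_1<<<1, 32>>>(thrust::raw_pointer_cast(v.data()));
    // ...but give thrust proper device iterators, so it dispatches
    // the device code path. No cudaDeviceSynchronize is needed here:
    // the default stream orders kernel_1 before the thrust work.
    thrust::device_vector<double>::iterator max_it =
        thrust::max_element(v.begin(), v.end());
    // Copy the value to the host via element access rather than
    // dereferencing the returned iterator directly.
    double dt = v[max_it - v.begin()];
    printf("%f\n", dt);
    return 0;
}
```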

Regarding asynchrony, it is safe to assume that all thrust algorithms that return a scalar quantity stored in a stack variable are synchronous. That means the CPU thread will not proceed beyond the thrust call until the stack variable has been populated with the correct value.
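A small sketch of what that guarantee buys you: thrust::reduce returns its scalar result into a host stack variable, so the call itself blocks until the device work is done, and no explicit synchronization is required before using the result:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    thrust::device_vector<int> d(1000, 1);
    // By the time 'sum' is assigned, all device work for the
    // reduction has completed; no cudaDeviceSynchronize needed.
    int sum = thrust::reduce(d.begin(), d.end(), 0);
    printf("%d\n", sum); // 1000
    return 0;
}
```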

Regarding GPU activity in general, unless you use streams, all GPU activity is issued to the same (default) stream. This means that all CUDA activity will be executed in order, and a given CUDA operation will not begin until the preceding CUDA activity is complete. Therefore, even though your kernel launch is asynchronous and the CPU thread will proceed on to the thrust::max_element call, any CUDA activity spawned from that call will not begin executing until the previous kernel launch is complete. As a result, any changes made by kernel_1 to the data referenced by it should be finished and completely valid before any CUDA processing in thrust::max_element begins. And as we've seen, thrust::max_element itself will insert synchronization.

So once you fix the defects in your code, there should not be any requirement to insert additional synchronization anywhere.

This function does not seem to be async.

Both of these pages explain the behaviour of max_element(), and they do not describe it as async, so I would assume it is blocking:

Since it uses an iterator to traverse all the elements and find the maximum value, I cannot see how it could be async.

You can still use cudaDeviceSynchronize to try it for real, but do not forget to set the corresponding flag on your device.
