
CUDA: Does passing arguments to a kernel slow the kernel launch much?

CUDA beginner here.

In my code I am currently launching kernels many times in a loop in the host code (because I need synchronization between blocks), so I wondered whether I might be able to optimize the kernel launch.

My kernel launches look something like this:

MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x);

So to launch a kernel some signal obviously has to go from the CPU to the GPU, but I'm wondering whether passing the arguments makes this process noticeably slower.

The arguments to the kernel are the same every single time, so perhaps I could save time by copying them once and accessing them in the kernel by a name defined as

__device__ int N;
<and somehow (how?) copy the value to this name N on the GPU once>

and simply launch the kernel with no arguments, as such:

MyKernel<<<blocks,threadsperblock>>>();

Will this make my program any faster? What is the best way of doing this? AFAIK the arguments are stored in some constant global memory. How can I make sure that the manually transferred values are stored in memory which is as fast or faster?

Thanks in advance for any help.

I would expect the benefits of such an optimization to be rather small. On sane platforms (i.e. anything other than WDDM), kernel launch overhead is only of the order of 10-20 microseconds, so there probably isn't a lot of scope for improvement.

Having said that, if you want to try, the logical way to do this is using constant memory. Define each argument as a `__constant__` symbol at translation unit scope, then use the `cudaMemcpyToSymbol` function to copy values from the host to device constant memory.
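A minimal sketch of that approach, assuming the kernel's body is elided (names like `d_N` and `d_x` are illustrative, not from the question):

```cuda
#include <cstdio>

// Kernel arguments defined as constant-memory symbols at
// translation unit scope instead of being passed per launch.
__constant__ int    d_N;
__constant__ double d_x;

__global__ void MyKernel()
{
    // The kernel reads d_N and d_x directly rather than taking parameters.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_N) {
        // ... work using d_x ...
    }
}

int main()
{
    int    N = 1 << 20;
    double x = 3.14;

    // Copy the values to device constant memory once, before the launch loop.
    cudaMemcpyToSymbol(d_N, &N, sizeof(N));
    cudaMemcpyToSymbol(d_x, &x, sizeof(x));

    for (int iter = 0; iter < 1000; ++iter) {
        MyKernel<<<4096, 256>>>();   // no argument list needed
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Note that `cudaMemcpyToSymbol` only needs to be called again if a value changes between launches.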

Simple answer: no.

To be more elaborate: you need to send some signal from the host to the GPU anyway to launch the kernel itself. At that point, a few more bytes of parameter data do not matter anymore.
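If you want to verify this on your own hardware, a rough measurement sketch using CUDA events follows; the empty kernels and the launch count are illustrative, and the absolute numbers will vary by platform and driver:

```cuda
#include <cstdio>

__global__ void EmptyKernelNoArgs() {}
__global__ void EmptyKernelArgs(double *a, double *b, int N, double x) {}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int launches = 10000;
    float ms;

    // Time many back-to-back launches with no arguments.
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        EmptyKernelNoArgs<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("no args:   %f us per launch\n", 1000.0f * ms / launches);

    // Time launches passing the same argument list as the question.
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        EmptyKernelArgs<<<1, 1>>>(nullptr, nullptr, 0, 0.0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("with args: %f us per launch\n", 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Comparing the two per-launch figures shows how much (or how little) of the overhead is attributable to argument passing.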
