
CUDA stream is slower than usual kernel

I am trying to understand CUDA streams and I have written my first program with streams, but it is slower than the usual kernel version...

Why is this code slower:

cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1);    
addKernel<<<count/100, 100, 0, stream_1>>>(pole_dev);
cudaMemcpyAsync(pole, pole_dev, size, cudaMemcpyDeviceToHost, stream_1);
cudaThreadSynchronize();  // I don't know the difference between cudaThreadSynchronize and cudaDeviceSynchronize
cudaDeviceSynchronize();  // they seem to behave about the same...

than this:

cudaMemcpy(pole_dev, pole, size, cudaMemcpyHostToDevice);
addKernel<<<count/100, 100>>>(pole_dev);
cudaMemcpy(pole, pole_dev, size, cudaMemcpyDeviceToHost);

I thought it should run faster ... the value of the variable count is 6 500 000 (the maximum) ... the first version takes 14 milliseconds and the second version takes 11 milliseconds.

Can anybody explain it to me, please?

In this snippet you are dealing with only a single stream (stream_1), but that is exactly what CUDA already does for you when you don't explicitly manipulate streams.

To take advantage of streams and asynchronous memory transfers, you need to use several streams and split your data and computations across them, as sketched below.
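For illustration, here is a minimal sketch of that idea, reusing the names from the question (pole, pole_dev, addKernel, count). It is not the poster's actual program: the kernel body, the int element type, the choice of 4 streams, and the pinned host allocation via cudaHostAlloc are all assumptions (the original allocation code isn't shown). Pinned memory matters because cudaMemcpyAsync only overlaps with kernel execution when the host buffer is page-locked.

#include <cuda_runtime.h>

__global__ void addKernel(int *data)   // placeholder body; the original kernel is not shown
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1;
}

int main()
{
    const int count = 6500000;              // divisible by the block size (100) and by nStreams
    const int nStreams = 4;                 // arbitrary choice for this sketch
    const int chunk = count / nStreams;
    const size_t chunkBytes = chunk * sizeof(int);

    int *pole, *pole_dev;
    cudaHostAlloc((void**)&pole, count * sizeof(int), cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void**)&pole_dev, count * sizeof(int));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Each stream copies its own chunk in, processes it, and copies it back;
    // the copy in one stream can overlap with the kernel running in another.
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(pole_dev + offset, pole + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addKernel<<<chunk / 100, 100, 0, streams[s]>>>(pole_dev + offset);
        cudaMemcpyAsync(pole + offset, pole_dev + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(pole_dev);
    cudaFreeHost(pole);
    return 0;
}

Whether this actually beats the single-stream version depends on the device (it needs copy/compute overlap support) and on how much work addKernel does per element; for a trivial kernel the transfers dominate either way.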
