
CUDA stream is slower than usual kernel

I am trying to understand CUDA streams and I have written my first program with streams, but it is slower than the usual kernel version...

Why is this code slower:

cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1);    
addKernel<<<count/100, 100, 0, stream_1>>>(pole_dev);
cudaMemcpyAsync(pole, pole_dev, size, cudaMemcpyDeviceToHost, stream_1);
cudaThreadSynchronize();  // I don't know the difference between cudaThreadSynchronize and cudaDeviceSynchronize
cudaDeviceSynchronize();  // they seem to behave about the same...

than this:

cudaMemcpy(pole_dev, pole, size, cudaMemcpyHostToDevice);
addKernel<<<count/100, 100>>>(pole_dev);
cudaMemcpy(pole, pole_dev, size, cudaMemcpyDeviceToHost);

I thought it should run faster ... the value of the variable count is 6 500 000 (the maximum) ... the first version takes 14 milliseconds and the second version takes 11 milliseconds.

Can anybody explain it to me, please?

In this snippet you are dealing with only a single stream (stream_1), but that is exactly what CUDA already does for you when you don't explicitly manipulate streams.

To take advantage of streams and asynchronous memory transfers, you need to use several streams and split your data and computations across them, as sketched below.
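For illustration, here is a minimal sketch of that idea, reusing the names from the question (pole, pole_dev, addKernel, count). It is not the poster's actual program: the kernel body, the int element type, the choice of 4 streams, and the pinned host allocation via cudaHostAlloc are all assumptions (the original allocation code isn't shown). Pinned memory matters because cudaMemcpyAsync only overlaps with kernel execution when the host buffer is page-locked.

#include <cuda_runtime.h>

__global__ void addKernel(int *data)   // placeholder body; the original kernel is not shown
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1;
}

int main()
{
    const int count = 6500000;              // divisible by the block size (100) and by nStreams
    const int nStreams = 4;                 // arbitrary choice for this sketch
    const int chunk = count / nStreams;
    const size_t chunkBytes = chunk * sizeof(int);

    int *pole, *pole_dev;
    cudaHostAlloc((void**)&pole, count * sizeof(int), cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void**)&pole_dev, count * sizeof(int));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Each stream copies its own chunk in, processes it, and copies it back;
    // the copy in one stream can overlap with the kernel running in another.
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(pole_dev + offset, pole + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addKernel<<<chunk / 100, 100, 0, streams[s]>>>(pole_dev + offset);
        cudaMemcpyAsync(pole + offset, pole_dev + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(pole_dev);
    cudaFreeHost(pole);
    return 0;
}

Whether this actually beats the single-stream version depends on the device (it needs copy/compute overlap support) and on how much work addKernel does per element; for a trivial kernel the transfers dominate either way.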
