CUDA device-to-host copy is very slow

Question

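I am running Windows 7 64-bit, CUDA 4.2, and Visual Studio 2010.

First, I run some code on the GPU and download the data back to the host. Then I do some processing on the host and move the data back to the device. After that, I perform the following device-to-host copy, which runs very fast, in about 1 ms.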

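#include <ctime>
#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
using namespace std;

clock_t start, end;
int count = 1000000;
thrust::host_vector<int> h_a(count);
thrust::device_vector<int> d_b(count, 0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());  // device-to-host copy
end = clock();
cout << "Time Spent:" << end - start << endl;  // clock ticks; milliseconds with MSVC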

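It takes about 1 ms to complete.

Then I run some other code on the GPU, mainly atomic operations. After that, I copy the data from the device to the host again, and this time it takes a very long time, about 9 s.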

__global__ void dosomething(int *d_bPtr)
{
    // ....
    atomicExch(d_bPtr, c);
    // ....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

About 9 s.

For example, I ran the code several times:

int i = 0;
while (i < 10)
{
    clock_t start, end;
    int count = 1000000;
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
    start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    end = clock();
    cout << "Time Spent:" << end - start << endl;

    // Launch the dosomething kernel (the one that calls atomicExch).
    // The original post does not give the launch configuration,
    // so grid and block are placeholders here.
    dosomething<<<grid, block>>>(d_bPtr);

    start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    end = clock();
    cout << "Time Spent:" << end - start << endl;
    i++;
}

The results are almost the same on every iteration.
What could be the problem?

Thanks!

Answer 1

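The problem is one of timing, not any change in copy performance. Kernel launches in CUDA are asynchronous, so what you are measuring is not only the time taken by thrust::copy but also the time the previously launched kernel needs to finish. If you change the code that times the copy to something like this: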

cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

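you should find that the transfer time returns to its previous performance. So your real question is not "why is thrust::copy slow" but "why is my kernel slow". Based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialize the kernel's memory transactions".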
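As a rough sketch of how the two costs could be measured separately, CUDA events can bracket the kernel and the copy independently. Everything named here is an assumption for illustration and not the poster's code: the kernel `contended_exchange`, the `Timings` struct, the helper `time_kernel_and_copy`, and the 256x256 launch configuration. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical stand-in for the poster's kernel: every thread calls
// atomicExch on the same address, so the updates are serialized.
__global__ void contended_exchange(int *target, int rounds)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int r = 0; r < rounds; ++r) {
        atomicExch(target, tid + r);
    }
}

// Hypothetical result type: elapsed milliseconds for each phase.
struct Timings {
    float kernel_ms;
    float copy_ms;
};

Timings time_kernel_and_copy(int count, int rounds)
{
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);

    cudaEvent_t start, after_kernel, after_copy;
    cudaEventCreate(&start);
    cudaEventCreate(&after_kernel);
    cudaEventCreate(&after_copy);

    // Time the kernel on its own. The launch returns immediately,
    // so wait on the event recorded after it before moving on.
    cudaEventRecord(start);
    contended_exchange<<<256, 256>>>(d_bPtr, rounds);
    cudaEventRecord(after_kernel);
    cudaEventSynchronize(after_kernel);

    // The kernel has finished, so this interval covers only the copy.
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    cudaEventRecord(after_copy);
    cudaEventSynchronize(after_copy);

    Timings t;
    cudaEventElapsedTime(&t.kernel_ms, start, after_kernel);
    cudaEventElapsedTime(&t.copy_ms, after_kernel, after_copy);

    cudaEventDestroy(start);
    cudaEventDestroy(after_kernel);
    cudaEventDestroy(after_copy);
    return t;
}
```

Calling `time_kernel_and_copy(1000000, 100)` from `main` and printing both fields should show whether the time is going to the kernel or to the transfer. Raising `rounds` would be expected to increase `kernel_ms` while leaving `copy_ms` roughly unchanged.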

Answer 2

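I suggest you use cudpp, which I think is faster than Thrust (I am writing a master's thesis on optimization and have tried both libraries). If the copy is very slow, you could try writing your own kernel to copy the data.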
