Copying from GPU to CPU is slower than copying from CPU to GPU

Question

I started learning CUDA a while ago and I have the following problem.

Here is what I am doing:

Copy to the GPU

int* B;
// ...
int *dev_B;    
//initialize B=0

cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int),cudaMemcpyHostToDevice);
//...

//Execute on GPU the following function which is supposed to fill in 
//the dev_B matrix with integers


findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);

Copy back to the CPU

cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int),cudaMemcpyDeviceToHost);

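Copying the array B to dev_B takes only a fraction of a second. However, copying the array dev_B back to B takes a very long time.

The findNeiborElem function involves a loop per thread; for example, it looks like this: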

__global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC){

    int tid=threadIdx.x + blockIdx.x * blockDim.x;
    while (tid<dev_Nel[0]){
        for (int j=1;j<=Nel;j++){
            // do some calculations
            B[ind(tid,1,Nel)]=j; // j in most cases does not go all the way to Nel
            break;
        }
        tid += blockDim.x * gridDim.x;
    }
}

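What is very strange is that the time to copy dev_B back to B is proportional to the number of iterations of the j index.

For example, if Nel=5 the time is approximately 5 sec.

When I increase to Nel=20 the time is about 20 sec.

I would expect the copy time to be independent of the inner iterations needed to assign the values of the matrix dev_B.

I would also expect the time to copy the same matrix from the CPU and back to the CPU to be of the same order of magnitude.

Do you have any idea what is wrong?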

Answer 1

You should use events rather than clock() to measure time.

With events you would have something like this:

  cudaEvent_t start, stop;   // variables that holds 2 events 
  float time;                // Variable that will hold the time
  cudaEventCreate(&start);   // creating the event 1
  cudaEventCreate(&stop);    // creating the event 2
  cudaEventRecord(start, 0); // start measuring  the time

  // What you want to measure
  cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
  cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int),cudaMemcpyHostToDevice);

  cudaEventRecord(stop, 0);                  // Stop time measuring
  cudaEventSynchronize(stop);               // Wait until the completion of all device 
                                            // work preceding the most recent call to cudaEventRecord()

  cudaEventElapsedTime(&time, start, stop); // Saving the time measured

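Edit: additional information:

“Kernel launches return control to the CPU thread before they complete. Therefore your timing construct is measuring the kernel execution time as well as the second memcpy. When you time the copy after the kernel, your timer code executes immediately, but the cudaMemcpy waits for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary with the kernel loop iterations. It also explains why the time spent on your kernel function is 'negligible'.” Credit to Robert Crovella.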
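The quoted explanation can be checked by timing the kernel and the copy back as two separate intervals. The following is only a minimal sketch under stated assumptions, not the asker's code: `busyKernel`, `Timings`, and `timeKernelAndCopy` are hypothetical names, and the kernel simply loops `iters` times per element as a stand-in for findNeiborElem. An extra event recorded right after the kernel launch marks the point where the kernel has finished, so the copy interval no longer includes kernel execution.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for findNeiborElem: each thread loops `iters`
// times per element, so kernel time should grow with `iters`.
__global__ void busyKernel(int *out, int n, int iters) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        int acc = 0;
        for (int j = 1; j <= iters; j++) {
            acc += j % 7;
        }
        out[tid] = acc;
        tid += blockDim.x * gridDim.x;
    }
}

struct Timings {
    float kernelMs;  // time between launch and kernel completion
    float copyMs;    // time of the device-to-host copy alone
};

// Times the kernel and the device-to-host copy separately.
// Both fields stay at -1 if a CUDA call fails.
Timings timeKernelAndCopy(int n, int iters) {
    Timings t = {-1.0f, -1.0f};
    std::vector<int> host(n, 0);
    int *dev = nullptr;
    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess) return t;

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start, 0);
    busyKernel<<<64, 128>>>(dev, n, iters);   // returns to the host immediately
    cudaEventRecord(afterKernel, 0);          // completes once the kernel is done
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy, 0);
    cudaEventSynchronize(afterCopy);          // wait for all recorded work

    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&t.kernelMs, start, afterKernel);
        cudaEventElapsedTime(&t.copyMs, afterKernel, afterCopy);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(dev);
    return t;
}
```

If the explanation holds, comparing `timeKernelAndCopy(n, 5)` with `timeKernelAndCopy(n, 20)` should show `kernelMs` changing with the iteration count while `copyMs` stays roughly the same. When a host-side timer is preferred, calling cudaDeviceSynchronize() after the kernel launch and before starting the timer serves the same purpose.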

Answer 2

As for your second question:

 B[ind(tid,1,Nel)]=j// j in most cases do no go all the way to the Nel reach

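When computing on the GPU, for synchronization reasons, a thread that has finished its work performs no further computation until all threads in the same work group have finished.

In other words, the time needed to perform this computation will be the worst-case time; it does not matter that most threads do not go all the way through the loop.

I am not sure about your first question. How are you measuring the time? I am not very familiar with CUDA, but I think that when copying from CPU to GPU the implementation buffers your data, hiding the actual time involved.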
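The worst-case claim can be probed with a small experiment. This is a rough sketch under stated assumptions, not code from the question: `unevenKernel` and `timeUnevenKernel` are hypothetical names, the "work group" is taken to mean a CUDA thread block, and only thread 0 of each block runs the long loop while all other threads run the short one.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 of every block loops longIters times,
// every other thread loops shortIters times.
__global__ void unevenKernel(int *out, int shortIters, int longIters) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int iters = (threadIdx.x == 0) ? longIters : shortIters;
    int acc = 0;
    for (int j = 1; j <= iters; j++) {
        acc += j % 7;
    }
    out[tid] = acc;
}

// Returns the kernel time in milliseconds, or -1 if a CUDA call fails.
float timeUnevenKernel(int shortIters, int longIters) {
    const int blocks = 64;
    const int threads = 128;
    float ms = -1.0f;
    int *dev = nullptr;
    if (cudaMalloc((void **)&dev, blocks * threads * sizeof(int)) != cudaSuccess) {
        return ms;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    unevenKernel<<<blocks, threads>>>(dev, shortIters, longIters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}
```

After one warm-up call, comparing `timeUnevenKernel(10, 100000)` against `timeUnevenKernel(10, 10)` and `timeUnevenKernel(100000, 100000)` gives a rough indication: if the claim holds, the mixed case should land closer to the all-long case than to the all-short one. Actual numbers depend on the device and on compiler optimizations.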


Answer 1
3 (accepted) 2012-11-12 15:01:49

Answer 2
1 2012-11-12 09:52:53
