

Cuda kernel time measurement with CudaEventElapsedTime

I've got an NVS 5400M and I'm trying to get reliable time measurement results for CUDA addition on a matrix (1000 x 1000 instance).

__global__ void MatAdd(int** A, int** B, int** C)
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

And I'm doing the measurement like this:

int numBlocks = 1;
dim3 threadsPerBlock(1000, 1000);

float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

MatAdd <<<numBlocks, threadsPerBlock>>>(pA, pB, pC);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

cout << setprecision(10) << "GPU Time [ms] " << time << endl;

and the result is: 0.001504000043 ms, which is relatively small. My question is: am I doing it right?

Your timing is correct, but your usage of CUDA in general is not.

This is illegal:

dim3 threadsPerBlock(1000, 1000);

CUDA kernels are limited to a maximum of 1024 threads per block, but you are requesting 1000x1000 = 1,000,000 threads per block.

As a result, your kernel is not actually launching:

MatAdd <<<numBlocks, threadsPerBlock>>>(pA, pB, pC);

And so the measured time is quite short.
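For reference, a legal launch configuration for a 1000 x 1000 problem keeps each block at or below the 1024-thread limit and covers the matrix with a grid of blocks. The sketch below is one possible configuration (32 x 32 threads per block, grid size rounded up); the variable N is illustrative, and it assumes the kernel is also updated to compute global indices from blockIdx, as discussed further down.

// One possible (not the only) legal configuration for N = 1000:
// 32 x 32 = 1024 threads per block, and enough blocks to cover the matrix.
const int N = 1000;
dim3 threadsPerBlock(32, 32);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

MatAdd<<<numBlocks, threadsPerBlock>>>(pA, pB, pC);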

You are advised to use proper cuda error checking and to run your tests with cuda-memcheck to make sure there are no reported runtime errors (my guess is that right now you are not even aware of the errors being reported from your code - you have to check for them).
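As an illustration of what that error checking might look like, here is a minimal sketch of a common pattern (the cudaCheck macro name is just an example, not code from the question):

#include <cstdio>
#include <cstdlib>

// Minimal error-checking helper; wrap every CUDA runtime call with it.
#define cudaCheck(call)                                                   \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// In host code, after the kernel launch:
MatAdd<<<numBlocks, threadsPerBlock>>>(pA, pB, pC);
cudaCheck(cudaGetLastError());       // catches invalid launch configurations
cudaCheck(cudaDeviceSynchronize());  // catches errors during kernel execution

With a 1000x1000 block configuration, the cudaGetLastError() check would report an invalid configuration argument immediately after the launch.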

Since you haven't shown a complete code, I'm not going to try to identify all other issues that may be present, but your kernel code would have to be refactored in order to handle a 1000x1000 array properly, and passing double-pointer (e.g. int** A) parameters to kernels is considerably more difficult than a single pointer or "flat" array.
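For completeness, here is a minimal sketch of what such a refactor might look like, assuming the matrices are stored as flat, row-major int arrays on the device (MatAddFlat and the parameter N are illustrative names, not from the question). It would be launched with a 2D grid of 2D blocks like the configuration sketched above.

// Sketch: element-wise addition on a flat, row-major N x N matrix.
__global__ void MatAddFlat(const int* A, const int* B, int* C, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)   // guard against the rounded-up grid
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}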
