CUDA shows an "Invalid Argument" error for matrix multiplication repeated N times

Question

I am trying to multiply matrix A with matrix B, N times. I use a kernel for the matrix multiplication and streams to run the multiplication N times. I have 3 conditions to test consecutively. The 1st condition runs successfully.

I don't know why it reports an "Invalid Argument" error in the iteration for the second condition. I am guessing that I am not cleaning up my memory properly. I have done my best to free all host and device variables. I also tried a CUDA device reset, but nothing helps. Can anyone help me debug this?

Here is the relevant portion of my code:

int main(){
    
    
    for (int i = 0; i < 3; i++) {
        
      
      for (int ind = 0; ind < itr; ind++){
          cudaStreamCreate(&(stream[ind]));
      }
      cudaCheckErrors("cudaStreamCreate fail");

      for (int ind = 0; ind < itr; ind++){
          cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind]);
      }
      cudaDeviceSynchronize();

      for (int ind = 0; ind < itr; ind++){
          // Launch our kernel
          matrixMul<<<BLOCKS, THREADS, 0, stream[ind]>>>(d_a[ind], b, d_c[ind], M, K, N);
      }
      cudaDeviceSynchronize();
      cudaCheckErrors("kernel fail");

      for (int ind = 0; ind < itr; ind++){
          cudaMemcpyAsync(h_c[ind], d_c[ind], bytes_c, cudaMemcpyDeviceToHost, stream[ind]);
      }

      for (int ind = 0; ind < itr; ind++){
          cudaStreamSynchronize(stream[ind]);
      }
        
      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );

      cudaEventDestroy( start );
      cudaEventDestroy( stop);

      // Free allocated memory ****The issue was here.******
      cudaFreeHost(h_a);
      cudaFree(b);
      cudaFreeHost(h_c);
      cudaFree(d_a);
      cudaFree(d_c);
      cudaDeviceReset();
    }

    return 0;
}

In the second iteration I was getting this error:

Fatal error: cudaStreamCreate fail (invalid argument at /tmp/tmpwgpzgk9m/73a7502c-7662-4e80-804e-4debff15dc45.cu:140)
*** FAILED - ABORTING

Solved:

The error was caused by a memory leak. I was allocating an array of pointers but only freeing the first one. As suggested in Robert's answer below, the memory should be freed for each index of the array. Also, please always use proper error checking in CUDA.

Answer 1

Suggestion: Implement proper CUDA error checking. Use it on every CUDA call. Your haphazard use of the error checking macro produces confusing output that seems to suggest a problem with stream creation.

That is not the case. The invalid argument error arises from your freeing operations at the end of the loop. You have a number of errors:

You don't use cudaFreeHost on a pointer returned by malloc, or on a pointer that is actually a stack array.
You don't use cudaFree on a pointer that is actually a stack array.
If you have done allocations in a loop, you are likely going to have to do the free operations in a loop.
Even with your use of cudaDeviceReset (which frees all device allocations anyway), you have a memory leak because of improper freeing of the malloc allocations.
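To see why the report lands on stream creation, here is a minimal sketch of the mechanism. It assumes the cudaCheckErrors macro (not shown in the question) is built on cudaGetLastError, and it uses a hypothetical 4-element stack array in place of the question's d_a. The invalid cudaFree should record an error that nobody reads, and the next check that calls cudaGetLastError picks it up, even though the call right before that check was fine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns what cudaGetLastError() reports right after a cudaStreamCreate
// call that follows an invalid cudaFree.
cudaError_t lastErrorAfterBadFree() {
    // Hypothetical stand-in for the question's d_a: a stack array of device pointers.
    float *d_a[4] = {nullptr, nullptr, nullptr, nullptr};

    // Invalid: d_a is a host stack array, not a pointer returned by cudaMalloc.
    // The return value is deliberately not acted on, as in the question.
    cudaError_t freeStatus = cudaFree(d_a);
    printf("cudaFree(d_a):      %s\n", cudaGetErrorString(freeStatus));

    // A valid call made afterwards.
    cudaStream_t s;
    cudaError_t createStatus = cudaStreamCreate(&s);
    printf("cudaStreamCreate:   %s\n", cudaGetErrorString(createStatus));

    // Reads (and clears) the most recently recorded runtime error.
    cudaError_t last = cudaGetLastError();
    printf("cudaGetLastError(): %s\n", cudaGetErrorString(last));

    if (createStatus == cudaSuccess) cudaStreamDestroy(s);
    return last;
}

int main() {
    lastErrorAfterBadFree();
    return 0;
}
```

Checking the return value of each call individually, instead of reading the last error at a few scattered points, attributes the failure to the call that actually caused it.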

By modifying the end of your code as follows:

  ...
  cudaEventDestroy( start );
  cudaEventDestroy( stop);

  for (int ind = 0; ind < itr; ind++){
      free(h_a[ind]);
      free(h_c[ind]);
      cudaFree(d_a[ind]);
      cudaFree(d_c[ind]);
  }
  // Free allocated memory
  cudaFree(b);
  cudaDeviceReset();
}
...

I was able to make the above errors disappear.
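For reference, the pattern of checking every call and pairing each allocation with its matching free could be sketched as below. The allocation code is not shown in the question, so this is an assumption: host buffers come from calloc (so they are released with free), device buffers come from cudaMalloc (so they are released with cudaFree), and both live in stack arrays of pointers that are never themselves passed to a free function. CUDA_CHECK, runMatchedAllocFree, and the small sizes are hypothetical names and values chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): report file/line and
// abort as soon as any CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Allocates and frees per-index buffers `reps` times; returns 0 on success.
int runMatchedAllocFree(int reps) {
    constexpr int itr = 4;                       // assumed small count
    const size_t bytes_a = 16 * sizeof(float);   // assumed buffer size
    float *h_a[itr];   // stack array of host pointers
    float *d_a[itr];   // stack array of device pointers

    for (int rep = 0; rep < reps; rep++) {
        for (int ind = 0; ind < itr; ind++) {
            h_a[ind] = (float *)calloc(16, sizeof(float));
            if (h_a[ind] == NULL) return 1;
            CUDA_CHECK(cudaMalloc(&d_a[ind], bytes_a));
            CUDA_CHECK(cudaMemcpy(d_a[ind], h_a[ind], bytes_a,
                                  cudaMemcpyHostToDevice));
        }
        // Allocated in a loop, so freed in a loop, each with its matching call.
        for (int ind = 0; ind < itr; ind++) {
            free(h_a[ind]);                  // calloc/malloc -> free
            CUDA_CHECK(cudaFree(d_a[ind]));  // cudaMalloc    -> cudaFree
        }
    }
    return 0;
}

int main() {
    int rc = runMatchedAllocFree(3);
    printf("result: %d\n", rc);
    return rc;
}
```

If the host buffers were allocated with cudaMallocHost instead, the matching release would be cudaFreeHost on each element.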

As an aside, it should not be necessary to create 5000 streams, but it appears to work so I'll leave it at that. I would normally advise stream reuse.

Stream reuse could look something like this. Instead of creating 5000 streams, pick a smaller number, like 5 (the exact number shouldn't matter much here; it's likely that anything in the range of 3 or greater will behave similarly).

Create that many streams:

  const int max_streams = 5;
  for (int ind = 0; ind < max_streams; ind++){
      cudaStreamCreate(&(stream[ind]));
  }

When it comes to using the streams, use modulo arithmetic to "rotate" through the streams:

  for (int ind = 0; ind < itr; ind++){
      cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind%max_streams]);
  }
  cudaDeviceSynchronize();

  for (int ind = 0; ind < itr; ind++){
      // Launch our kernel
      matrixMul<<<BLOCKS, THREADS, 0, stream[ind%max_streams]>>>(d_a[ind], b, d_c[ind], M, K, N);
  }
  cudaDeviceSynchronize();
  ...
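Putting the rotation together end to end, a self-contained sketch might look like the following. Several things here are assumptions rather than the original code: scaleByTwo is a hypothetical stand-in for matrixMul (which is not shown), host buffers are pinned with cudaMallocHost so the async copies can overlap, and each job issues its copy, kernel, and copy back into one stream instead of using three separate loops with device-wide synchronization. It also destroys the streams at the end, which the snippets above leave out.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA runtime failure.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical stand-in for matrixMul: doubles each element in place.
__global__ void scaleByTwo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs `itr` independent jobs over `max_streams` reused streams.
// Returns the number of output elements that differ from the expected value.
int runWithStreamReuse(int itr, int max_streams) {
    const int n = 256;
    const size_t bytes = n * sizeof(float);
    cudaStream_t *stream =
        (cudaStream_t *)malloc(max_streams * sizeof(cudaStream_t));
    float **h = (float **)malloc(itr * sizeof(float *));
    float **d = (float **)malloc(itr * sizeof(float *));
    if (!stream || !h || !d) exit(EXIT_FAILURE);

    for (int s = 0; s < max_streams; s++)
        CUDA_CHECK(cudaStreamCreate(&stream[s]));

    for (int ind = 0; ind < itr; ind++) {
        CUDA_CHECK(cudaMallocHost(&h[ind], bytes));  // pinned host memory
        CUDA_CHECK(cudaMalloc(&d[ind], bytes));
        for (int i = 0; i < n; i++) h[ind][i] = (float)ind;
    }

    for (int ind = 0; ind < itr; ind++) {
        cudaStream_t s = stream[ind % max_streams];  // rotate through streams
        CUDA_CHECK(cudaMemcpyAsync(d[ind], h[ind], bytes,
                                   cudaMemcpyHostToDevice, s));
        scaleByTwo<<<(n + 127) / 128, 128, 0, s>>>(d[ind], n);
        CUDA_CHECK(cudaGetLastError());              // catch launch errors
        CUDA_CHECK(cudaMemcpyAsync(h[ind], d[ind], bytes,
                                   cudaMemcpyDeviceToHost, s));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int ind = 0; ind < itr; ind++)
        for (int i = 0; i < n; i++)
            if (h[ind][i] != 2.0f * (float)ind) mismatches++;

    // Release everything with the call that matches its allocation.
    for (int ind = 0; ind < itr; ind++) {
        CUDA_CHECK(cudaFreeHost(h[ind]));
        CUDA_CHECK(cudaFree(d[ind]));
    }
    for (int s = 0; s < max_streams; s++)
        CUDA_CHECK(cudaStreamDestroy(stream[s]));
    free(h);
    free(d);
    free(stream);
    return mismatches;
}

int main() {
    int mismatches = runWithStreamReuse(20, 5);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Work issued to the same stream runs in order, so jobs that share a stream (for example jobs 0 and 5 when max_streams is 5) simply queue behind each other while jobs in different streams remain free to overlap.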
