Simple CUDA kernel add: illegal memory access after 2432 kernel calls

Question

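I built a simple CUDA kernel that sums elements. Each thread adds one input value to the output buffer, so each thread computes one value. I am using 2432 threads (19 blocks * 128 threads).

The output buffer stays the same, while the input buffer pointer is advanced by the number of threads after each kernel execution. So overall, a loop calls the add kernel until all input data has been processed.

Example: all my input values are set to 1. The output buffer size is 2432. The input buffer size is 2432 * 2000. The add kernel is called 2000 times, adding 1 to each field of the output. The final result in every field of the output is 2000. I call a function aggregate that contains a for loop and calls the kernel as often as needed to pass over the complete input data. So far this works, unless I call the kernel too often.

However, if I call the kernel 2500 times, I get an illegal memory access CUDA error.

As you can see, the runtime of the last successful kernel increases by 3 orders of magnitude. After that my pointers are invalidated, and the following calls result in cudaErrorIllegalAddress.

I cleaned up the code to get a minimal working example: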

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <algorithm>
#include <vector>
#include <stdio.h>
#include <iostream>

using namespace std;

template <class T> __global__ void addKernel_2432(int *in, int * out)
{
    int i = blockIdx.x * blockDim.x  + threadIdx.x;
    out[i] = out[i] + in[i];
}


static int aggregate(int* array, size_t size, int* out) {

    

    size_t const vectorCount = size / 2432;
    cout << "ITERATIONS: " << vectorCount << endl;
    
    
    for (size_t i = 0; i < vectorCount-1; i++)
    {

         addKernel_2432<int><<<19,128>>>(array, out);
        
        array += vectorCount;
       
    }
    addKernel_2432<int> << <19, 128 >> > (array, out);
    return 1;
    }

    int main()
    {
  
    int* dev_in1 = 0;
    size_t vectorCount = 2432;
    int * dev_out = 0;
    size_t datacount = 2432*2500;
   
    std::vector<int> hostvec(datacount);
   
    //create input buffer, filled with 1
    std::fill(hostvec.begin(), hostvec.end(), 1);
    
    //allocate input buffer and output buffer
    cudaMalloc(&dev_in1, datacount*sizeof(int));
    cudaMalloc(&dev_out, vectorCount * sizeof(int));

    //set output buffer to 0
    cudaMemset(dev_out, 0, vectorCount * sizeof(int));

    //copy input buffer to GPU
    cudaMemcpy(dev_in1, hostvec.data(), datacount * sizeof(int), cudaMemcpyHostToDevice);
    
    //call kernel datacount / vectorcount times
    aggregate(dev_in1, datacount, dev_out);
    
    //return data to check for correctness
    cudaMemcpy(hostvec.data(), dev_out, vectorCount*sizeof(int), cudaMemcpyDeviceToHost);
   
    if (cudaSuccess != cudaMemcpy(hostvec.data(), dev_out, vectorCount * sizeof(int), cudaMemcpyDeviceToHost))
    {
        cudaError err = cudaGetLastError();
        cout << " CUDA ERROR: " << cudaGetErrorString(err) << endl;
    }
    else
    {
        cout << "NO CUDA ERROR" << endl;
        cout << "RETURNED SUM DATA" << endl;
        for (int i = 0; i < 2432; i++)
        {
            cout << hostvec[i] << " ";
        }

    }
   
    cudaDeviceReset();
    return 0;
}

If you compile and run it, you get an error. Change:

size_t datacount = 2432*2500;

to

size_t datacount = 2432*2400;

and it gives the correct results.

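I am looking for any ideas as to why it breaks after 2432 kernel calls.

What I have found so far by googling: a wrong target architecture being set. I use a 1070 Ti. My target is set to compute_61,sm_61 in the Visual Studio project properties. This does not change anything.

What am I missing? Is there a limit to the number of times a kernel can be called before CUDA invalidates the pointer? Thank you for your help. I am using Windows, Visual Studio 2019 and CUDA runtime 11.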

This is the output in both cases, success and failure:

[image: success, 2400 elements]

[image: error, 2500 elements]

Answer 1

static int aggregate(int* array, size_t size, int* out) {
    size_t const vectorCount = size / 2432;
    for (size_t i = 0; i < vectorCount-1; i++)
    {
        array += vectorCount;
    }
}

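That is not a vector count; it is the number of iterations, and you are accidentally advancing the pointer by it. This works fine (but produces wrong results) while vectorCount <= 2432, and leads to a buffer overflow above that.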

array += 2432 is what you intended to write.

Solution 1
1 Accepted 2020-08-05 12:16:41
