
Simple cuda kernel add: Illegal memory after 2432 kernel calls

I built a simple cuda kernel that performs a sum on elements. Each thread adds an input value to an output buffer. Each thread calculates one value. 2432 threads are being used (19 blocks * 128 threads).

The output buffer remains the same; the input buffer pointer is shifted by the thread count after each kernel execution. So in total, we have a loop invoking the add kernel until all input data has been processed.

Example: all my input values are set to 1. The output buffer size is 2432. The input buffer size is 2432 * 2000. The add kernel is called 2000 times to add 1 to each field of the output. The end result in the output is 2000 at every field. I call the function aggregate, which contains a for loop that calls the kernel as often as needed to pass over the complete input data. This works so far, unless I call the kernel too often.

However, if I call the kernel 2500 times, I get an illegal memory access cuda error.

[Screenshot: Nsight cuda analysis]

As you can see, the runtime of the last successful kernel increases by three orders of magnitude. Afterwards my pointers are invalidated and the following invocations result in cudaErrorIllegalAddress.
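One way to localize failures like this is to check for errors after every launch instead of only at the final cudaMemcpy. This is a sketch, not part of the original post; CHECK is a hypothetical helper macro wrapping the standard runtime calls:

```cuda
// Hedged sketch: report the first failing iteration immediately.
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",         \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Inside the launch loop from aggregate:
//     addKernel_2432<int><<<19, 128>>>(array, out);
//     CHECK(cudaGetLastError());       // catches launch-configuration errors
//     CHECK(cudaDeviceSynchronize());  // catches asynchronous execution errors
```

Because kernel launches are asynchronous, the synchronize call is what surfaces the illegal access at the launch that actually caused it.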

I cleaned up the code to get a minimal working example:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <vector>
#include <stdio.h>
#include <iostream>

using namespace std;

template <class T> __global__ void addKernel_2432(int *in, int * out)
{
    int i = blockIdx.x * blockDim.x  + threadIdx.x;
    out[i] = out[i] + in[i];
}


static int aggregate(int* array, size_t size, int* out) {

    size_t const vectorCount = size / 2432;
    cout << "ITERATIONS: " << vectorCount << endl;

    for (size_t i = 0; i < vectorCount - 1; i++)
    {
        addKernel_2432<int><<<19, 128>>>(array, out);
        array += vectorCount;
    }
    addKernel_2432<int><<<19, 128>>>(array, out);
    return 1;
}

int main()
{
    int* dev_in1 = 0;
    size_t vectorCount = 2432;
    int * dev_out = 0;
    size_t datacount = 2432*2500;
   
    std::vector<int> hostvec(datacount);
   
    //create input buffer, filled with 1
    std::fill(hostvec.begin(), hostvec.end(), 1);
    
    //allocate input buffer and output buffer
    cudaMalloc(&dev_in1, datacount*sizeof(int));
    cudaMalloc(&dev_out, vectorCount * sizeof(int));

    //set output buffer to 0
    cudaMemset(dev_out, 0, vectorCount * sizeof(int));

    //copy input buffer to GPU
    cudaMemcpy(dev_in1, hostvec.data(), datacount * sizeof(int), cudaMemcpyHostToDevice);
    
    //call kernel datacount / vectorcount times
    aggregate(dev_in1, datacount, dev_out);
    
    //return data to check for correctness
    if (cudaSuccess != cudaMemcpy(hostvec.data(), dev_out, vectorCount * sizeof(int), cudaMemcpyDeviceToHost))
    {
        cudaError err = cudaGetLastError();
        cout << " CUDA ERROR: " << cudaGetErrorString(err) << endl;
    }
    else
    {
        cout << "NO CUDA ERROR" << endl;
        cout << "RETURNED SUM DATA" << endl;
        for (int i = 0; i < 2432; i++)
        {
            cout << hostvec[i] << " ";
        }

    }
   
    cudaDeviceReset();
    return 0;
}

If you compile and run it, you get an error. Change:

size_t datacount = 2432 * 2500;

to

size_t datacount = 2432 * 2400;

and it gives the correct results.

I am looking for any ideas why it breaks after 2432 kernel invocations.

What I have found so far googling around: wrong target architecture set. I use a 1070 Ti. My target is set to compute_61,sm_61 in the Visual Studio project properties. That does not change anything.

Did I miss something? Is there a limit to how many times a kernel can be called before cuda invalidates pointers? Thank you for your help. I used Windows, Visual Studio 2019 and CUDA runtime 11.

This is the output in both cases, success and failure:

[Screenshot: success with 2400 elements]

[Screenshot: error with 2500 elements]

static int aggregate(int* array, size_t size, int* out) {
    size_t const vectorCount = size / 2432;
    for (size_t i = 0; i < vectorCount-1; i++)
    {
        array += vectorCount;
    }
}

That's not vectorCount but the number of iterations that you have accidentally been incrementing by. It works fine while vectorCount <= 2432 (but yields wrong results), and results in a buffer overflow above that.

array += 2432 is what you intended to write.
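With that fix applied, the aggregate function from the question might look like this (a sketch that keeps the original hard-coded 2432 and launch configuration, and folds the trailing launch into the loop):

```cuda
static int aggregate(int* array, size_t size, int* out) {
    size_t const vectorCount = size / 2432;
    for (size_t i = 0; i < vectorCount; i++)
    {
        addKernel_2432<int><<<19, 128>>>(array, out);
        array += 2432;  // advance by the elements consumed per launch, not by the launch count
    }
    return 1;
}
```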
