Simple CUDA kernel add: illegal memory access after 2432 kernel calls
I built a simple CUDA kernel that performs a sum on elements. Each thread adds one input value to an output buffer, so each thread calculates one value. 2432 threads are used (19 blocks * 128 threads).
The output buffer stays the same; the input buffer pointer is shifted by the thread count after each kernel execution. So in total, a loop invokes the add kernel until all input data has been processed.
Example: all my input values are set to 1. The output buffer size is 2432 and the input buffer size is 2432 * 2000. The add kernel is called 2000 times to add 1 to each field of the output, so the end result is 2000 in every field. I call the function aggregate, which contains a for loop that calls the kernel as often as needed to pass over the complete input data. This works unless I call the kernel too often.
However, if I call the kernel 2500 times, I get an illegal memory access CUDA error.
The runtime of the last successful kernel increases by 3 orders of magnitude. Afterwards my pointers are invalidated and the following invocations result in cudaErrorIllegalAddress.
I cleaned up the code to get a minimal working example:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <vector>
#include <stdio.h>
#include <iostream>
using namespace std;
template <class T> __global__ void addKernel_2432(int *in, int * out)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
out[i] = out[i] + in[i];
}
static int aggregate(int* array, size_t size, int* out) {
size_t const vectorCount = size / 2432;
cout << "ITERATIONS: " << vectorCount << endl;
for (size_t i = 0; i < vectorCount-1; i++)
{
addKernel_2432<int><<<19,128>>>(array, out);
array += vectorCount;
}
addKernel_2432<int> << <19, 128 >> > (array, out);
return 1;
}
int main()
{
int* dev_in1 = 0;
size_t vectorCount = 2432;
int * dev_out = 0;
size_t datacount = 2432*2500;
std::vector<int> hostvec(datacount);
//create input buffer, filled with 1
std::fill(hostvec.begin(), hostvec.end(), 1);
//allocate input buffer and output buffer
cudaMalloc(&dev_in1, datacount*sizeof(int));
cudaMalloc(&dev_out, vectorCount * sizeof(int));
//set output buffer to 0
cudaMemset(dev_out, 0, vectorCount * sizeof(int));
//copy input buffer to GPU
cudaMemcpy(dev_in1, hostvec.data(), datacount * sizeof(int), cudaMemcpyHostToDevice);
//call kernel datacount / vectorcount times
aggregate(dev_in1, datacount, dev_out);
//return data to check for corectness
cudaMemcpy(hostvec.data(), dev_out, vectorCount*sizeof(int), cudaMemcpyDeviceToHost);
if (cudaSuccess != cudaMemcpy(hostvec.data(), dev_out, vectorCount * sizeof(int), cudaMemcpyDeviceToHost))
{
cudaError err = cudaGetLastError();
cout << " CUDA ERROR: " << cudaGetErrorString(err) << endl;
}
else
{
cout << "NO CUDA ERROR" << endl;
cout << "RETURNED SUM DATA" << endl;
for (int i = 0; i < 2432; i++)
{
cout << hostvec[i] << " ";
}
}
cudaDeviceReset();
return 0;
}
If you compile and run it, you get an error. Change:
size_t datacount = 2432 * 2500;
to
size_t datacount = 2432 * 2400;
and it gives the correct results.
I am looking for any ideas as to why it breaks after 2432 kernel invocations.
What I have found so far by googling around: wrong target architecture set. I use a 1070 Ti. My target is set to compute_61,sm_61 in the Visual Studio project properties. That does not change anything.
Did I miss something? Is there a limit on how many times a kernel can be called before CUDA invalidates the pointer?
Thank you for your help. I used Windows, Visual Studio 2019 and CUDA runtime 11.
This is the output in both cases, success and failure:
[Screenshot: output of the successful run]
Error: [Screenshot: output of the failing run]
static int aggregate(int* array, size_t size, int* out) {
size_t const vectorCount = size / 2432;
for (size_t i = 0; i < vectorCount-1; i++)
{
array += vectorCount;
}
}
That's not vectorCount but the number of iterations that you have been accidentally incrementing by. It runs without a fault while vectorCount <= 2432 (but yields wrong results), and overflows the input buffer above that.
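To see where the threshold of 2432 comes from, the offsets can be worked out on the host. The sketch below is plain C++ with no GPU calls. It assumes the launch pattern from the question (one launch per iteration, each reading 2432 consecutive ints from the current pointer), and the helper names endOfLastRead, staysInBounds and firstBadLaunch are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Ints read per launch: 19 blocks * 128 threads, as in the question.
constexpr std::size_t kThreads = 19 * 128;  // 2432

// One past the highest input index read when the pointer advances by
// `stride` elements between each of `launches` kernel launches.
std::size_t endOfLastRead(std::size_t launches, std::size_t stride) {
    assert(launches > 0);
    return (launches - 1) * stride + kThreads;
}

// True if every launch stays inside an input buffer of launches * kThreads ints.
bool staysInBounds(std::size_t launches, std::size_t stride) {
    return endOfLastRead(launches, stride) <= launches * kThreads;
}

// 0-based index of the first launch that reads past the end of the buffer,
// or `launches` if no launch does.
std::size_t firstBadLaunch(std::size_t launches, std::size_t stride) {
    const std::size_t size = launches * kThreads;
    for (std::size_t k = 0; k < launches; ++k) {
        if (k * stride + kThreads > size) return k;
    }
    return launches;
}
```

With the buggy stride (stride equal to the iteration count), staysInBounds(2400, 2400) and staysInBounds(2432, 2432) hold, while staysInBounds(2500, 2500) does not. firstBadLaunch(2500, 2500) is 2432, so 2432 launches complete before the first out-of-bounds read, which matches the number in the question's title. With the intended stride, staysInBounds(2500, kThreads) holds.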
array += 2432
is what you intended to write.
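The effect of the stride on the result can be checked with a small host-side model. This is only a sketch in plain C++, not the GPU code itself: aggregateHost stands in for the launch loop, makeChunkIndexedInput is a hypothetical test helper, and the input is filled with each element's chunk number instead of 1 so that reads from the wrong chunk show up in the sums.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Elements handled per launch: 19 blocks * 128 threads, as in the question.
constexpr std::size_t kChunk = 19 * 128;  // 2432

// Test input whose value is its chunk number (0 for the first 2432 ints,
// 1 for the next 2432, and so on).
std::vector<int> makeChunkIndexedInput(std::size_t launches) {
    std::vector<int> input(launches * kChunk);
    for (std::size_t j = 0; j < input.size(); ++j)
        input[j] = static_cast<int>(j / kChunk);
    return input;
}

// Host stand-in for the launch loop; `stride` is how far the input
// pointer moves between launches.
std::vector<int> aggregateHost(const std::vector<int>& input, std::size_t stride) {
    std::vector<int> out(kChunk, 0);
    const std::size_t launches = input.size() / kChunk;
    for (std::size_t k = 0; k < launches; ++k) {
        const std::size_t offset = k * stride;
        assert(offset + kChunk <= input.size());  // bounds check the kernel does not have
        for (std::size_t i = 0; i < kChunk; ++i)  // one iteration per "thread"
            out[i] += input[offset + i];
    }
    return out;
}
```

For makeChunkIndexedInput(4), a stride of kChunk gives 6 (0 + 1 + 2 + 3) in every field, while a stride of 4 (the iteration count, as in the question) gives 0 in the first field and 3 in the last. With an all-ones input both strides give 4 everywhere, which is presumably why the 2400-iteration run in the question looked correct.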