CUDA illegal memory access on kernel

I am trying to implement a basic device array type in CUDA, as an exercise. As a design goal, it should mimic the std::array interface. While implementing operator+, I am getting an illegal memory access error and I can't decipher why. Here is the code.

#include <iostream>
#include <array>

enum class memcpy_t {
    host_to_host,
    host_to_device,
    device_to_host,
    device_to_device
};

bool check_cuda_err() {
    cudaError_t err = cudaGetLastError();
    if(err == cudaSuccess) {
        return true;
    }
    else {
        std::cerr << "Cuda Error: " << cudaGetErrorString(err) << "\n" << std::flush;
        return false;
    }
}

template <typename T, std::size_t N>
struct cuda_allocator {
    using pointer = T*;

    static void allocate(T *&dev_mem) {
        cudaMalloc(&dev_mem, N * sizeof(T));
    }

    static void deallocate(T *dev_mem) {
        cudaFree(dev_mem);
    }

    template <memcpy_t ct>
    static void copy (T *dst, T *src) {
        switch(ct) {
        case memcpy_t::host_to_host:
            cudaMemcpy(dst, src, N * sizeof(T), cudaMemcpyHostToHost);
            break;
        case memcpy_t::host_to_device:
            cudaMemcpy(dst, src, N * sizeof(T), cudaMemcpyHostToDevice);
            break;
        case memcpy_t::device_to_host:
            cudaMemcpy(dst, src, N * sizeof(T), cudaMemcpyDeviceToHost);
            break;
        case memcpy_t::device_to_device:
            cudaMemcpy(dst, src, N * sizeof(T), cudaMemcpyDeviceToDevice);
            break;
        default:
            break;
        }
    }
};

template <typename T, std::size_t N>
struct gpu_array {
    using allocator = cuda_allocator<T, N>;
    using pointer = typename allocator::pointer;
    using value_type = T;
    using iterator = T*;
    using const_iterator = T const*;

    gpu_array() {
       allocator::allocate(data);
    }

    gpu_array(std::array<T, N> host_arr) {
        allocator::allocate(data);
        allocator::template copy<memcpy_t::host_to_device>(data, host_arr.begin());
    }

    gpu_array& operator=(gpu_array const& o) {
        //allocator::allocate(data);
        allocator::template copy<memcpy_t::device_to_device>(data, o.begin());
    }

    operator std::array<T, N>() {
        std::array<T, N> res;
        allocator::template copy<memcpy_t::device_to_host>(res.begin(), data);
        return res;
    }

    ~gpu_array() {
        allocator::deallocate(data);
    }

    __device__ iterator begin() { return data; }
    __device__ iterator end() { return data + N; }
    __device__ const_iterator begin() const { return data; }
    __device__ const_iterator end() const { return data + N; }

private:
    T* data;
};

template <typename T, std::size_t N>
__global__ void add_kernel(gpu_array<T,N> **r,
                           gpu_array<T,N> const* a1,
                           gpu_array<T,N> const* a2) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    printf("Index: %d\n", i);
    (*r)->begin()[i] = a1->begin()[i] + a2->begin()[i];
}

template <typename T, std::size_t N>
gpu_array<T, N> operator+(gpu_array<T,N> const&a1,
                          gpu_array<T,N> const&a2)
{
    gpu_array<T, N> *res = new gpu_array<T, N>;
    add_kernel<<<(N+3)/4, 4>>>(&res, &a1, &a2);
    cudaDeviceSynchronize();
    check_cuda_err();
    // ignore memory leak for now
    return *res;
}
const int N = 1<<3;

int main() {
    std::array<float, N> x,y;

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    } 

    gpu_array<float, N> dx{x};
    gpu_array<float, N> dy{y};
    check_cuda_err(); // shows no error for memcpy
    std::array<float, N> res = dx + dy;

    for(const auto& elem : res) {
        std::cout << elem << ", ";
    }
}

I am creating an array of size 8 to test things. As you can see, check_cuda_err() shows no error after the gpu_array objects are initialized from the host arrays, so I am guessing the data is copied correctly. But in the kernel, when I index the device arrays, I get an illegal memory access error. Here is the output:

Index: 0
Index: 1
Index: 2
Index: 3
Index: 4
Index: 5
Index: 6
Index: 7
Cuda Error: an illegal memory access was encountered
9.45143e-39, 0, 6.39436e-39, 0, 0, 0, 0, 0,

As you can see, I've printed the computed index for each thread and nothing seems to be out of bounds. So, what might cause this illegal memory access error? By the way, cuda-memcheck says:

Invalid global read of size 8

and later

Address 0x7fff9f4c6ec0 is out of bounds

but I've printed the indices, so I don't understand why the access is out of bounds.

We have seen two versions of the code in this question, and unfortunately both have different versions of the same problem.

The first used references as arguments to the kernel:

template <typename T, std::size_t N>
__global__ void add_kernel(gpu_array<T,N> &r,
                           gpu_array<T,N> const& a1,
                           gpu_array<T,N> const& a2) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    printf("Index: %d\n", i);
    r.begin()[i] = a1.begin()[i] + a2.begin()[i];
}

template <typename T, std::size_t N>
gpu_array<T, N> operator+(gpu_array<T,N> const& a1,
                          gpu_array<T,N> const& a2)
{
    gpu_array<T, N> res;
    add_kernel<<<(N+3)/4, 4>>>(res, a1, a2);
    cudaDeviceSynchronize();
    check_cuda_err();
    return res;
}

While this is clean and elegant, and references are fully supported in CUDA kernel code, passing kernel arguments by reference from the host leaves the device working with host addresses, because the CUDA toolchain, like every other C++ compiler I am aware of, implements references using pointers. The result is a kernel runtime error for illegal addresses.
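The failure mode can be reproduced in isolation. The following is a minimal sketch, not part of the original question, and it assumes (as the question's first version demonstrates) a toolchain that accepts a reference as a __global__ parameter and lowers it to a pointer:

#include <cstdio>

// The reference is lowered to a pointer, so the kernel receives the
// *host* address of x and dereferences it on the device.
__global__ void read_by_ref(int const& x) {
    printf("x = %d\n", x); // reads through a host pointer: illegal access
}

int main() {
    int x = 42;               // x lives in host memory
    read_by_ref<<<1, 1>>>(x); // the launch stub effectively passes &x
    cudaDeviceSynchronize();  // the kernel faults, as in the question
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
}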

The second uses pointer indirection instead of references and winds up passing host pointers to the GPU, which fails pretty much identically to the first version:

template <typename T, std::size_t N>
__global__ void add_kernel(gpu_array<T,N> **r,
                           gpu_array<T,N> const* a1,
                           gpu_array<T,N> const* a2) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    printf("Index: %d\n", i);
    (*r)->begin()[i] = a1->begin()[i] + a2->begin()[i];
}

template <typename T, std::size_t N>
gpu_array<T, N> operator+(gpu_array<T,N> const&a1,
                          gpu_array<T,N> const&a2)
{
    gpu_array<T, N> *res = new gpu_array<T, N>;
    add_kernel<<<(N+3)/4, 4>>>(&res, &a1, &a2); 
    cudaDeviceSynchronize();
    check_cuda_err();
    // ignore memory leak for now
    return *res;
}

The only safe way to pass this structure directly to a device kernel is by value. However, that means the copies will fall out of scope and trigger destruction, which will deallocate the memory backing the arrays and result in unexpected errors of a different kind.
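One common way around this, sketched below under assumptions that go beyond the original code, is to keep gpu_array as the owning, host-side handle and pass a trivially copyable, non-owning view by value. The names gpu_span and span() are introduced here purely for illustration:

// Non-owning view over device memory: trivially copyable and without a
// destructor, so kernel-side copies cannot free anything.
template <typename T, std::size_t N>
struct gpu_span {
    T* data; // raw device pointer, owned by a gpu_array elsewhere

    __device__ T& operator[](std::size_t i) { return data[i]; }
    __device__ T const& operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical accessor added to gpu_array (data is its private member):
//     gpu_span<T, N> span() { return {data}; }

template <typename T, std::size_t N>
__global__ void add_kernel(gpu_span<T,N> r,
                           gpu_span<T,N> a1,
                           gpu_span<T,N> a2) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) // guard in case the grid is rounded up past N
        r[i] = a1[i] + a2[i];
}

template <typename T, std::size_t N>
gpu_array<T, N> operator+(gpu_array<T,N>& a1, // non-const only because the
                          gpu_array<T,N>& a2) // illustrative span() is non-const
{
    gpu_array<T, N> res;
    add_kernel<<<(N+3)/4, 4>>>(res.span(), a1.span(), a2.span());
    cudaDeviceSynchronize();
    check_cuda_err();
    return res;
}

Ownership and the cudaFree call stay with the single gpu_array on the host, so the by-value view copies are harmless. Note that returning res by value still relies on gpu_array having sane copy semantics, which is a separate problem from the one discussed in this answer.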
