
cudaMemcpy doesn't work

In the following code, cudaMemcpy doesn't work: it returns an error and the program exits. What could the problem be? I don't seem to be doing anything illegal, and the sizes of the vectors look fine to me.

The algorithm may do something wrong at some point, but I believe the idea is correct. The code sums n numbers by computing partial sums in parallel and then iterating on the result.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>

__device__ int aug_vec(int *vec, const int& i, const int& size) {
    return (i >= size) ? 0 : vec[i];
}

__global__ void sumVectorElements(int *vec,const int& size) {
    const int i = (blockDim.x*blockIdx.x + threadIdx.x);
    vec[i] = aug_vec(vec, 2*i, size) + aug_vec(vec, 2 * i + 1, size);
}

__host__ int parallel_sum(int *vec,const int& size) {

    cudaError_t err;
    int *d_vec, *cp_vec;
    int n_threads = (size >> 1) + (size & 1);

    cp_vec = new int[size];
    err = cudaMalloc((void**)&d_vec, size * sizeof(int));

    if (err != cudaSuccess) {
        std::cout << "error in cudaMalloc!" << std::endl;
        exit(1);
    }

    err = cudaMemcpy(d_vec, vec, size*sizeof(int), cudaMemcpyHostToDevice);

    if (err != cudaSuccess) {
        std::cout << "error in cudaMemcpy!" << std::endl;
        exit(1);
    }

    int curr_size = size;
    while (curr_size > 1) {
        std::cout << "size = " << curr_size << std::endl;
        sumVectorElements<<<1,n_threads>>>(d_vec, curr_size);
        curr_size = (curr_size >> 1) + (curr_size & 1);
    }

    err = cudaMemcpy(cp_vec, d_vec, size*sizeof(int), cudaMemcpyDeviceToHost); //THIS LINE IS THE PROBLEM!

    if (err != cudaSuccess) {
        std::cout << "error in cudaMemcpy" << std::endl;
        exit(1);
    }

    err = cudaFree(d_vec);

    if (err != cudaSuccess) {
        std::cout << "error in cudaFree" << std::endl;
        exit(1);
    }

    int rval = cp_vec[0];

    delete[] cp_vec;

    return rval;
}

int main(int argc, char **argv) {
    const int n_blocks = 1;
    const int n_threads_per_block = 12;

    int vec[12] = { 0 };
    for (auto i = 0; i < n_threads_per_block; ++i) vec[i] = i + 1;
    int sum = parallel_sum(vec, n_threads_per_block);
    std::cout << "Sum = " << sum << std::endl;

    system("pause");

    return 0;
}

The cudaMemcpy operation after the kernel is actually asynchronously reporting an error that is due to the kernel execution. Your error reporting is primitive: when you have an error code, you can get much more useful information by passing it to cudaGetErrorString() and printing the result.
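As a sketch of that suggestion, a small wrapper macro can report a readable message for every runtime call and surface asynchronous kernel errors (the name `CUDA_CHECK` is my own convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Hypothetical helper: wrap any call returning cudaError_t and print a
// readable message via cudaGetErrorName()/cudaGetErrorString() on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err__ = (call);                                       \
        if (err__ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n",          \
                         cudaGetErrorName(err__), __FILE__, __LINE__,     \
                         cudaGetErrorString(err__));                      \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc((void**)&d_vec, size * sizeof(int)));
//   sumVectorElements<<<1, n_threads>>>(d_vec, curr_size);
//   CUDA_CHECK(cudaGetLastError());      // catches kernel launch errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous kernel errors
```

Checking `cudaGetLastError()` plus `cudaDeviceSynchronize()` right after the launch would have pinpointed the failure at the kernel rather than at the later cudaMemcpy.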

The error is occurring in the kernel due to the use of the reference argument:

__global__ void sumVectorElements(int *vec,const int& size) {
                                           ^^^^^^^^^^^^^^^

Any argument you pass to a kernel and expect to be usable in kernel code must refer to data that is passed by value, or else to data that is accessible/referenceable from device code. For example, passing a host pointer to device code is generally not legal in CUDA, because an attempt to dereference a host pointer in device code will fail.

The exceptions to the above would be data/pointers/references that are accessible in device code. Unified memory and pinned/mapped data are two examples, neither of which is being used here.

As a result, the reference parameter involves a reference (an address, basically) to an item (size) in host memory. When the kernel code attempts to use this item, it must first dereference it. Dereferencing a host item in device code is illegal in CUDA (unless using UM or pinned memory).

The solution in this case is simple: convert to an ordinary pass-by-value situation:

__global__ void sumVectorElements(int *vec,const int size) ...
                                                    ^
                                                 remove ampersand
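Applied to the code in the question, the kernel would then read as follows (a sketch; only the reference-to-value change differs from the original):

```cuda
// size is now received by value, so the kernel never has to
// dereference an address in host memory.
__global__ void sumVectorElements(int *vec, const int size) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    vec[i] = aug_vec(vec, 2 * i, size) + aug_vec(vec, 2 * i + 1, size);
}
```

Note that the `const int&` parameters of the `__device__` function `aug_vec` are fine as-is: they refer to values that already live in device code. Likewise the host function `parallel_sum(int *vec, const int& size)` is unaffected, since it runs entirely on the host.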
