
CUDA cudaMemcpy doesn't appear to copy despite cudaSuccess

I'm just starting with CUDA and this is my very first project. I've done a search for this issue, and while I've noticed other people have had similar problems, none of the suggestions seemed relevant to my specific issue or helped in my case.

As an exercise, I'm trying to write an n-body simulation using CUDA. At this stage I'm not interested in whether my specific implementation is efficient or not; I'm just looking for something that works, and I can refine it later. Once it's working, I'll also need to update the code to work on my SLI configuration.

Here's a brief outline of the process:

  1. Create X and Y position, velocity, and acceleration vectors.
  2. Create the same vectors on the GPU and copy the values across.
  3. In a loop: (i) calculate the acceleration for the iteration, (ii) apply the acceleration to the velocities and positions, and (iii) copy the positions back to the host for display.

(Display not implemented yet. I'll do this later)

Don't worry about the acceleration calculation function for now; here is the update function:

__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
    int i = threadIdx.x;

    if (i < N);
    {
        vel_x[i] += acc_x[i];
        vel_y[i] += acc_y[i];

        pos_x[i] += vel_x[i];
        pos_y[i] += vel_y[i];
    }
}

And here's some of the code in the main method:

cudaError t;

t = cudaMalloc(&d_pos_x, N * sizeof(double));
t = cudaMalloc(&d_pos_y, N * sizeof(double));
t = cudaMalloc(&d_vel_x, N * sizeof(double));
t = cudaMalloc(&d_vel_y, N * sizeof(double));
t = cudaMalloc(&d_acc_x, N * sizeof(double));
t = cudaMalloc(&d_acc_y, N * sizeof(double));

t = cudaMemcpy(d_pos_x, pos_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_pos_y, pos_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_x, vel_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_y, vel_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_x, acc_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_y, acc_y, N * sizeof(double), cudaMemcpyHostToDevice);

while (true)
{
    calc_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
    apply_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);

    t = cudaMemcpy(pos_x, d_pos_x, N * sizeof(double), cudaMemcpyDeviceToHost);
    t = cudaMemcpy(pos_y, d_pos_y, N * sizeof(double), cudaMemcpyDeviceToHost);

    std::cout << pos_x[0] << std::endl;
}

Every loop, cout writes the same value: whatever random value it was set to when the position arrays were originally created. If I change the code in apply_acc to something like:

__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
    int i = threadIdx.x;

    if (i < N);
    {
        pos_x[i] += 1.0;
        pos_y[i] += 1.0;
    }
}

then it still gives the same value, so either apply_acc isn't being called or the cudaMemcpy isn't copying the data back.

All the cudaMalloc and cudaMemcpy calls return cudaSuccess.

Here's a PasteBin link to the complete code. It should be fairly simple to follow, as there's a lot of repetition for the various arrays.

Like I said, I've never written CUDA code before, and I wrote this based on the #2 CUDA example video from NVidia, where the presenter writes the parallel array addition code. I'm not sure if it makes any difference, but I'm using 2x GTX 970s with the latest NVidia drivers and CUDA 7.0 RC, and I chose not to install the bundled drivers when installing CUDA as they were older than what I had.

This won't work:

const int N = 100000;
...
calc_acc<<<1, N>>>(...);
apply_acc<<<1, N>>>(...);

The second parameter of a kernel launch config (<<<...>>>) is the threads per block parameter. It is limited to either 512 or 1024 depending on how you are compiling. These kernels will not launch, and the type of error this produces needs to be caught by using correct CUDA error checking. Simply looking at the return values of subsequent CUDA API functions will not indicate the presence of this type of error (which is why you are seeing cudaSuccess subsequently).
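
For example, a minimal check (a sketch, reusing the kernel and pointer names from the question) could look like this; cudaGetLastError catches launch errors such as an invalid configuration, and cudaDeviceSynchronize surfaces errors that occur while the kernel actually runs:

apply_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);

// Did the launch itself fail (e.g. invalid launch configuration)?
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    std::cerr << "launch failed: " << cudaGetErrorString(err) << std::endl;

// Did the kernel fail while executing (e.g. bad memory access)?
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    std::cerr << "kernel failed: " << cudaGetErrorString(err) << std::endl;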

Regarding the concept itself, I suggest you learn more about the CUDA thread and block hierarchy. To launch a large number of threads, you need to use both parameters of the kernel launch config (i.e. neither of the first two parameters should be 1). This is usually advisable from a performance perspective as well.
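
As a sketch of that idea (not the original code; threadsPerBlock is an assumed block size here), the N bodies can be covered by several blocks, with each thread computing a global index from its block and thread indices:

const int threadsPerBlock = 256;                                 // assumed block size, within the 512/1024 limit
const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up so all N elements are covered

calc_acc<<<blocks, threadsPerBlock>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
apply_acc<<<blocks, threadsPerBlock>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);

and, inside each kernel, the thread index becomes:

int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index across all blocks

if (i < N)
{
    // per-element work as before
}

With N = 100000 and 256 threads per block, this launches 391 blocks, which is well within the hardware's grid-size limits.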
