Accessing dynamically allocated arrays on the device (without passing them as kernel arguments)

Question

How can an array of structs that has been dynamically allocated on the host be used by a kernel, without passing the array of structs as a kernel argument? This seems like a common procedure with a good amount of documentation online, yet it doesn't work in the following program.

Note: The following questions were studied before posting this one:

1) copying host memory to cuda __device__ variable
2) Global variable in CUDA
3) Is there any way to dynamically allocate constant memory? CUDA

So far, unsuccessful attempts have been made to:

1) Dynamically allocate the array of structs with cudaMalloc(), then
2) Use cudaMemcpyToSymbol() with the pointer returned from cudaMalloc() to copy it to a __device__ variable which can be used by the kernel.

Code attempt:

NBody.cu (error checking using cudaStatus has mostly been omitted for better readability, and the function that reads data from a file into the dynamic array has been removed):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

#define BLOCK 256

struct nbody {
    float x, y, vx, vy, m;
};
typedef struct nbody nbody;

// Global declarations
nbody* particle;

// Device variables
__device__ unsigned int d_N;  // Kernel can successfully access this
__device__ nbody d_particle;  // Update: part of problem was here with (*)

// Aim of kernel: to print contents of array of structs without using kernel argument
__global__ void step_cuda_v1() {
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < d_N) {
        printf("%.f\n", d_particle.x);
    }
}

int main() {
    unsigned int N = 10;
    unsigned int I = 1;

    cudaMallocHost((void**)&particle, N * sizeof(nbody)); // Host allocation

    cudaError_t cudaStatus;
    for (int i = 0; i < N; i++) particle[i].x = i;

    nbody* particle_buf; // device buffer
    cudaSetDevice(0);

    cudaMalloc((void**)&particle_buf, N * sizeof(nbody)); // Allocate device mem
    cudaMemcpy(particle_buf, particle, N * sizeof(nbody), cudaMemcpyHostToDevice); // Copy data into device mem
    cudaMemcpyToSymbol(d_particle, &particle_buf, sizeof(nbody*)); // Copy pointer to data into __device__ var
    cudaMemcpyToSymbol(d_N, &N, sizeof(unsigned int)); // This works fine

    int NThreadBlock = (N + BLOCK - 1) / BLOCK;
    for (int iteration = 0; iteration <= I; iteration++) {

        step_cuda_v1<<<NThreadBlock, BLOCK>>>();
        //step_cuda_v1<<<1, 5>>>(particle_buf);
        cudaDeviceSynchronize();
        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "ERROR: %s\n", cudaGetErrorString(cudaStatus));
            exit(-1);
        }
    }
    return 0;
}

OUTPUT:

"ERROR: kernel launch failed."

Summary:

How can I print the contents of the array of structs from the kernel, without passing it as a kernel argument?
Coding in C using VS2019 with CUDA 10.2.

Answer 1

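With the help of @Robert Crovella and @talonmies, here is the solution, which outputs a sequence that cycles from 0 to 9 repeatedly. The change is that d_particle is now declared as a pointer (nbody*) rather than a single nbody, so the pointer copied by cudaMemcpyToSymbol() fits the symbol, and the kernel indexes it as d_particle[i].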

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

#define BLOCK 256

//#include "Nbody.h"
struct nbody {
    float x, y, vx, vy, m;
};
typedef struct nbody nbody;

// Global declarations
nbody* particle;

// Device variables
__device__ unsigned int d_N;  // Kernel can successfully access this
__device__ nbody* d_particle;
//__device__ nbody d_particle;  // Update: part of problem was here with (*)

// Aim of kernel: to print contents of array of structs without using kernel argument
__global__ void step_cuda_v1() {
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < d_N) {
        printf("%.f\n", d_particle[i].x);
    }
}

int main() {
    unsigned int N = 10;
    unsigned int I = 1;

    cudaMallocHost((void**)&particle, N * sizeof(nbody)); // Host allocation

    cudaError_t cudaStatus;
    for (int i = 0; i < N; i++) particle[i].x = i;

    nbody* particle_buf; // device buffer
    cudaSetDevice(0);

    cudaMalloc((void**)&particle_buf, N * sizeof(nbody)); // Allocate device mem
    cudaMemcpy(particle_buf, particle, N * sizeof(nbody), cudaMemcpyHostToDevice); // Copy data into device mem
    cudaMemcpyToSymbol(d_particle, &particle_buf, sizeof(nbody*)); // Copy pointer to data into __device__ var
    cudaMemcpyToSymbol(d_N, &N, sizeof(unsigned int)); // This works fine

    int NThreadBlock = (N + BLOCK - 1) / BLOCK;
    for (int iteration = 0; iteration <= I; iteration++) {

        step_cuda_v1<<<NThreadBlock, BLOCK>>>();
        //step_cuda_v1<<<1, 5>>>(particle_buf);
        cudaDeviceSynchronize();
        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "ERROR: %s\n", cudaGetErrorString(cudaStatus));
            exit(-1);
        }
    }
    return 0;
}
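
For anyone who wants to confirm that the kernel really reaches the buffer through the __device__ pointer, here is a minimal sketch of the same pattern with every runtime call checked and the result copied back to the host. It assumes a single device and a small N. The CUDA_CHECK macro, the double_x kernel and the double_on_device helper are hypothetical names introduced for illustration; they are not part of the original program.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): abort on any CUDA runtime error.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

struct nbody {
    float x, y, vx, vy, m;
};

__device__ unsigned int d_N;    // number of elements
__device__ nbody* d_particle;   // holds a device address, not the data itself

// Illustrative kernel: doubles x in place, reaching the array only via d_particle.
__global__ void double_x() {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < d_N) {
        d_particle[i].x *= 2.0f;
    }
}

// Copies h[0..n) to the device, runs the kernel, and copies the result back.
void double_on_device(nbody* h, unsigned int n) {
    nbody* buf = NULL;
    CUDA_CHECK(cudaMalloc((void**)&buf, n * sizeof(nbody)));
    CUDA_CHECK(cudaMemcpy(buf, h, n * sizeof(nbody), cudaMemcpyHostToDevice));

    // Copy the pointer value itself, so the size is sizeof(nbody*).
    CUDA_CHECK(cudaMemcpyToSymbol(d_particle, &buf, sizeof(nbody*)));
    CUDA_CHECK(cudaMemcpyToSymbol(d_N, &n, sizeof(unsigned int)));

    double_x<<<(n + 255) / 256, 256>>>();
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(h, buf, n * sizeof(nbody), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(buf));
}

int main() {
    const unsigned int N = 10;
    nbody particle[N] = {};
    for (unsigned int i = 0; i < N; i++) particle[i].x = (float)i;

    double_on_device(particle, N);

    for (unsigned int i = 0; i < N; i++) printf("%.0f\n", particle[i].x);
    return 0;
}
```

Checking cudaGetLastError() right after the launch and again through cudaDeviceSynchronize() separates launch configuration problems from faults that occur while the kernel runs, such as dereferencing a bad address stored in d_particle.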

Answer 1: accepted, 2020-05-19 12:02:59
