如何在 NVIDIA GPU 上处理来自 CPU 的 RGB 数据并使用 OpenGL 纹理可视化数据

Question

I'm hoping to create a simple computer vision library in C++/CUDA C++ that allows me to do the following:我希望在 C++/CUDA C++ 中创建一个简单的计算机视觉库，它允许我执行以下操作：

Grab some RGB data from the host memory.从主机内存中获取一些 RGB 数据。 This data will come in a BGR byte array, 8 bits per channel per pixel.该数据将出现在 BGR 字节数组中，每个像素每个通道 8 位。
Process that data in a CUDA kernel.在 CUDA 内核中处理该数据。
Write the output of that kernel back into some host memory.将该内核的输出写回某个主机内存。
Render the output in an OpenGL texture for easy viewing.在 OpenGL 纹理中渲染输出以便于查看。

These functions would go inside a class like so:这些函数将进入一个类，如下所示：


class Processor{
public:
    setInput(const byte* data, int imageWidth, int imageHeight);
    void processData();
    GLuint getInputTexture();
    GLuint getOutputTexture();
    void writeOutputTo(byte* destination);
}

setInput() is going to be called with every frame of a video (hundreds or thousands of images of the same dimensions). setInput()将在视频的每一帧（数百或数千个相同尺寸的图像setInput()中调用。

How can I write the Processor class so that setInput() can efficiently update an instance's internal CUDA array and processData() can synchronize the CUDA array with the OpenGL texture?如何编写 Processor 类，以便setInput()可以有效地更新实例的内部 CUDA 数组，而processData()可以将 CUDA 数组与 OpenGL 纹理同步？

Below is my attempt at implementing such a class, contained in one CUDA C++ file along with a simple test.下面是我实现这样一个类的尝试，包含在一个 CUDA C++ 文件中以及一个简单的测试。 (Requires GLFW and GLAD .) With this implementation, I can provide some input image data, run a CUDA kernel that produces an output image, and visualize both with OpenGL textures. （需要GLFW和GLAD 。）通过此实现，我可以提供一些输入图像数据，运行生成输出图像的 CUDA 内核，并使用 OpenGL 纹理将两者可视化。 But it's extremely inefficient because every time setInput() is called, two OpenGL textures and two CUDA surface objects need to be created.但它的效率极低，因为每次调用setInput()都需要创建两个 OpenGL 纹理和两个 CUDA 表面对象。 And if more than one image is processed, two OpenGL textures and two CUDA surface objects also have to be destroyed.如果处理的图像不止一张，则还必须销毁两个 OpenGL 纹理和两个 CUDA 表面对象。

#include <glad/glad.h>
#include <GLFW/glfw3.h>

#include <cudaGL.h>
#include <cuda_gl_interop.h>

#include <iostream>


/** Macro for checking if CUDA has problems */
#define cudaCheckError() { \
    cudaError_t err = cudaGetLastError(); \
    if(err != cudaSuccess) { \
      printf("Cuda error: %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(1); \
    } \
  }


/*Window dimensions*/
const int windowWidth = 1280, windowHeight = 720;
/*Window address*/
GLFWwindow* currentGLFWWindow = 0;


/**
 * A simple image processing kernel that copies the inverted data from the input surface to the output surface.
 */
__global__ void kernel(cudaSurfaceObject_t input, cudaSurfaceObject_t output, int width, int height) {

    //Get the pixel index
    unsigned int xPx = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int yPx = threadIdx.y + blockIdx.y * blockDim.y;


    //Don't do any computation if this thread is outside of the surface bounds.
    if (xPx >= width || yPx >= height) return;

    //Copy the contents of input to output.
    uchar4 pixel = { 255,128,0,255 };
    //Read a pixel from the input. Disable to default to the flat orange color above
    surf2Dread<uchar4>(&pixel, input, xPx * sizeof(uchar4), yPx, cudaBoundaryModeClamp);

    //Invert the color
    pixel.x = ~pixel.x;
    pixel.y = ~pixel.y;
    pixel.z = ~pixel.z;

    //Write the new pixel color to the 
    surf2Dwrite(pixel, output, xPx * sizeof(uchar4), yPx);
}

class Processor {
public:
    void setInput( uint8_t* const data, int imageWidth, int imageHeight);
    void processData();
    GLuint getInputTexture();
    GLuint getOutputTexture();
    void writeOutputTo(uint8_t* destination);
private:
    /**
    * @brief True if the textures and surfaces are initialized.
    *
    * Prevents memory leaks
    */
    bool surfacesInitialized = false;
    /**
     * @brief The width and height of a texture/surface pair.
     *
     */
    struct ImgDim { int width, height; };
    /**
     * @brief Creates a CUDA surface object, CUDA resource, and OpenGL texture from some data.
     */
    void createTextureSurfacePair(const ImgDim& dimensions, uint8_t* const data, GLuint& textureOut, cudaGraphicsResource_t& graphicsResourceOut, cudaSurfaceObject_t& surfaceOut);
    /**
     * @brief Destroys every CUDA surface object, CUDA resource, and OpenGL texture created by this instance.
     */
    void destroyEverything();
    /**
     * @brief The dimensions of an image and its corresponding texture.
     *
     */
    ImgDim imageInputDimensions, imageOutputDimensions;
    /**
     * @brief A CUDA surface that can be read to, written from, or synchronized with a Mat or
     * OpenGL texture
     *
     */
    cudaSurfaceObject_t d_imageInputTexture = 0, d_imageOutputTexture = 0;
    /**
     * @brief A CUDA resource that's bound to an array in CUDA memory
     */
    cudaGraphicsResource_t d_imageInputGraphicsResource, d_imageOutputGraphicsResource;
    /**
     * @brief A renderable OpenGL texture that is synchronized with the CUDA data
     * @see d_imageInputTexture, d_imageOutputTexture
     */
    GLuint imageInputTexture = 0, imageOutputTexture = 0;
    /** Returns true if nothing can be rendered */
    bool empty() { return imageInputTexture == 0; }

};


void Processor::setInput(uint8_t* const data, int imageWidth, int imageHeight)
{


    //Same-size images don't need texture regeneration, so skip that.
    if (imageHeight == imageInputDimensions.height && imageWidth == imageInputDimensions.width) {


        /*
        Possible shortcut: we know the input is the same size as the texture and CUDA surface object.
        So instead of destroying the surface and texture, why not just overwrite them?

        That's what I try to do in the following block, but because "data" is BGR and the texture
        is RGBA, the channels get all messed up.
        */

        /*
        //Use the input surface's CUDAResourceDesc to gain access to the surface data array
        struct cudaResourceDesc resDesc;
        memset(&resDesc, 0, sizeof(resDesc));
        cudaGetSurfaceObjectResourceDesc(&resDesc, d_imageInputTexture);
        cudaCheckError();

        //Copy the data from the input array to the surface
        cudaMemcpyToArray(resDesc.res.array.array, 0, 0, input.data, imageInputDimensions.width * imageInputDimensions.height * 3, cudaMemcpyHostToDevice);
        cudaCheckError();

        //Set status flags
        surfacesInitialized = true;

        return;
        */
    }


    //Clear everything that originally existed in the texture/surface
    destroyEverything();

    //Get the size of the image and place it here.
    imageInputDimensions.width = imageWidth;
    imageInputDimensions.height = imageHeight;
    imageOutputDimensions.width = imageWidth;
    imageOutputDimensions.height = imageHeight;

    //Create the input surface/texture pair
    createTextureSurfacePair(imageInputDimensions, data, imageInputTexture, d_imageInputGraphicsResource, d_imageInputTexture);

    //Create the output surface/texture pair
    uint8_t* outData = new uint8_t[imageOutputDimensions.width * imageOutputDimensions.height * 3];
    createTextureSurfacePair(imageOutputDimensions, outData, imageOutputTexture, d_imageOutputGraphicsResource, d_imageOutputTexture);
    delete outData;

    //Set status flags
    surfacesInitialized = true;
}

void Processor::processData()
{
    const int threadsPerBlock = 128;

    //Call the algorithm

    //Set the number of blocks to call the kernel with.
    dim3 blocks((unsigned int)ceil((float)imageInputDimensions.width / threadsPerBlock), imageInputDimensions.height);
    kernel <<<blocks, threadsPerBlock >>> (d_imageInputTexture, d_imageOutputTexture, imageInputDimensions.width, imageInputDimensions.height);

    //Sync the surface with the texture
    cudaDeviceSynchronize();
    cudaCheckError();
}

GLuint Processor::getInputTexture()
{
    return imageInputTexture;
}

GLuint Processor::getOutputTexture()
{
    return imageOutputTexture;
}

void Processor::writeOutputTo(uint8_t* destination)
{
    //Haven't figured this out yet
}

void Processor::createTextureSurfacePair(const Processor::ImgDim& dimensions, uint8_t* const data, GLuint& textureOut, cudaGraphicsResource_t& graphicsResourceOut, cudaSurfaceObject_t& surfaceOut) {

    // Create the OpenGL texture that will be displayed with GLAD and GLFW
    glGenTextures(1, &textureOut);
    // Bind to our texture handle
    glBindTexture(GL_TEXTURE_2D, textureOut);
    // Set texture interpolation methods for minification and magnification
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    // Set texture clamping method
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP);
    // Create the texture and its attributes
    glTexImage2D(GL_TEXTURE_2D,     // Type of texture
        0,                // Pyramid level (for mip-mapping) - 0 is the top level
        GL_RGBA,          // Internal color format to convert to
        dimensions.width,            // Image width  i.e. 640 for Kinect in standard mode
        dimensions.height,           // Image height i.e. 480 for Kinect in standard mode
        0,                // Border width in pixels (can either be 1 or 0)
        GL_BGR,          // Input image format (i.e. GL_RGB, GL_RGBA, GL_BGR etc.)
        GL_UNSIGNED_BYTE, // Image data type.
        data);            // The actual image data itself
    //Note that the type of this texture is an RGBA UNSIGNED_BYTE type. When CUDA surfaces
    //are synchronized with OpenGL textures, the surfaces will be of the same type.
    //They won't know or care about their data types though, for they are all just byte arrays
    //at heart. So be careful to ensure that any CUDA kernel that handles a CUDA surface
    //uses it as an appropriate type. You will see that the update_surface kernel (defined 
    //above) treats each pixel as four unsigned bytes along the X-axis: one for red, green, blue,
    //and alpha respectively.

    //Create the CUDA array and texture reference
    cudaArray* bitmap_d;
    //Register the GL texture with the CUDA graphics library. A new cudaGraphicsResource is created, and its address is placed in cudaTextureID.
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html#group__CUDART__OPENGL_1g80d12187ae7590807c7676697d9fe03d
    cudaGraphicsGLRegisterImage(&graphicsResourceOut, textureOut, GL_TEXTURE_2D,
        cudaGraphicsRegisterFlagsNone);
    cudaCheckError();
    //Map graphics resources for access by CUDA.
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__INTEROP.html#group__CUDART__INTEROP_1gad8fbe74d02adefb8e7efb4971ee6322
    cudaGraphicsMapResources(1, &graphicsResourceOut, 0);
    cudaCheckError();
    //Get the location of the array of pixels that was mapped by the previous function and place that address in bitmap_d
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__INTEROP.html#group__CUDART__INTEROP_1g0dd6b5f024dfdcff5c28a08ef9958031
    cudaGraphicsSubResourceGetMappedArray(&bitmap_d, graphicsResourceOut, 0, 0);
    cudaCheckError();
    //Create a CUDA resource descriptor. This is used to get and set attributes of CUDA resources.
    //This one will tell CUDA how we want the bitmap_surface to be configured.
    //Documentation for the struct: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaResourceDesc.html#structcudaResourceDesc
    struct cudaResourceDesc resDesc;
    //Clear it with 0s so that some flags aren't arbitrarily left at 1s
    memset(&resDesc, 0, sizeof(resDesc));
    //Set the resource type to be an array for convenient processing in the CUDA kernel.
    //List of resTypes: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g067b774c0e639817a00a972c8e2c203c
    resDesc.resType = cudaResourceTypeArray;
    //Bind the new descriptor with the bitmap created earlier.
    resDesc.res.array.array = bitmap_d;
    //Create a new CUDA surface ID reference.
    //This is really just an unsigned long long.
    //Docuentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1gbe57cf2ccbe7f9d696f18808dd634c0a
    surfaceOut = 0;
    //Create the surface with the given description. That surface ID is placed in bitmap_surface.
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__SURFACE__OBJECT.html#group__CUDART__SURFACE__OBJECT_1g958899474ab2c5f40d233b524d6c5a01
    cudaCreateSurfaceObject(&surfaceOut, &resDesc);
    cudaCheckError();
}

void Processor::destroyEverything()
{
    if (surfacesInitialized) {

        //Input image CUDA surface
        cudaDestroySurfaceObject(d_imageInputTexture);
        cudaGraphicsUnmapResources(1, &d_imageInputGraphicsResource);
        cudaGraphicsUnregisterResource(d_imageInputGraphicsResource);
        d_imageInputTexture = 0;

        //Output image CUDA surface
        cudaDestroySurfaceObject(d_imageOutputTexture);
        cudaGraphicsUnmapResources(1, &d_imageOutputGraphicsResource);
        cudaGraphicsUnregisterResource(d_imageOutputGraphicsResource);
        d_imageOutputTexture = 0;

        //Input image GL texture
        glDeleteTextures(1, &imageInputTexture);
        imageInputTexture = 0;

        //Output image GL texture
        glDeleteTextures(1, &imageOutputTexture);
        imageOutputTexture = 0;

        surfacesInitialized = false;
    }
}


/** A way to initialize OpenGL with GLFW and GLAD */
void initGL() {

    // Setup window
    if (!glfwInit())
        return;

    // Decide GL+GLSL versions
#if __APPLE__
    // GL 3.2 + GLSL 150
    const char* glsl_version = "#version 150";
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 2);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);  // 3.2+ only
    glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);            // Required on Mac
#else
    // GL 3.0 + GLSL 130
    const char* glsl_version = "#version 130";
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 0);
    //glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);  // 3.2+ only
    //glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);            // 3.0+ only
#endif

    // Create window with graphics context
    currentGLFWWindow = glfwCreateWindow(windowWidth, windowHeight, "Output image (OpenGL + GLFW)", NULL, NULL);
    if (currentGLFWWindow == NULL)
        return;
    glfwMakeContextCurrent(currentGLFWWindow);
    glfwSwapInterval(3); // Enable vsync

    if (!gladLoadGL()) {
        // GLAD failed
        printf( "GLAD failed to initialize :(" );
        return;
    }

    //Change GL settings
    glViewport(0, 0, windowWidth, windowHeight); // use a screen size of WIDTH x HEIGHT

    glMatrixMode(GL_PROJECTION);     // Make a simple 2D projection on the entire window
    glLoadIdentity();
    glOrtho(0.0, windowWidth, windowHeight, 0.0, 0.0, 100.0);

    glMatrixMode(GL_MODELVIEW);    // Set the matrix mode to object modeling

    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClearDepth(0.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // Clear the window
}

/** Renders the textures on the GLFW window and requests GLFW to update */
void showTextures(GLuint top, GLuint bottom) {
    // Clear color and depth buffers
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_MODELVIEW);     // Operate on model-view matrix

    glBindTexture(GL_TEXTURE_2D, top);
    /* Draw top quad */
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2i(0, 0); glVertex2i(0, 0);
    glTexCoord2i(0, 1); glVertex2i(0, windowHeight/2);
    glTexCoord2i(1, 1); glVertex2i(windowWidth, windowHeight / 2);
    glTexCoord2i(1, 0); glVertex2i(windowWidth, 0);
    glEnd();
    glDisable(GL_TEXTURE_2D);
    /* Draw top quad */
    glBindTexture(GL_TEXTURE_2D, bottom);
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2i(0, 0); glVertex2i(0, windowHeight / 2);
    glTexCoord2i(0, 1); glVertex2i(0, windowHeight);
    glTexCoord2i(1, 1); glVertex2i(windowWidth, windowHeight);
    glTexCoord2i(1, 0); glVertex2i(windowWidth, windowHeight / 2);
    glEnd();
    glDisable(GL_TEXTURE_2D);


    glfwSwapBuffers(currentGLFWWindow);
    glfwPollEvents();
}


int main() {
    initGL();

    int imageWidth = windowWidth;
    int imageHeight = windowHeight / 2;

    uint8_t* imageData = new uint8_t[imageWidth * imageHeight * 3];

    Processor p;

    while (!glfwWindowShouldClose(currentGLFWWindow))
    {
        //Process the image here
        p.setInput(imageData, imageWidth, imageHeight);
        p.processData();
        showTextures(p.getInputTexture(), p.getOutputTexture());
    }
}

Answer 1

TL;DR: I can see at least 2 ways forward here, either convert your data to 4 byte pixels (somehow) and use cudaMemcpy2DToArray , or allow the CUDA kernel to take in raw data (instead of using a surface as input). TL;DR：我可以在这里看到至少 2 种方法，要么将数据转换为 4 字节像素（以某种方式）并使用cudaMemcpy2DToArray ，要么允许 CUDA 内核接收原始数据（而不是使用表面作为输入）。 I'll try to demonstrate both, although I don't wish to put in a large effort at polishing this, so really just demonstrating ideas.我将尝试展示两者，尽管我不想花太多精力来完善它，所以实际上只是展示想法。

This answer is working off the code you provided in an edit which is not your latest.此答案正在处理您在不是最新的编辑中提供的代码。 However in the subsequent edits, mainly you seem to be just ripping out OpenCV, which I would normally applaud.然而，在随后的编辑中，主要是你似乎只是在撕掉 OpenCV，我通常会鼓掌。 However, since I've worked off your edit that had OpenCV in it, I've elected to use an OpenCV "test case" of my own.但是，由于我已经完成了包含 OpenCV 的编辑，因此我选择使用我自己的 OpenCV“测试用例”。

Using 4 byte-per-pixel data, and cudaMemcpy2DToArray : This seems to adhere most closely to what you have demonstrated, albeit commented-out.使用每像素 4 字节的数据和cudaMemcpy2DToArray ：这似乎与您所展示的最接近，尽管已注释掉。 The idea is we will access the input data by copying it to the CUDA array (acquired from the interop mechanism) directly.这个想法是我们将通过将输入数据直接复制到 CUDA 数组（从互操作机制获取）来访问输入数据。 As you had previously pointed out, cudaMemcpyToArray is deprecated , so we won't use that.正如您之前指出的， cudaMemcpyToArray已弃用，因此我们不会使用它。 Furthermore, our data format (bytes per pixel) has to match what is in the array.此外，我们的数据格式（每像素字节数）必须与数组中的内容相匹配。 I think there are a number of ways to solve this, depending on your overall pipeline, but the approach I show here isn't efficient, it's just to demonstrate that the method is "workable".我认为有多种方法可以解决此问题，具体取决于您的整体管道，但我在这里展示的方法效率不高，只是为了证明该方法是“可行的”。 If there is a way to use 4 byte per pixel data in your pipeline, however, you may be able to get rid of the "inefficiency" here.但是，如果有一种方法可以在您的管道中使用每像素 4 字节的数据，那么您可能能够摆脱这里的“低效率”。 To use this method, compile the code with the -DUSE_1 switch.要使用此方法，请使用-DUSE_1开关编译代码。
Input of the data through the kernel.通过内核输入数据。 We can skip the inefficiency of the first case by just allowing the kernel to do the 3-byte to 4-byte conversion of data on the fly.我们可以通过只允许内核动态执行 3 字节到 4 字节的数据转换来跳过第一种情况的低效率。 Either way, there is a copy of data from host to device, but this method doesn't require 4 byte per pixel input data.无论哪种方式，都存在从主机到设备的数据副本，但这种方法不需要每像素输入数据 4 字节。

Here is code demonstrating both options:这是演示这两个选项的代码：

//nvcc -arch=sm_35 -o t19 glad/src/glad.c t19.cu -lGL -lGLU -I./glad/include -lglfw -std=c++11 -lopencv_core -lopencv_highgui -lopencv_imgcodecs -Wno-deprecated-gpu-targets
#include <glad/glad.h>
#include <GLFW/glfw3.h>

#include <cudaGL.h>
#include <cuda_gl_interop.h>

#include <iostream>
#include <opencv2/highgui.hpp>


/** Macro for checking if CUDA has problems */
#define cudaCheckError() { \
    cudaError_t err = cudaGetLastError(); \
    if(err != cudaSuccess) { \
      printf("Cuda error: %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(1); \
    } \
  }


/*Window dimensions*/
//const int windowWidth = 1280, windowHeight = 720;
/*Window address*/
GLFWwindow* currentGLFWWindow = 0;


/**
 * A simple image processing kernel that copies the inverted data from the input surface to the output surface.
 */
__global__ void kernel(cudaSurfaceObject_t input, cudaSurfaceObject_t output, int width, int height, uint8_t *data) {

    //Get the pixel index
    unsigned int xPx = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int yPx = threadIdx.y + blockIdx.y * blockDim.y;


    //Don't do any computation if this thread is outside of the surface bounds.
    if (xPx >= width || yPx >= height) return;

    //Copy the contents of input to output.
#ifdef USE_1
    uchar4 pixel = { 255,128,0,255 };
    //Read a pixel from the input. Disable to default to the flat orange color above
    surf2Dread<uchar4>(&pixel, input, xPx * sizeof(uchar4), yPx, cudaBoundaryModeClamp);

#else
    uchar4 pixel;
    pixel.x = data[(xPx+yPx*width)*3 + 0];
    pixel.y = data[(xPx+yPx*width)*3 + 1];
    pixel.z = data[(xPx+yPx*width)*3 + 2];
    pixel.w = 255;
    surf2Dwrite(pixel, input, xPx * sizeof(uchar4), yPx);
#endif
    //Invert the color
    pixel.x = ~pixel.x;
    pixel.y = ~pixel.y;
    pixel.z = ~pixel.z;
    //Write the new pixel color to the 
    surf2Dwrite(pixel, output, xPx * sizeof(uchar4), yPx);
}

class Processor {
public:
    void setInput( uint8_t* const data, int imageWidth, int imageHeight);
    void processData(uint8_t *data, uint8_t *d_data);
    GLuint getInputTexture();
    GLuint getOutputTexture();
    void writeOutputTo(uint8_t* destination);
private:
    /**
    * @brief True if the textures and surfaces are initialized.
    *
    * Prevents memory leaks
    */
    bool surfacesInitialized = false;
    /**
     * @brief The width and height of a texture/surface pair.
     *
     */
    struct ImgDim { int width, height; };
    /**
     * @brief Creates a CUDA surface object, CUDA resource, and OpenGL texture from some data.
     */
    void createTextureSurfacePair(const ImgDim& dimensions, uint8_t* const data, GLuint& textureOut, cudaGraphicsResource_t& graphicsResourceOut, cudaSurfaceObject_t& surfaceOut);
    /**
     * @brief Destroys every CUDA surface object, CUDA resource, and OpenGL texture created by this instance.
     */
    void destroyEverything();
    /**
     * @brief The dimensions of an image and its corresponding texture.
     *
     */
    ImgDim imageInputDimensions, imageOutputDimensions;
    /**
     * @brief A CUDA surface that can be read to, written from, or synchronized with a Mat or
     * OpenGL texture
     *
     */
    cudaSurfaceObject_t d_imageInputTexture = 0, d_imageOutputTexture = 0;
    /**
     * @brief A CUDA resource that's bound to an array in CUDA memory
     */
    cudaGraphicsResource_t d_imageInputGraphicsResource, d_imageOutputGraphicsResource;
    /**
     * @brief A renderable OpenGL texture that is synchronized with the CUDA data
     * @see d_imageInputTexture, d_imageOutputTexture
     */
    GLuint imageInputTexture = 0, imageOutputTexture = 0;
    /** Returns true if nothing can be rendered */
    bool empty() { return imageInputTexture == 0; }

};


void Processor::setInput(uint8_t* const data, int imageWidth, int imageHeight)
{


    //Same-size images don't need texture regeneration, so skip that.
    if (imageHeight == imageInputDimensions.height && imageWidth == imageInputDimensions.width) {


        /*
        Possible shortcut: we know the input is the same size as the texture and CUDA surface object.
        So instead of destroying the surface and texture, why not just overwrite them?

        That's what I try to do in the following block, but because "data" is BGR and the texture
        is RGBA, the channels get all messed up.
        */

        //Use the input surface's CUDAResourceDesc to gain access to the surface data array
#ifdef USE_1
    struct cudaResourceDesc resDesc;
        memset(&resDesc, 0, sizeof(resDesc));
        cudaGetSurfaceObjectResourceDesc(&resDesc, d_imageInputTexture);
        cudaCheckError();
        uint8_t *data4 = new uint8_t[imageInputDimensions.width*imageInputDimensions.height*4];
    for (int i = 0; i < imageInputDimensions.width*imageInputDimensions.height; i++){
        data4[i*4+0] = data[i*3+0];
        data4[i*4+1] = data[i*3+1];
        data4[i*4+2] = data[i*3+2];
        data4[i*4+3] = 255;}
        //Copy the data from the input array to the surface
//        cudaMemcpyToArray(resDesc.res.array.array, 0, 0, data, imageInputDimensions.width * imageInputDimensions.height * 3, cudaMemcpyHostToDevice);
    cudaMemcpy2DToArray(resDesc.res.array.array, 0, 0, data4, imageInputDimensions.width*4, imageInputDimensions.width*4, imageInputDimensions.height, cudaMemcpyHostToDevice);
    cudaCheckError();
        delete[] data4;
#endif
        //Set status flags
        surfacesInitialized = true;

        return;
    }


    //Clear everything that originally existed in the texture/surface
    destroyEverything();

    //Get the size of the image and place it here.
    imageInputDimensions.width = imageWidth;
    imageInputDimensions.height = imageHeight;
    imageOutputDimensions.width = imageWidth;
    imageOutputDimensions.height = imageHeight;

    //Create the input surface/texture pair
    createTextureSurfacePair(imageInputDimensions, data, imageInputTexture, d_imageInputGraphicsResource, d_imageInputTexture);

    //Create the output surface/texture pair
    uint8_t* outData = new uint8_t[imageOutputDimensions.width * imageOutputDimensions.height * 3];
    createTextureSurfacePair(imageOutputDimensions, outData, imageOutputTexture, d_imageOutputGraphicsResource, d_imageOutputTexture);
    delete outData;

    //Set status flags
    surfacesInitialized = true;
}

void Processor::processData(uint8_t *data, uint8_t *d_data)
{
    const int threadsPerBlock = 128;

    //Call the algorithm

    //Set the number of blocks to call the kernel with.
    dim3 blocks((unsigned int)ceil((float)imageInputDimensions.width / threadsPerBlock), imageInputDimensions.height);
#ifndef USE_1
    cudaMemcpy(d_data, data, imageInputDimensions.width*imageInputDimensions.height*3, cudaMemcpyHostToDevice);
#endif
    kernel <<<blocks, threadsPerBlock >>> (d_imageInputTexture, d_imageOutputTexture, imageInputDimensions.width, imageInputDimensions.height, d_data);

    //Sync the surface with the texture
    cudaDeviceSynchronize();
    cudaCheckError();
}

GLuint Processor::getInputTexture()
{
    return imageInputTexture;
}

GLuint Processor::getOutputTexture()
{
    return imageOutputTexture;
}

void Processor::writeOutputTo(uint8_t* destination)
{
    //Haven't figured this out yet
}

void Processor::createTextureSurfacePair(const Processor::ImgDim& dimensions, uint8_t* const data, GLuint& textureOut, cudaGraphicsResource_t& graphicsResourceOut, cudaSurfaceObject_t& surfaceOut) {

    // Create the OpenGL texture that will be displayed with GLAD and GLFW
    glGenTextures(1, &textureOut);
    // Bind to our texture handle
    glBindTexture(GL_TEXTURE_2D, textureOut);
    // Set texture interpolation methods for minification and magnification
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    // Set texture clamping method
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP);
    // Create the texture and its attributes
    glTexImage2D(GL_TEXTURE_2D,     // Type of texture
        0,                // Pyramid level (for mip-mapping) - 0 is the top level
        GL_RGBA,          // Internal color format to convert to
        dimensions.width,            // Image width  i.e. 640 for Kinect in standard mode
        dimensions.height,           // Image height i.e. 480 for Kinect in standard mode
        0,                // Border width in pixels (can either be 1 or 0)
        GL_BGR,          // Input image format (i.e. GL_RGB, GL_RGBA, GL_BGR etc.)
        GL_UNSIGNED_BYTE, // Image data type.
        data);            // The actual image data itself
    //Note that the type of this texture is an RGBA UNSIGNED_BYTE type. When CUDA surfaces
    //are synchronized with OpenGL textures, the surfaces will be of the same type.
    //They won't know or care about their data types though, for they are all just byte arrays
    //at heart. So be careful to ensure that any CUDA kernel that handles a CUDA surface
    //uses it as an appropriate type. You will see that the update_surface kernel (defined 
    //above) treats each pixel as four unsigned bytes along the X-axis: one for red, green, blue,
    //and alpha respectively.

    //Create the CUDA array and texture reference
    cudaArray* bitmap_d;
    //Register the GL texture with the CUDA graphics library. A new cudaGraphicsResource is created, and its address is placed in cudaTextureID.
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html#group__CUDART__OPENGL_1g80d12187ae7590807c7676697d9fe03d
    cudaGraphicsGLRegisterImage(&graphicsResourceOut, textureOut, GL_TEXTURE_2D,
        cudaGraphicsRegisterFlagsNone);
    cudaCheckError();
    //Map graphics resources for access by CUDA.
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__INTEROP.html#group__CUDART__INTEROP_1gad8fbe74d02adefb8e7efb4971ee6322
    cudaGraphicsMapResources(1, &graphicsResourceOut, 0);
    cudaCheckError();
    //Get the location of the array of pixels that was mapped by the previous function and place that address in bitmap_d
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__INTEROP.html#group__CUDART__INTEROP_1g0dd6b5f024dfdcff5c28a08ef9958031
    cudaGraphicsSubResourceGetMappedArray(&bitmap_d, graphicsResourceOut, 0, 0);
    cudaCheckError();
    //Create a CUDA resource descriptor. This is used to get and set attributes of CUDA resources.
    //This one will tell CUDA how we want the bitmap_surface to be configured.
    //Documentation for the struct: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaResourceDesc.html#structcudaResourceDesc
    struct cudaResourceDesc resDesc;
    //Clear it with 0s so that some flags aren't arbitrarily left at 1s
    memset(&resDesc, 0, sizeof(resDesc));
    //Set the resource type to be an array for convenient processing in the CUDA kernel.
    //List of resTypes: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g067b774c0e639817a00a972c8e2c203c
    resDesc.resType = cudaResourceTypeArray;
    //Bind the new descriptor with the bitmap created earlier.
    resDesc.res.array.array = bitmap_d;
    //Create a new CUDA surface ID reference.
    //This is really just an unsigned long long.
    //Docuentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1gbe57cf2ccbe7f9d696f18808dd634c0a
    surfaceOut = 0;
    //Create the surface with the given description. That surface ID is placed in bitmap_surface.
    //Documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__SURFACE__OBJECT.html#group__CUDART__SURFACE__OBJECT_1g958899474ab2c5f40d233b524d6c5a01
    cudaCreateSurfaceObject(&surfaceOut, &resDesc);
    cudaCheckError();
}

void Processor::destroyEverything()
{
    if (surfacesInitialized) {

        //Input image CUDA surface
        cudaDestroySurfaceObject(d_imageInputTexture);
        cudaGraphicsUnmapResources(1, &d_imageInputGraphicsResource);
        cudaGraphicsUnregisterResource(d_imageInputGraphicsResource);
        d_imageInputTexture = 0;

        //Output image CUDA surface
        cudaDestroySurfaceObject(d_imageOutputTexture);
        cudaGraphicsUnmapResources(1, &d_imageOutputGraphicsResource);
        cudaGraphicsUnregisterResource(d_imageOutputGraphicsResource);
        d_imageOutputTexture = 0;

        //Input image GL texture
        glDeleteTextures(1, &imageInputTexture);
        imageInputTexture = 0;

        //Output image GL texture
        glDeleteTextures(1, &imageOutputTexture);
        imageOutputTexture = 0;

        surfacesInitialized = false;
    }
}


/** A way to initialize OpenGL with GLFW and GLAD */
void initGL(int windowWidth, int windowHeight) {

    // Setup window
    if (!glfwInit())
        return;

    // Decide GL+GLSL versions
#if __APPLE__
    // GL 3.2 + GLSL 150
    const char* glsl_version = "#version 150";
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 2);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);  // 3.2+ only
    glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);            // Required on Mac
#else
    // GL 3.0 + GLSL 130
    //const char* glsl_version = "#version 130";
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 0);
    //glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);  // 3.2+ only
    //glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);            // 3.0+ only
#endif

    // Create window with graphics context
    currentGLFWWindow = glfwCreateWindow(windowWidth, windowHeight, "Output image (OpenGL + GLFW)", NULL, NULL);
    if (currentGLFWWindow == NULL)
        return;
    glfwMakeContextCurrent(currentGLFWWindow);
    glfwSwapInterval(3); // Enable vsync

    if (!gladLoadGL()) {
        // GLAD failed
        printf( "GLAD failed to initialize :(" );
        return;
    }

    //Change GL settings
    glViewport(0, 0, windowWidth, windowHeight); // use a screen size of WIDTH x HEIGHT

    glMatrixMode(GL_PROJECTION);     // Make a simple 2D projection on the entire window
    glLoadIdentity();
    glOrtho(0.0, windowWidth, windowHeight, 0.0, 0.0, 100.0);

    glMatrixMode(GL_MODELVIEW);    // Set the matrix mode to object modeling

    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClearDepth(0.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // Clear the window
}

/** Renders the textures on the GLFW window and requests GLFW to update */
void showTextures(GLuint top, GLuint bottom, int windowWidth, int windowHeight) {
    // Clear color and depth buffers
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_MODELVIEW);     // Operate on model-view matrix

    glBindTexture(GL_TEXTURE_2D, top);
    /* Draw top quad */
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2i(0, 0); glVertex2i(0, 0);
    glTexCoord2i(0, 1); glVertex2i(0, windowHeight/2);
    glTexCoord2i(1, 1); glVertex2i(windowWidth, windowHeight / 2);
    glTexCoord2i(1, 0); glVertex2i(windowWidth, 0);
    glEnd();
    glDisable(GL_TEXTURE_2D);
    /* Draw bottom quad */
    glBindTexture(GL_TEXTURE_2D, bottom);
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2i(0, 0); glVertex2i(0, windowHeight / 2);
    glTexCoord2i(0, 1); glVertex2i(0, windowHeight);
    glTexCoord2i(1, 1); glVertex2i(windowWidth, windowHeight);
    glTexCoord2i(1, 0); glVertex2i(windowWidth, windowHeight / 2);
    glEnd();
    glDisable(GL_TEXTURE_2D);


    glfwSwapBuffers(currentGLFWWindow);
    glfwPollEvents();
}


int main() {
    using namespace cv;
    using namespace std;

//    initGL();

    std::string filename = "./lena.pgm";

    Mat image;
    image = imread(filename, CV_LOAD_IMAGE_COLOR);   // Read the file

    if(! image.data )                              // Check for invalid input
    {
        cout <<  "Could not open or find the image" << std::endl ;
        return -1;
    }
    int windoww = 1280;
    int windowh = 720;
    initGL(windoww,windowh);
 
    uint8_t *d_data;
    cudaMalloc(&d_data, image.cols*image.rows*3);
    Processor p;
    for (int i = 0; i < image.cols; i++)
    {
        image.data[i*3+0] = 0;
        image.data[i*3+1] = 0;
        image.data[i*3+2] = 0;
        //Process the image here
        p.setInput(image.data, image.cols, image.rows);
        p.processData(image.data, d_data);
        showTextures(p.getInputTexture(), p.getOutputTexture(), windoww, windowh);
    }
}

Notes:笔记：

The compilation command is given in the comment in the first line编译命令在第一行的注释中给出
I created a "video" of sorts using a single image.我使用单个图像创建了各种“视频”。 The "video" will show the image with a black or white line moving horizontally from left to right in the top pixel row of the image. “视频”将在图像的顶部像素行中显示带有从左到右水平移动的黑线或白线的图像。 The input image is lena.pgm which can be found in the CUDA samples (for example, at /usr/local/cuda-10.1/samples/3_Imaging/SobelFilter/data/lena.pgm ).输入图像是lena.pgm ，它可以在 CUDA 示例中找到（例如，在/usr/local/cuda-10.1/samples/3_Imaging/SobelFilter/data/lena.pgm ）。
It looks to me like you are "sharing" resources between OpenGL and CUDA.在我看来，您正在 OpenGL 和 CUDA 之间“共享”资源。 This doesn't look like the right map/unmap sequence to me, but it seems to be working, and it doesn't seem to be the focus of your question.对我来说，这看起来不像是正确的映射/取消映射序列，但它似乎有效，而且似乎不是您问题的重点。 I haven't spent any time investigating.我没有花时间调查。 I may have missed something.我可能错过了一些东西。
I'm not suggesting this code is defect free or suitable for any particular purpose.我并不是说这段代码没有缺陷或适合任何特定目的。 It is mostly your code.它主要是您的代码。 I've modified it slightly to demonstrate some ideas described in the text.我稍微修改了它以演示文本中描述的一些想法。
There shouldn't be any visual difference in the output whether you compile with -DUSE_1 or not.无论您是否使用-DUSE_1编译，输出中都不应该有任何视觉差异。

Answer 2

This is an useful feature that came across first in ( https://www.3dgep.com/opengl-interoperability-with-cuda/ ), and I have improved upon it to use latest CUDA APIs and flow.这是一个有用的功能，首先在 ( https://www.3dgep.com/opengl-interoperability-with-cuda/ ) 中出现，我对其进行了改进以使用最新的 CUDA API 和流程。 You can refer to these 2 functions in cudammf.您可以在 cudammf 中参考这两个函数。

https://github.com/prabindh/cudammf/blob/5f93358784fcbaae7eea0850424c59d2ed057dab/cuda_postproces.cu#L119 https://github.com/prabindh/cudammf/blob/5f93358784fcbaae7eea0850424c59d2ed057dab/cuda_postproces.cu#L119

https://github.com/prabindh/cudammf/blob/5f93358784fcbaae7eea0850424c59d2ed057dab/decoder3.cpp#L507 https://github.com/prabindh/cudammf/blob/5f93358784fcbaae7eea0850424c59d2ed057dab/decoder3.cpp#L507

Basic working is as below:基本工作如下：

Create a regular GL texture (GLTextureId).创建一个常规的 GL 纹理 (GLTextureId)。 Map it for CUDA access, via cudaGraphicsGLRegisterImage通过cudaGraphicsGLRegisterImage映射它以进行 CUDA 访问
Do some CUDA processing, and result is in a CUDA buffer做一些 CUDA 处理，结果在 CUDA 缓冲区中
USe cudaMemcpyToArray to transfer between the above 2 device memories使用cudaMemcpyToArray在上述 2 个设备内存之间进行传输

If your output is coming from a Nvidia codec output, you should also refer to the AppDecGL sample in the Nvidia Video SDK ( https://developer.nvidia.com/nvidia-video-codec-sdk ).如果您的输出来自 Nvidia 编解码器输出，您还应该参考 Nvidia 视频 SDK ( https://developer.nvidia.com/nvidia-video-codec-sdk ) 中的AppDecGL示例。

如何在 NVIDIA GPU 上处理来自 CPU 的 RGB 数据并使用 OpenGL 纹理可视化数据

问题描述

2 个解决方案

解决方案1
3 已采纳 2020-08-29 19:03:12

解决方案2
1 2020-08-31 12:39:41

如何在 NVIDIA GPU 上处理来自 CPU 的 RGB 数据并使用 OpenGL 纹理可视化数据

问题描述

2 个解决方案

解决方案1 3 已采纳 2020-08-29 19:03:12

解决方案2 1 2020-08-31 12:39:41

解决方案1
3 已采纳 2020-08-29 19:03:12

解决方案2
1 2020-08-31 12:39:41