CUDA in C: How to fix Error 11 with cudaMemcpyAsync

I am currently trying to get a simple multi-GPU program running with CUDA. It copies a large array of dummy data to the GPUs in chunks, each GPU does some math on its chunk, and the resulting array is copied back.

I don't get any errors in the VS2017 output, but the error checks I have set up show that a cudaErrorInvalidValue occurs while copying, both H2D and D2H. Also, when calling cudaFree(), I get a cudaErrorInvalidDevicePointer error.

The output of the program is completely wrong. For testing purposes, the kernel only sets every value of the output array to 50, yet the result is a relatively large negative number that is always the same, no matter what the kernel does.

I have already tried using a pointer that is not part of a struct and is defined right before the cudaMalloc where it is first used. That did not change anything.

This is the function that runs the kernel:

void runKernel(int device, int Repetition, float* h_data, float* h_out, int MemoryPerComputation, int BLOCK_N, int THREAD_N, GPUplan gpuplan, KernelPlan kernelPlan)
{
    cudaSetDevice(device);

    cudaStreamCreate(&gpuplan.stream);

    cudaMemcpyAsync(gpuplan.d_data_ptr, h_data, kernelPlan.Computations * MemoryPerComputation, cudaMemcpyHostToDevice, gpuplan.stream); //asynchronous memory copy of the data array h2d

    cudaError_t x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Memcpy H2D on GPU %i: Error %i\n", device, x);
    }

    dummyKernel<<<BLOCK_N, THREAD_N, 0, gpuplan.stream>>>(gpuplan.d_data_ptr, gpuplan.d_out_ptr, kernelPlan.ComputationsPerThread, kernelPlan.AdditionalComputationThreadCount); //run kernel

    x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("no successfull kernel launch\n Kernel Launch Error %i \n", x);
    }
    else {
        printf("kernel ran.\n");
    }

    cudaMemcpyAsync(h_out, gpuplan.d_out_ptr, kernelPlan.Computations * MemoryPerComputation, cudaMemcpyDeviceToHost, gpuplan.stream); //asynchronous memory copy of the output array d2h

    x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Memcpy D2H on GPU %i: Error %i\n", device, x);
    }

    cudaStreamDestroy(gpuplan.stream);
}

And here is how the structs are defined in "kernel.h":

#ifndef KERNEL_H
#define KERNEL_H

#include "cuda_runtime.h"


//GPU plan
typedef struct
{
    unsigned int Computations; //computations on this GPU

    unsigned int Repetitions; // amount of kernel repetitions

    unsigned int ComputationsPerRepetition; // amount of computations in every kernel execution
    unsigned int AdditionalComputationRepetitionsCount; // amount of repetitions that need to do one additional computation

    unsigned int DataStartingPoint; // tells the kernel launch at which point in the DATA array this GPU has to start working

    float* d_data_ptr;
    float* d_out_ptr;

    cudaStream_t stream;
} GPUplan;

typedef struct
{
    unsigned int Computations;

    unsigned int ComputationsPerThread; // number of computations every thread of this repetition on this GPU has to do
    unsigned int AdditionalComputationThreadCount; // number of threads in this repetition on this GPU that have to do one additional computation

    unsigned int DataStartingPoint; // tells the kernel launch at which point in the DATA array this repetition has to start working

} KernelPlan;

GPUplan planGPUComputation(int DATA_N, int GPU_N, int device, long long MemoryPerComputation, int dataCounter);

KernelPlan planKernelComputation(int GPUDataStartingPoint, int GPUComputationsPerRepetition, int GPUAdditionalComputationRepetitionsCount, int Repetition, int dataCounter, int THREAD_N, int BLOCK_N);

void memAllocation(int device, int MemoryPerComputation, GPUplan gpuPlan, KernelPlan kernelPlan);

void runKernel(int device, int Repetition, float* h_data, float* h_out, int MemoryPerComputation, int BLOCK_N, int THREAD_N, GPUplan gpuplan, KernelPlan kernelPlan);

void memFree(int device, GPUplan gpuPlan);

__global__ void dummyKernel(float *d_data, float *d_out, int d_ComputationsPerThread, int d_AdditionalComputationThreadCount);

#endif

Here is the part of the code that calls runKernel:

int GPU_N;
cudaGetDeviceCount(&GPU_N);

const int BLOCK_N = 32;
const int THREAD_N = 1024;

const int DATA_N = 144000;

const int MemoryPerComputation = sizeof(float);

float *h_data;
float *h_out;

h_data = (float *)malloc(MemoryPerComputation * DATA_N);
h_out = (float *)malloc(MemoryPerComputation * DATA_N);

float* sourcePointer;
float* destPointer;

for (int i = 0; i < maxRepetitionCount; i++) // repeat this enough times so that the GPU with the most repetitions will get through all of them
    {
        //malloc
        for (int j = 0; j < GPU_N; j++)
        {
            if (plan[j].Repetitions >= i) // when this GPU has to do at least i repetitions
            {
                memAllocation(j, MemoryPerComputation, plan[j], kernelPlan[j*MAX_REP_COUNT + i]);
            }
        }

        //kernel launch/memcpy
        for (int j = 0; j < GPU_N; j++)
        {
            if (plan[j].Repetitions >= i) // when this GPU has to do at least i repetitions
            {
                sourcePointer = h_data + kernelPlan[j*MAX_REP_COUNT + i].DataStartingPoint;
                destPointer = h_out + kernelPlan[j*MAX_REP_COUNT + i].DataStartingPoint;

                runKernel(j, i, sourcePointer, destPointer, MemoryPerComputation, BLOCK_N, THREAD_N, plan[j], kernelPlan[j*MAX_REP_COUNT + i]);
            }
        }

        for (int j = 0; j < GPU_N; j++)
        {
            if (plan[j].Repetitions >= i) // when this GPU has to do at least i repetitions
            {
                memFree(j, plan[j]);
            }
        }
    }

I don't think the kernel itself matters here, since the memcpy error already appears before it is even executed.

The expected output is that every element of the output array is 50. Instead, every element is -431602080.0.

The array is a float array.

EDIT: Here is the full code used to reproduce the problem (in addition to kernel.h from above):


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

#include "kernel.h"
#define MAX_GPU_COUNT 32
#define MAX_REP_COUNT 64

__global__ void dummyKernel(float *d_data, float *d_out, int d_ComputationsPerThread, int d_AdditionalComputationThreadCount) {
    int computations = d_ComputationsPerThread; //computations to be performed in this repetition on this GPU
    const int threadID = blockDim.x * blockIdx.x + threadIdx.x; //thread id within GPU Repetition

    if (threadID > d_AdditionalComputationThreadCount) {
        computations++; //check if thread has to do an additional computation
    } 

    for (int i = 0; i < computations; i++) {
        d_out[i * blockDim.x * gridDim.x + threadID] = 50;
    }
}

GPUplan planGPUComputation(int DATA_N, int GPU_N, int device, long long MemoryPerComputation, int dataCounter)
{
    GPUplan plan;
    size_t free, total;

    //computations on GPU #device
    plan.Computations = DATA_N / GPU_N;
    //take into account odd data size for this GPU
    if (DATA_N % GPU_N > device) {
        plan.Computations++;
    }

    plan.DataStartingPoint = dataCounter;

    //get memory information
    cudaSetDevice(device);
    cudaMemGetInfo(&free, &total);

    //calculate Repetitions on this GPU #device
    plan.Repetitions = ((plan.Computations * MemoryPerComputation / free) + 1);
    printf("Repetitions: %i\n", plan.Repetitions);

    if (plan.Repetitions > MAX_REP_COUNT) {
        printf("Repetition count larger than MAX_REP_COUNT %i\n\n", MAX_REP_COUNT);
    }

    //calculate Computations per Repetition
    plan.ComputationsPerRepetition = plan.Computations / plan.Repetitions;

    //calculate how many Repetitions have to do an additional Computation
    plan.AdditionalComputationRepetitionsCount = plan.Computations % plan.Repetitions;

    return plan;
}

KernelPlan planKernelComputation(int GPUDataStartingPoint, int GPUComputationsPerRepetition, int GPUAdditionalComputationRepetitionsCount, int Repetition, int dataCounter, int THREAD_N, int BLOCK_N)
{
    KernelPlan plan;
    //calculate total Calculations in this Repetition
    plan.Computations = GPUComputationsPerRepetition;

    if (GPUAdditionalComputationRepetitionsCount > Repetition) {
        plan.Computations++;
    }

    plan.ComputationsPerThread = plan.Computations / (THREAD_N * BLOCK_N); // Computations every thread has to do (+- 1)
    plan.AdditionalComputationThreadCount = plan.Computations % (THREAD_N * BLOCK_N); // how many threads have to do +1 calculation

    plan.DataStartingPoint = GPUDataStartingPoint + dataCounter;

    return plan;
}

void memAllocation(int device, int MemoryPerComputation, GPUplan gpuPlan, KernelPlan kernelPlan)
{
    cudaSetDevice(device); //select device to allocate memory on
    cudaError_t x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Error Selecting device %i: Error %i\n", device, x);
    }
    cudaMalloc((void**)&(gpuPlan.d_data_ptr), MemoryPerComputation * kernelPlan.Computations); // device data array memory allocation
    x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Malloc 1 on GPU %i: Error %i\n", device, x);
    }

    cudaMalloc((void**)&(gpuPlan.d_out_ptr), MemoryPerComputation * kernelPlan.Computations); // device output array memory allocation
    x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Malloc 2 on GPU %i: Error %i\n", device, x);
    }
}

void runKernel(int device, int Repetition, float* h_data, float* h_out, int MemoryPerComputation, int BLOCK_N, int THREAD_N, GPUplan gpuplan, KernelPlan kernelPlan)
{
    cudaSetDevice(device);

    cudaStreamCreate(&gpuplan.stream);

    cudaMemcpyAsync(gpuplan.d_data_ptr, h_data, kernelPlan.Computations * MemoryPerComputation, cudaMemcpyHostToDevice, gpuplan.stream); //asynchronous memory copy of the data array h2d

    cudaError_t x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Memcpy H2D on GPU %i: Error %i\n", device, x);
    }

    dummyKernel<<<BLOCK_N, THREAD_N, 0, gpuplan.stream>>>(gpuplan.d_data_ptr, gpuplan.d_out_ptr, kernelPlan.ComputationsPerThread, kernelPlan.AdditionalComputationThreadCount); //run kernel

    x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("no successfull kernel launch\n Kernel Launch Error %i \n", x);
    }
    else {
        printf("kernel ran.\n");
    }

    cudaMemcpyAsync(h_out, gpuplan.d_out_ptr, kernelPlan.Computations * MemoryPerComputation, cudaMemcpyDeviceToHost, gpuplan.stream); //asynchronous memory copy of the output array d2h

    x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Memcpy D2H on GPU %i: Error %i\n", device, x);
    }

    cudaStreamDestroy(gpuplan.stream);
}

void memFree(int device, GPUplan gpuPlan)
{
    cudaSetDevice(device); //select device to free memory on
    cudaFree(gpuPlan.d_data_ptr);
    cudaFree(gpuPlan.d_out_ptr);

    cudaError_t x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Memfree on GPU %i: Error %i\n", device, x);
    }
    else {
        printf("memory freed.\n");
    }
    //17 = cudaErrorInvalidDevicePointer
}

int main()
{
    //get device count
    int GPU_N;
    cudaGetDeviceCount(&GPU_N);
    //adjust for device count larger than MAX_GPU_COUNT
    if (GPU_N > MAX_GPU_COUNT)
    {
        GPU_N = MAX_GPU_COUNT;
    }

    printf("GPU count: %i\n", GPU_N);

    //definitions for running the program
    const int BLOCK_N = 32;
    const int THREAD_N = 1024;

    const int DATA_N = 144000;

    const int MemoryPerComputation = sizeof(float);

    ///////////////////////////////////////////////////////////
    //Subdividing input data across GPUs
    //////////////////////////////////////////////

    //GPUplan
    GPUplan plan[MAX_GPU_COUNT];
    int dataCounter = 0;

    for (int i = 0; i < GPU_N; i++)
    {
        plan[i] = planGPUComputation(DATA_N, GPU_N, i, MemoryPerComputation, dataCounter);
        dataCounter += plan[i].Computations;
    }

    //KernelPlan
    KernelPlan kernelPlan[MAX_GPU_COUNT*MAX_REP_COUNT];

    for (int i = 0; i < GPU_N; i++) 
    {
        int GPURepetitions = plan[i].Repetitions;
        dataCounter = plan[i].DataStartingPoint;

        for (int j = 0; j < GPURepetitions; j++)
        {
            kernelPlan[i*MAX_REP_COUNT + j] = planKernelComputation(plan[i].DataStartingPoint, plan[i].ComputationsPerRepetition, plan[i].AdditionalComputationRepetitionsCount, j, dataCounter, THREAD_N, BLOCK_N);

            dataCounter += kernelPlan[i*MAX_REP_COUNT + j].Computations;
        }
    }

    float *h_data;
    float *h_out;

    h_data = (float *)malloc(MemoryPerComputation * DATA_N);
    h_out = (float *)malloc(MemoryPerComputation * DATA_N);

    //generate some input data
    for (int i = 0; i < DATA_N; i++) {
        h_data[i] = 2 * i;
    }

    //get highest repetition count
    int maxRepetitionCount = 0;
    for (int i = 0; i < GPU_N; i++) {
        if (plan[i].Repetitions > maxRepetitionCount) {
            maxRepetitionCount = plan[i].Repetitions;
        }
    }

    printf("maxRepetitionCount: %i\n\n", maxRepetitionCount);

    float* sourcePointer;
    float* destPointer;

    for (int i = 0; i < maxRepetitionCount; i++) // repeat this enough times so that the GPU with the most repetitions will get through all of them
    {
        //malloc
        for (int j = 0; j < GPU_N; j++)
        {
            if (plan[j].Repetitions >= i) // when this GPU has to do at least i repetitions
            {
                memAllocation(j, MemoryPerComputation, plan[j], kernelPlan[j*MAX_REP_COUNT + i]);
            }
        }

        //kernel launch/memcpy
        for (int j = 0; j < GPU_N; j++)
        {
            if (plan[j].Repetitions >= i) // when this GPU has to do at least i repetitions
            {
                sourcePointer = h_data + kernelPlan[j*MAX_REP_COUNT + i].DataStartingPoint;
                destPointer = h_out + kernelPlan[j*MAX_REP_COUNT + i].DataStartingPoint;

                runKernel(j, i, sourcePointer, destPointer, MemoryPerComputation, BLOCK_N, THREAD_N, plan[j], kernelPlan[j*MAX_REP_COUNT + i]);
            }
        }

        for (int j = 0; j < GPU_N; j++)
        {
            if (plan[j].Repetitions >= i) // when this GPU has to do at least i repetitions
            {
                memFree(j, plan[j]);
            }
        }
    }

    //printing expected results and results
    for (int i = 0; i < 50; i++)
    {
        printf("%f\t", h_data[i]);
        printf("%f\n", h_out[i]);
    }


    free(h_data);
    free(h_out);


    getchar();

    return 0;
}

The first problem has nothing to do with CUDA, actually. When you pass a struct by-value to a function in C or C++, a copy of that struct is made for use by the function. Modifications to that struct in the function have no effect on the original struct in the calling environment. This is affecting you in your memAllocation function:

void memAllocation(int device, int MemoryPerComputation, GPUplan gpuPlan, KernelPlan kernelPlan)
                                                                 ^^^^^^^
                                                                 passed by value
{
    cudaSetDevice(device); //select device to allocate memory on
    cudaError_t x = cudaGetLastError();
    if (x != cudaSuccess) {
        printf("Error Selecting device %i: Error %i\n", device, x);
    }
    cudaMalloc((void**)&(gpuPlan.d_data_ptr), MemoryPerComputation * kernelPlan.Computations); // device data array memory allocation
                         ^^^^^^^^^^^^^^^^^^
                         modifying the copy, not the original

This is fairly easily fixable by passing the gpuPlan struct by reference rather than by value. Modify both the prototype in the kernel.h header file and the definition:

void memAllocation(int device, int MemoryPerComputation, GPUplan &gpuPlan, KernelPlan kernelPlan)
                                                                 ^

With that change, the struct is passed by reference, and modifications (such as the setting of the allocated pointers) will show up in the calling environment. This is the proximate reason for the invalid-argument report on the cudaMemcpy operations: the pointers you were passing were unallocated, because your allocations were done on the pointer copies, not the originals.
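
For reference, here is a sketch of the corrected definition. The body is your original, except that the return values of the cudaMalloc calls are checked directly, which is slightly more robust than polling cudaGetLastError afterwards:

void memAllocation(int device, int MemoryPerComputation, GPUplan &gpuPlan, KernelPlan kernelPlan)
{
    cudaSetDevice(device); // select device to allocate memory on

    // gpuPlan is now a reference, so these writes land in the caller's struct
    cudaError_t x = cudaMalloc((void**)&(gpuPlan.d_data_ptr), MemoryPerComputation * kernelPlan.Computations);
    if (x != cudaSuccess) {
        printf("Malloc 1 on GPU %i: Error %i\n", device, x);
    }

    x = cudaMalloc((void**)&(gpuPlan.d_out_ptr), MemoryPerComputation * kernelPlan.Computations);
    if (x != cudaSuccess) {
        printf("Malloc 2 on GPU %i: Error %i\n", device, x);
    }
}

Passing by value remains fine for runKernel and memFree: by the time those are called, the pointer values themselves are valid, and copying a valid pointer (or creating and destroying a stream on a local copy within a single call) is harmless.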

After that change, your code may appear to run correctly. At least when I run it, no errors are displayed and the outputs all appear to be set to 50.

However, there are still problems with this code. If you run it under cuda-memcheck (or turn on the memory checker functionality in Nsight VSE), you should see errors associated with this line of code, which is indexing out of bounds:

__global__ void dummyKernel(float *d_data, float *d_out, int d_ComputationsPerThread, int d_AdditionalComputationThreadCount) {
...
    d_out[i * blockDim.x * gridDim.x + threadID] = 50; //indexing out of bounds

I'm not going to try to sort that out for you. It seems evident to me that your for-loop, coupled with the way you are calculating the index, is going beyond the end of the array. You can follow the methodology discussed here if needed.
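
For what it's worth, a common shape for this kind of fix is a grid-stride loop that is given the number of valid elements and guards every store with a bounds check. This is only a sketch with a hypothetical, simplified signature, not a drop-in replacement for your indexing scheme:

__global__ void boundedKernel(float *d_out, int n) // n = number of valid elements in d_out
{
    // Grid-stride loop: each thread starts at its global index and advances by
    // the total thread count of the grid; the i < n test guards every store.
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
        d_out[i] = 50.0f;
    }
}

In your setup, n would presumably be kernelPlan.Computations for the repetition being launched.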
