
Understanding cudaMemcpyAsync and stream behaviour

I have this simple code, shown below, which does nothing but copy some data from host to device using streams. But after running nvprof I am confused about whether cudaMemcpyAsync is really asynchronous, and about my understanding of streams.

#include <stdio.h>
#include <stdlib.h>

#define NUM_STREAMS 4
cudaError_t memcpyUsingStreams (float           *fDest,
                                float           *fSrc,
                                int             iBytes,
                                cudaMemcpyKind  eDirection,
                                cudaStream_t    *pCuStream)
{
    int             iIndex = 0 ;
    cudaError_t     cuError = cudaSuccess ;
    int             iOffset = 0 ;
    /* per-stream chunk size, in float elements */
    int             iChunk  = (iBytes / NUM_STREAMS) / sizeof (float) ;
    /*Creating streams if not present */
    if (NULL == pCuStream)
    {
            pCuStream = (cudaStream_t *) malloc(NUM_STREAMS * sizeof(cudaStream_t));
            for (iIndex = 0 ; iIndex < NUM_STREAMS; iIndex++)
            {
                    cuError = cudaStreamCreate (&pCuStream[iIndex]) ;
            }
    }

    if (cuError != cudaSuccess)
    {
            cuError = cudaMemcpy (fDest, fSrc, iBytes, eDirection) ;
    }
    else
    {
            for (iIndex = 0 ; iIndex < NUM_STREAMS; iIndex++)
            {
                    iOffset = iIndex * iChunk ;   /* element offset of this stream's chunk */
                    cuError = cudaMemcpyAsync (fDest +  iOffset , fSrc + iOffset, iBytes / NUM_STREAMS , eDirection, pCuStream[iIndex]) ;
            }
    }

    if (NULL != pCuStream)
    {
            for (iIndex = 0 ; iIndex < NUM_STREAMS; iIndex++)
            {
                    cuError = cudaStreamDestroy (pCuStream[iIndex]) ;
            }
            free (pCuStream) ;
    }
    return cuError ;
}


int main()
{
    float *hdata = NULL ;
    float *ddata = NULL ;
    int i, j, k, index ;
    cudaStream_t *abc = NULL ;

    hdata = (float *) malloc (sizeof (float) * 256 * 256 * 256) ;

    cudaMalloc ((void **) &ddata, sizeof (float) * 256 * 256 * 256) ;

    for (i=0 ; i< 256 ; i++)
    {
        for (j=0; j< 256; j++)
        {
            for (k=0; k< 256 ; k++)
            {
                index = (((i * 256) + j) * 256) + k;
                hdata [index] = index ;
            }
        }
    }

    memcpyUsingStreams (ddata, hdata, sizeof (float) * 256 * 256 * 256,  cudaMemcpyHostToDevice, abc) ;

    cudaFree (ddata) ;
    free (hdata) ;

    return 0;
}

The nvprof results are shown below.

    Start  Duration           Grid Size     Block Size     Regs*    SSMem*    DSMem*      Size  Throughput    Device   Context    Stream  Name
 104.35ms   10.38ms                   -              -         -         -         -   16.78MB    1.62GB/s         0         1         7  [CUDA memcpy HtoD]
 114.73ms   10.41ms                   -              -         -         -         -   16.78MB    1.61GB/s         0         1         8  [CUDA memcpy HtoD]
 125.14ms   10.46ms                   -              -         -         -         -   16.78MB    1.60GB/s         0         1         9  [CUDA memcpy HtoD]
 135.61ms   10.39ms                   -              -         -         -         -   16.78MB    1.61GB/s         0         1        10  [CUDA memcpy HtoD]

So, looking at the start times, I don't understand the point of using streams here; the copies look sequential to me. Please help me understand what I am doing wrong. I am using a Tesla K20c card.

The PCI Express link that connects your GPU to the system only has one channel going to the card and one channel coming from the card. That means that, at most, you can have a single cudaMemcpy(Async) operation actually executing at any given time per direction (i.e. one DtoH and one HtoD at most). All other cudaMemcpy(Async) operations will get queued up, waiting for those ahead of them to complete.

You cannot have two operations going in the same direction at the same time. One at a time, per direction.

As @JackOLantern states, the principal benefit of streams is to overlap memcopies and compute, or else to allow multiple kernels to execute concurrently. Streams also allow one DtoH copy to run concurrently with one HtoD copy.
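
To illustrate that overlap pattern, here is a minimal sketch (not your code; the kernel process and the function overlappedPipeline are made up for illustration). It assumes the host buffers hIn and hOut were allocated with cudaHostAlloc and that nElems divides evenly by the stream count. Each stream copies its chunk in, runs a kernel on it, and copies the result back, so copies in opposite directions and kernel work on different chunks can proceed at the same time.

#include <cuda_runtime.h>

#define NUM_STREAMS 4

/* hypothetical kernel, just to have some compute to overlap with */
__global__ void process (float *d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x ;
    if (idx < n) d[idx] *= 2.0f ;
}

/* hIn / hOut must be pinned (cudaHostAlloc) for the copies to overlap */
void overlappedPipeline (float *hIn, float *hOut, float *dBuf, int nElems)
{
    cudaStream_t stream[NUM_STREAMS] ;
    int chunk = nElems / NUM_STREAMS ;   /* assume it divides evenly */

    for (int i = 0 ; i < NUM_STREAMS ; i++)
        cudaStreamCreate (&stream[i]) ;

    for (int i = 0 ; i < NUM_STREAMS ; i++)
    {
        int off = i * chunk ;
        /* the HtoD copy of chunk i can overlap with the kernel working on
           chunk i-1 and with the DtoH copy of chunk i-2 */
        cudaMemcpyAsync (dBuf + off, hIn + off, chunk * sizeof (float),
                         cudaMemcpyHostToDevice, stream[i]) ;
        process<<<(chunk + 255) / 256, 256, 0, stream[i]>>> (dBuf + off, chunk) ;
        cudaMemcpyAsync (hOut + off, dBuf + off, chunk * sizeof (float),
                         cudaMemcpyDeviceToHost, stream[i]) ;
    }

    for (int i = 0 ; i < NUM_STREAMS ; i++)
    {
        cudaStreamSynchronize (stream[i]) ;
        cudaStreamDestroy (stream[i]) ;
    }
}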

Since your program does all HtoD copies, they all get executed serially. Each copy has to wait for the copy ahead of it to complete.

Even getting an HtoD and DtoH memcopy to execute concurrently requires a device with multiple copy engines; you can discover this about your device using deviceQuery.
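
If you prefer to check this programmatically rather than with the deviceQuery sample, the relevant field is asyncEngineCount in cudaDeviceProp; a minimal sketch (device 0 assumed, error checking omitted):

#include <stdio.h>
#include <cuda_runtime.h>

int main (void)
{
    cudaDeviceProp prop ;
    cudaGetDeviceProperties (&prop, 0) ;   /* device 0 assumed */
    printf ("%s: %d copy engine(s)\n", prop.name, prop.asyncEngineCount) ;
    /* 1 -> only one copy can be in flight at a time
       2 -> one HtoD and one DtoH copy can be in flight concurrently */
    return 0 ;
}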

I should also point out that, to enable concurrent behavior, you should use cudaHostAlloc, not malloc, for your host-side buffers.
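
For example, a minimal sketch of that change in your main() (everything else stays the same): allocate hdata as pinned memory and release it with cudaFreeHost instead of free.

float  *hdata = NULL ;
size_t  bytes = sizeof (float) * 256 * 256 * 256 ;

/* pinned (page-locked) host allocation; copies from pageable (malloc'd)
   memory are staged by the runtime and cannot be fully asynchronous */
cudaHostAlloc ((void **) &hdata, bytes, cudaHostAllocDefault) ;

/* ... fill hdata and call memcpyUsingStreams exactly as before ... */

cudaFreeHost (hdata) ;   /* pinned memory is freed with cudaFreeHost, not free() */

Note that once the copies are truly asynchronous, you need to synchronize (e.g. cudaStreamSynchronize or cudaDeviceSynchronize) before freeing the host buffer or reading the results.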

EDIT: The answer above was written with GPUs in view that have at most 2 copy engines (one per direction), and it is still correct for such GPUs. However, some newer GPUs in the Pascal and Volta families have more than 2 copy engines. In that case, with 2 (or more) copy engines per direction, it is theoretically possible to have 2 (or more) transfers "in flight" in that direction. However, this doesn't change the characteristics of the PCIE (or NVLink) bus itself. You are still limited to the available bandwidth, and the exact low-level behavior (whether such transfers appear to be serialized, or appear to run concurrently but take longer due to sharing of bandwidth) should not matter much in most cases.
