简体   繁体   中英

How can I access my constant memory in my kernel?

I can't manage to access the data in my constant memory and I don't know why. Here is a snippet of my code:

#define N 10
__constant__ int constBuf_d[N];

__global__ void foo( int *results, int *constBuf )
{
    int tdx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tdx;

    if( idx < N )
    {
         results[idx] = constBuf[idx];
    }
}

// main routine that executes on the host
int main(int argc, char* argv[])
{
    int *results_h = new int[N];
    int *results_d = NULL;

    cudaMalloc((void **)&results_d, N*sizeof(int));

    int arr[10] = { 16, 2, 77, 40, 12, 3, 5, 3, 6, 6 };

    int *cpnt;
    cudaError_t err = cudaGetSymbolAddress((void **)&cpnt, "constBuf_d");

    if( err )
        cout << "error!";

    cudaMemcpyToSymbol((void**)&cpnt, arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);

    foo <<< 1, 256 >>> ( results_d, cpnt );

    cudaMemcpy(results_h, results_d, N*sizeof(int), cudaMemcpyDeviceToHost);

    for( int i=0; i < N; ++i )
        printf("%i ", results_h[i] );
}

For some reason, I only get "0" in results_h. I'm running CUDA 4.0 with a card with capability 1.1.

Any ideas? Thanks!

If you add proper error checking to your code, you will find that the cudaMemcpyToSymbol is failing with a invalid device symbol error. You either need to pass the symbol by name, or use cudaMemcpy instead. So this:

cudaGetSymbolAddress((void **)&cpnt, "constBuf_d");
cudaMemcpy(cpnt, arr, N*sizeof(int), cudaMemcpyHostToDevice); 

or

cudaMemcpyToSymbol("constBuf_d", arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);

or

cudaMemcpyToSymbol(constBuf_d, arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);

will work. Having said that, passing a constant memory address as an argument to a kernel is the wrong way to use constant memory - it defeats the compiler from generating instructions to access memory via the constant memory cache. Compare the compute capability 1.2 PTX generated for your kernel:

    .entry _Z3fooPiS_ (
        .param .u32 __cudaparm__Z3fooPiS__results,
        .param .u32 __cudaparm__Z3fooPiS__constBuf)
    {
    .reg .u16 %rh<4>;
    .reg .u32 %r<12>;
    .reg .pred %p<3>;
    .loc    16  7   0
$LDWbegin__Z3fooPiS_:
    mov.u16     %rh1, %ctaid.x;
    mov.u16     %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.s32.u16     %r2, %tid.x;
    add.u32     %r3, %r2, %r1;
    mov.u32     %r4, 9;
    setp.gt.s32     %p1, %r3, %r4;
    @%p1 bra    $Lt_0_1026;
    .loc    16  14  0
    mul.lo.u32  %r5, %r3, 4;
    ld.param.u32    %r6, [__cudaparm__Z3fooPiS__constBuf];
    add.u32     %r7, %r6, %r5;
    ld.global.s32   %r8, [%r7+0];
    ld.param.u32    %r9, [__cudaparm__Z3fooPiS__results];
    add.u32     %r10, %r9, %r5;
    st.global.s32   [%r10+0], %r8;
$Lt_0_1026:
    .loc    16  16  0
    exit;
$LDWend__Z3fooPiS_:
    } // _Z3fooPiS_

with this kernel:

__global__ void foo2( int *results )
{
    int tdx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tdx;

    if( idx < N )
    {
         results[idx] = constBuf_d[idx];
    }
}

which produces

    .entry _Z4foo2Pi (
        .param .u32 __cudaparm__Z4foo2Pi_results)
    {
    .reg .u16 %rh<4>;
    .reg .u32 %r<12>;
    .reg .pred %p<3>;
    .loc    16  18  0
$LDWbegin__Z4foo2Pi:
    mov.u16     %rh1, %ctaid.x;
    mov.u16     %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.s32.u16     %r2, %tid.x;
    add.u32     %r3, %r2, %r1;
    mov.u32     %r4, 9;
    setp.gt.s32     %p1, %r3, %r4;
    @%p1 bra    $Lt_1_1026;
    .loc    16  25  0
    mul.lo.u32  %r5, %r3, 4;
    mov.u32     %r6, constBuf_d;
    add.u32     %r7, %r5, %r6;
    ld.const.s32    %r8, [%r7+0];
    ld.param.u32    %r9, [__cudaparm__Z4foo2Pi_results];
    add.u32     %r10, %r9, %r5;
    st.global.s32   [%r10+0], %r8;
$Lt_1_1026:
    .loc    16  27  0
    exit;
$LDWend__Z4foo2Pi:
    } // _Z4foo2Pi

Note that in the second case, constBuf_d is accessed via ld.const.s32 , rather than ld.global.s32 , so that constant memory cache is used.

Excellent answer @talonmies. But I would like to mention that there have been changes in cuda 5. In the function MemcpyToSymbol(), char * argument is no longer supported.

The CUDA 5 release notes read:

** The use of a character string to indicate a device symbol, which was possible with certain API functions, is no longer supported. Instead, the symbol should be used directly.

Instead the copy have to be made to the constant memory as follows :

cudaMemcpyToSymbol( dev_x, x, N * sizeof(float) );

In this case "dev_x" is pointer to constant memory and "x" is pointer to host memory which needs to be copied into dev_x.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM