How can I access my constant memory in my kernel?

I can't access the data in my constant memory and I don't know why. Here is a snippet of my code:

#define N 10
__constant__ int constBuf_d[N];

__global__ void foo( int *results, int *constBuf )
{
    int tdx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tdx;

    if( idx < N )
    {
         results[idx] = constBuf[idx];
    }
}

// main routine that executes on the host
int main(int argc, char* argv[])
{
    int *results_h = new int[N];
    int *results_d = NULL;

    cudaMalloc((void **)&results_d, N*sizeof(int));

    int arr[10] = { 16, 2, 77, 40, 12, 3, 5, 3, 6, 6 };

    int *cpnt;
    cudaError_t err = cudaGetSymbolAddress((void **)&cpnt, "constBuf_d");

    if( err )
        cout << "error!";

    cudaMemcpyToSymbol((void**)&cpnt, arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);

    foo <<< 1, 256 >>> ( results_d, cpnt );

    cudaMemcpy(results_h, results_d, N*sizeof(int), cudaMemcpyDeviceToHost);

    for( int i=0; i < N; ++i )
        printf("%i ", results_h[i] );
}

For some reason, I only get "0" in results_h. I'm running CUDA 4.0 on a card with compute capability 1.1.

Any ideas? Thanks!

If you add proper error checking to your code, you will find that the cudaMemcpyToSymbol call is failing with an invalid device symbol error. You need to either pass the symbol by name or use cudaMemcpy instead. So this:

cudaGetSymbolAddress((void **)&cpnt, "constBuf_d");
cudaMemcpy(cpnt, arr, N*sizeof(int), cudaMemcpyHostToDevice); 

or

cudaMemcpyToSymbol("constBuf_d", arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);

or

cudaMemcpyToSymbol(constBuf_d, arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);

will work. Having said that, passing a constant memory address as an argument to a kernel is the wrong way to use constant memory: it prevents the compiler from generating instructions that access the data through the constant memory cache. Compare the compute capability 1.2 PTX generated for your kernel:

    .entry _Z3fooPiS_ (
        .param .u32 __cudaparm__Z3fooPiS__results,
        .param .u32 __cudaparm__Z3fooPiS__constBuf)
    {
    .reg .u16 %rh<4>;
    .reg .u32 %r<12>;
    .reg .pred %p<3>;
    .loc    16  7   0
$LDWbegin__Z3fooPiS_:
    mov.u16     %rh1, %ctaid.x;
    mov.u16     %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.s32.u16     %r2, %tid.x;
    add.u32     %r3, %r2, %r1;
    mov.u32     %r4, 9;
    setp.gt.s32     %p1, %r3, %r4;
    @%p1 bra    $Lt_0_1026;
    .loc    16  14  0
    mul.lo.u32  %r5, %r3, 4;
    ld.param.u32    %r6, [__cudaparm__Z3fooPiS__constBuf];
    add.u32     %r7, %r6, %r5;
    ld.global.s32   %r8, [%r7+0];
    ld.param.u32    %r9, [__cudaparm__Z3fooPiS__results];
    add.u32     %r10, %r9, %r5;
    st.global.s32   [%r10+0], %r8;
$Lt_0_1026:
    .loc    16  16  0
    exit;
$LDWend__Z3fooPiS_:
    } // _Z3fooPiS_

with this kernel:

__global__ void foo2( int *results )
{
    int tdx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tdx;

    if( idx < N )
    {
         results[idx] = constBuf_d[idx];
    }
}

which produces

    .entry _Z4foo2Pi (
        .param .u32 __cudaparm__Z4foo2Pi_results)
    {
    .reg .u16 %rh<4>;
    .reg .u32 %r<12>;
    .reg .pred %p<3>;
    .loc    16  18  0
$LDWbegin__Z4foo2Pi:
    mov.u16     %rh1, %ctaid.x;
    mov.u16     %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.s32.u16     %r2, %tid.x;
    add.u32     %r3, %r2, %r1;
    mov.u32     %r4, 9;
    setp.gt.s32     %p1, %r3, %r4;
    @%p1 bra    $Lt_1_1026;
    .loc    16  25  0
    mul.lo.u32  %r5, %r3, 4;
    mov.u32     %r6, constBuf_d;
    add.u32     %r7, %r5, %r6;
    ld.const.s32    %r8, [%r7+0];
    ld.param.u32    %r9, [__cudaparm__Z4foo2Pi_results];
    add.u32     %r10, %r9, %r5;
    st.global.s32   [%r10+0], %r8;
$Lt_1_1026:
    .loc    16  27  0
    exit;
$LDWend__Z4foo2Pi:
    } // _Z4foo2Pi

Note that in the second case, constBuf_d is accessed via ld.const.s32 rather than ld.global.s32, so the constant memory cache is used.
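Putting the pieces of this answer together, the following is a minimal sketch of the corrected program, not the asker's exact code. It copies to the symbol directly and reads constBuf_d inside the kernel. The CUDA_CHECK macro is a hypothetical helper introduced here for illustration, and cudaDeviceSynchronize is assumed to be available (CUDA 4.0 or later).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 10

// Hypothetical helper (not from the original post): abort on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__constant__ int constBuf_d[N];

// Reads the __constant__ array directly, as foo2 does, so the compiler
// can emit constant-space loads instead of global loads.
__global__ void foo2(int *results)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        results[idx] = constBuf_d[idx];
}

int main()
{
    int arr[N] = { 16, 2, 77, 40, 12, 3, 5, 3, 6, 6 };
    int results_h[N];
    int *results_d = NULL;

    CUDA_CHECK(cudaMalloc((void **)&results_d, N * sizeof(int)));

    // Pass the symbol itself, not the address of a host pointer variable.
    CUDA_CHECK(cudaMemcpyToSymbol(constBuf_d, arr, N * sizeof(int), 0,
                                  cudaMemcpyHostToDevice));

    foo2<<<1, 256>>>(results_d);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures during execution

    CUDA_CHECK(cudaMemcpy(results_h, results_d, N * sizeof(int),
                          cudaMemcpyDeviceToHost));

    for (int i = 0; i < N; ++i)
        printf("%i ", results_h[i]);
    printf("\n");

    CUDA_CHECK(cudaFree(results_d));
    return 0;
}
```

With a check like this wrapped around the original cudaMemcpyToSymbol((void**)&cpnt, ...) call, the invalid device symbol error would be reported at the failing line instead of showing up later as zeros in results_h.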

Excellent answer @talonmies. But I would like to mention that this changed in CUDA 5: cudaMemcpyToSymbol() no longer accepts a char * argument naming the symbol.

The CUDA 5 release notes read:

The use of a character string to indicate a device symbol, which was possible with certain API functions, is no longer supported. Instead, the symbol should be used directly.

Instead, the copy to constant memory has to be made as follows:

cudaMemcpyToSymbol( dev_x, x, N * sizeof(float) );

In this case, dev_x is the __constant__ symbol itself (passed directly, not as a string) and x is a pointer to the host memory that is copied into dev_x.
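As a rough illustration of the symbol-based call, here is a small round trip that copies host data into a __constant__ array and reads it back with cudaMemcpyFromSymbol. The array size, the values, and the back buffer are assumptions made for this sketch; only dev_x and x come from the line above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // illustrative size, not from the original post

__constant__ float dev_x[N];  // symbol name taken from the answer above

int main()
{
    float x[N]    = { 1.0f, 2.0f, 3.0f, 4.0f };  // illustrative host data
    float back[N] = { 0.0f };                    // hypothetical read-back buffer

    // The symbol is passed directly; offset and copy direction use the
    // C++ API defaults (0 and cudaMemcpyHostToDevice).
    cudaError_t err = cudaMemcpyToSymbol(dev_x, x, N * sizeof(float));

    // Read the data back through the same symbol to confirm the copy.
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(back, dev_x, N * sizeof(float));

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < N; ++i)
        printf("%f ", back[i]);
    printf("\n");
    return 0;
}
```

Passing "dev_x" as a string in either call would fail on CUDA 5 and later, which is the change the release notes describe.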
