
Simulating dynamic memory allocation in OpenCL

I ran into a problem that is driving me crazy. I need to simulate dynamic memory allocation in an OpenCL kernel. To that end, I have the following malloc function defined in a *.cl file:

 __global void* malloc(size_t size, __global byte *heap, __global uint *next)
{
  uint index = atomic_add(next, size);
  return heap+index;
}

In the host program, I dynamically allocate a large array of type cl_uchar for this virtual heap as follows:

int MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL = 1000;
cl_uchar* heap = new cl_uchar[1000000];
cl_uint  *next  =  new cl_uint;
*next = 0;
cl_uint * test_result =
        new cl_uint[MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL];
cl_mem memory[3]= { 0, 0, 0};
cl_int error;

memory[0] = clCreateBuffer(GPU_context,
CL_MEM_READ_WRITE, sizeof(cl_uchar) * MAX_HEAP_SIZE, NULL,
NULL);

memory[1] = clCreateBuffer(GPU_context, CL_MEM_READ_WRITE, sizeof(cl_uint), NULL,
        &error);

memory[2] = clCreateBuffer(GPU_context, CL_MEM_READ_WRITE,
            sizeof(cl_uint) * MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL, NULL,
            &error);
clEnqueueWriteBuffer(command_queue, memory[0], CL_TRUE, 0,
        sizeof(cl_uchar) * MAX_HEAP_SIZE, heap, 0, NULL, NULL);

clEnqueueWriteBuffer(command_queue, memory[1], CL_TRUE, 0, sizeof(cl_uint),
        next, 0, NULL, NULL);
error = 0;
error |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &memory[0]);
error |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memory[1]);
error |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memory[2]);

size_t globalWorkSize[1] = { MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL };
size_t localWorkSize[1] = { 1 };


error = 0;
error = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
        globalWorkSize, localWorkSize, 0, NULL, NULL);

I also have the following kernel:

__kernel void packet_routing2(__global byte* heap_, __global uint* next, __global uint* test_result){

    int gid = get_global_id(0);

    __global uint* xx[100];

    for (int i = 0; i < 100; i++)
    {
        // Each iteration takes sizeof(uint) bytes from the shared heap.
        xx[i] = (__global uint*) malloc(sizeof(uint), heap_, next);
        *xx[i] = i * gid;
    }

    test_result[gid] = *(xx[0]);
}

I encountered the following error when I ran the program:

" %27 = load i32 addrspace(1)* %26, align 4, !tbaa !17
Illegal pointer which is not from a valid memory space.
Aborting..."

Could you please help me fix this issue? I also found that if xx has only 10 elements instead of 100, the code works fine.

Edit: The simplest solution is to add a padding value to 'size' before calling malloc, so that every struct type smaller than the maximum padding gets the alignment it needs.

0 = struct footprint in memory

* = heap

_ = padding

***000_____*****0000____****0_______****00000___*****0000000_*******00______***
      |
      v
 save this unused padded memory space in its thread to use later.

It is important that the first (starting) address satisfies the maximum alignment requirement. If there is a 256-byte struct, it should start at an address that is a multiple of 256.
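The padding idea above can be sketched in plain C as a host-side simulation. This is only a minimal sketch under stated assumptions: `MAX_ALIGN`, `pad_size`, `is_max_aligned` and `padded_malloc` are hypothetical names, 64 is an arbitrary choice for the largest alignment, and C11 `atomic_fetch_add` stands in for OpenCL's `atomic_add`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the largest alignment any allocated type needs.
   Must be a power of two for the bit mask below to work. */
#define MAX_ALIGN 64u

/* Round size up to the next multiple of MAX_ALIGN. */
static size_t pad_size(size_t size)
{
    return (size + (MAX_ALIGN - 1u)) & ~(size_t)(MAX_ALIGN - 1u);
}

/* True when the address itself (not an offset) is a multiple of MAX_ALIGN. */
static int is_max_aligned(const void *p)
{
    return ((uintptr_t)p % MAX_ALIGN) == 0;
}

/* Bump allocator: atomically reserve a padded block and return its address.
   atomic_fetch_add plays the role of OpenCL's atomic_add in this sketch. */
static void *padded_malloc(size_t size, unsigned char *heap, atomic_uint *next)
{
    assert(is_max_aligned(heap)); /* the starting address must be aligned too */
    unsigned int index = atomic_fetch_add(next, (unsigned int)pad_size(size));
    return heap + index;
}
```

Because every request is rounded up to a multiple of `MAX_ALIGN` and the heap base is aligned, every returned address is aligned as well. The cost is the wasted space shown as `_` in the diagram.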

struct size (bytes)   malloc size (bytes)   required 'next' value (address, not offset)
  1-4                     4                   multiple of 4
  5-8                     8                   multiple of 8
  9-16                    16                  multiple of 16
  17-32                   32                  multiple of 32
  33-64                   64                  multiple of 64
If there is a 64-byte struct, even an int now needs a 64-byte malloc size. You could save the leftover space locally in each thread and use those otherwise unused areas later.
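One possible way to reuse the leftover padding per thread is sketched below in plain C. This is an illustration only: `spare_t`, `spare_save` and `spare_take` are hypothetical names, and the sketch assumes each thread keeps a single record of the unused tail of its most recent padded block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread record of the unused tail of a padded block. */
typedef struct {
    unsigned char *ptr;  /* first unused byte, or NULL if nothing is saved */
    size_t bytes;        /* number of unused bytes remaining */
} spare_t;

/* Remember the padding left after `used` bytes of a `padded`-byte block. */
static void spare_save(spare_t *s, unsigned char *block, size_t used, size_t padded)
{
    assert(used <= padded);
    s->ptr = block + used;
    s->bytes = padded - used;
}

/* Try to serve a small request from the saved area. Returns NULL when the
   request does not fit. `align` must be a power of two. */
static void *spare_take(spare_t *s, size_t size, size_t align)
{
    if (s->ptr == NULL)
        return NULL;
    size_t misalign = (uintptr_t)s->ptr & (align - 1);
    size_t skip = misalign ? align - misalign : 0; /* bytes lost to alignment */
    if (skip + size > s->bytes)
        return NULL;
    unsigned char *p = s->ptr + skip;
    s->ptr = p + size;
    s->bytes -= skip + size;
    return p;
}
```

A thread would call `spare_take` first and fall back to the shared heap only when it returns NULL, which avoids an atomic operation for small allocations that fit in the saved padding.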

This way it does not produce alignment errors, and it probably also runs faster on devices that do not raise such errors.

Also, float3 natively needs 16 bytes.
