Do variables or arrays declared in an OpenCL kernel cost memory?

I am trying to run the following OpenCL code. In the kernel function, I define an array int arr[1000] = {0};

kernel void test() {
    int arr[1000] = {0};
}

Then I will create N threads to run the kernel.

cl::CommandQueue cmdQueue;
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange); // kernel here is the one running test()

My question is: since OpenCL runs the threads in parallel, does that mean the peak memory usage will be N * 1000 * sizeof(int)?

This is not the way to OpenCL (yes, that's what I meant :).

The kernel function operates on kernel operands passed in from the host (CPU), so you'd allocate your array from the host using clCreateBuffer and set the argument using clSetKernelArg. Your kernel does not declare or allocate the device memory; it simply receives it as a __global argument. The buffer of 1000 ints is allocated once, by clCreateBuffer. When you then run the kernel using clEnqueueNDRangeKernel with a global work size of 1000, the OpenCL implementation launches 1000 work-items, and each of them typically works on one of those ints, selected with get_global_id(0).

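If, on the other hand, you meant to allocate 1000 ints per work-item (device thread), your calculation is right as an upper bound (yes, they cost memory), but it probably won't work well. An array declared inside a kernel without an address-space qualifier lives in private memory, not local memory: private memory belongs to a single work-item and is typically backed by registers, while local memory is shared by a work-group and declared with __local. Both are severely limited (see here on how to check this for your device), so an array of this size will most likely spill to much slower off-chip memory. In practice only the work-items resident on the device at the same time need their own copy, so the actual peak can be well below N * 1000 * sizeof(int) when N is large.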
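To put numbers on the per-work-item case, here is a small arithmetic sketch in plain C++ (no OpenCL needed). The function names and the figure of 16384 simultaneously resident work-items are made up for illustration; the real number depends on the device and on how the compiler places the array.

```cpp
#include <cstdint>

// Bytes needed if `workItems` work-items each hold their own array of
// `elemsPerItem` OpenCL ints (an OpenCL int is always 4 bytes).
constexpr std::uint64_t privateArrayBytes(std::uint64_t workItems,
                                          std::uint64_t elemsPerItem) {
    return workItems * elemsPerItem * 4u;
}

// Estimate when only `resident` work-items are in flight at once:
// at most min(n, resident) copies of the array exist at any moment.
constexpr std::uint64_t peakPrivateBytes(std::uint64_t n,
                                         std::uint64_t resident,
                                         std::uint64_t elemsPerItem) {
    return privateArrayBytes(n < resident ? n : resident, elemsPerItem);
}
```

With N = 1,000,000 the upper bound privateArrayBytes(1000000, 1000) is 4,000,000,000 bytes, while peakPrivateBytes(1000000, 16384, 1000) gives 65,536,000 bytes under the assumed residency limit.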

