OpenCL multiple GPU integral - segfault when changing global size from 32 to 64

I have written a kernel that computes an integral over a given range and adds the result to a variable (one variable per GPU). On the host I add these partial results together to get the value of the integral (in this case x^2 dx). For the range 0 to 8 the result is 170.666..., which is correct. It worked with global work sizes of 1, 2, 4, 8, 16 and 32, but when I change the global work size to 64 I get a segmentation fault. I have 1 platform containing 8 GPU cards; each device has its own queue, context and kernel.

Here are a few lines from my code:

I'm creating 3 buffers which I later pass to the kernel (the third one is for reading the result).

cl_mem bufferA[deviceNumber];
cl_mem bufferB[deviceNumber];
cl_mem bufferC[deviceNumber];
for(int i = 0; i< deviceNumber; i++){
    bufferA[i] = clCreateBuffer(context[i], CL_MEM_READ_WRITE , sizeof(float) * global_size, NULL, &error);
    bufferB[i] = clCreateBuffer(context[i], CL_MEM_READ_ONLY , sizeof(float) * global_size, NULL, &error);
    bufferC[i] = clCreateBuffer(context[i], CL_MEM_WRITE_ONLY, sizeof(float) * global_size, NULL, &error);
}

Later, after creating and building the program, I set the kernel arguments.

for(int i = 0; i< deviceNumber; i++){
    error = clSetKernelArg(kernel[i], 0, sizeof(cl_mem), (void*)&bufferA[i]);
    error = clSetKernelArg(kernel[i], 1, sizeof(cl_mem), (void*)&bufferB[i]);
    error = clSetKernelArg(kernel[i], 2, sizeof(cl_mem), (void*)&bufferC[i]);
    error = clSetKernelArg(kernel[i], 3, sizeof(cl_int), (void*)&global_size);
}

Then I enqueue the buffer writes:

for(int i = 0; i< deviceNumber; i++){
    error = clEnqueueWriteBuffer(commandQueue[i], bufferA[i], CL_FALSE, 0, sizeof(float) * global_size, a, 0, NULL, NULL);
    error = clEnqueueWriteBuffer(commandQueue[i], bufferB[i], CL_FALSE, 0, sizeof(float) * global_size, &b[i], 0, NULL, NULL);
}

Then I enqueue the kernels to do their jobs:

for(int i = 0; i< deviceNumber; i++){
    error = clEnqueueNDRangeKernel(commandQueue[i], kernel[i], 1, NULL, &global_size, &localWorkSize, 0, NULL, NULL);
}

And finally, the place where the segfault occurs:

for(int i = 0; i< deviceNumber; i++){
    std::cout<<"clEnqueueReadBuffer: "<<error<<std::endl;
    error = clEnqueueReadBuffer(commandQueue[i], bufferC[i], CL_TRUE, 0, sizeof(float) * global_size, &c[i], 0, NULL, NULL);
}

I print the error codes everywhere and they are all 0. The last thing I see in the output is the string printed just before clEnqueueReadBuffer, so it crashes in the first iteration of the for loop.

Does anyone know what I am missing here?

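Found the fault! It was the size argument in the clEnqueueReadBuffer call:

sizeof(float) * global_size

That size was fine when I was reading back a vector whose length was equal to global_size, but after refactoring the code to compute the integral I completely forgot about it. If you read one variable per device, the size passed to clEnqueueReadBuffer must be just sizeof(type), nothing more; otherwise the call copies global_size values into host memory that only has room for one. Hope it helps someone.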
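To make the size issue concrete, here is a minimal sketch that runs without OpenCL. It assumes each device leaves its partial result in element 0 of its result buffer, and it uses a hypothetical helper, mock_read_buffer, as a stand-in for clEnqueueReadBuffer (it simply copies the requested number of bytes, just as the real call does). The names sum_device_results, buffer_c and device_number are illustrative and not taken from the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for clEnqueueReadBuffer: like the real call, it copies
// exactly `size` bytes into `dst` and cannot know how much room `dst` has.
inline void mock_read_buffer(const std::vector<float>& device_buffer,
                             std::size_t size, void* dst) {
    assert(size <= device_buffer.size() * sizeof(float));
    std::memcpy(dst, device_buffer.data(), size);
}

// Simulates the host side: "device" i holds the integral of x^2 over
// [i, i + 1] in element 0 of its result buffer (an assumption of this sketch).
// The host reads one float per device and adds the partial results.
inline float sum_device_results(int device_number, std::size_t global_size) {
    std::vector<float> c(device_number, 0.0f);  // one result slot per device
    for (int i = 0; i < device_number; ++i) {
        std::vector<float> buffer_c(global_size, 0.0f);  // simulated device buffer
        const float lo = static_cast<float>(i);
        const float hi = lo + 1.0f;
        buffer_c[0] = (hi * hi * hi - lo * lo * lo) / 3.0f;

        // Correct size: sizeof(float), not sizeof(float) * global_size,
        // because only one float is read into &c[i].
        mock_read_buffer(buffer_c, sizeof(float), &c[i]);
    }
    float total = 0.0f;
    for (float partial : c) total += partial;
    return total;
}
```

Calling sum_device_results(8, 64) adds the eight partial integrals over 0 to 8 regardless of the global size, because the number of bytes copied to the host no longer depends on global_size. In the real program the equivalent change would be to pass sizeof(float) as the size argument of clEnqueueReadBuffer.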

