
OpenCL Kernel Executes Slower Than Single Thread

All, I wrote a very simple OpenCL kernel which transforms an RGB image to gray scale using simple averaging.

Some background:

  1. The image is stored in mapped memory as a 24 bpp, non-padded memory block
  2. The output array is stored in pinned memory (mapped with clEnqueueMapBuffer ) and is 8 bpp
  3. There are two buffers allocated on the device ( clCreateBuffer ): one is read-only (which we clEnqueueWriteBuffer into before the kernel starts) and the other is write-only (which we clEnqueueReadBuffer from after the kernel finishes)

I am running this on a 1280x960 image. A serial version of the algorithm averages 60 ms; the OpenCL kernel averages 200 ms!!! I'm doing something wrong, but I have no idea how to proceed or what to optimize. (Timing my reads/writes without a kernel call, the algorithm runs in 15 ms.)
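For reference, the 60 ms serial baseline mentioned above amounts to something like the following (a sketch, not the poster's actual code; it assumes the same packed 24 bpp input and 8 bpp output layout, with `in_step`/`out_step` playing the role of `widthStep`):

```c
#include <stddef.h>

/* Serial RGB -> gray by simple averaging: one output byte per 3-byte pixel.
   in_step/out_step are the row strides in bytes. */
static void rgb_to_gray_serial(const unsigned char *in, size_t in_step,
                               unsigned char *out, size_t out_step,
                               int width, int height)
{
    for (int y = 0; y < height; ++y) {
        const unsigned char *in_row  = in  + (size_t)y * in_step;
        unsigned char       *out_row = out + (size_t)y * out_step;
        for (int x = 0; x < width; ++x) {
            const unsigned char *p = in_row + 3 * x;
            out_row[x] = (unsigned char)((p[0] + p[1] + p[2]) / 3);
        }
    }
}
```

At 1280x960 this is only ~1.2 million pixels of trivial arithmetic, so any per-launch or per-argument overhead on the OpenCL side weighs heavily against it.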

I am attaching the kernel setup (sizes and arguments) as well as the kernel.


EDIT: So I wrote an even dumber kernel that does no global memory accesses inside it, and it still took 150 ms... This is still ridiculously slow. I thought maybe I was messing up the global memory reads and they had to be 4-byte aligned or something? Nope...

Edit 2: Removing all the parameters from my kernel gave me a significant speedup... I'm confused: I thought that since I clEnqueueWriteBuffer myself, the kernel should be doing no host->device or device->host memory transfer...

Edit 3: Figured it out, but I still don't understand why. If anyone could explain it I would be glad to award them the correct answer. The problem was passing the custom structs by value. It looks like I'll need to allocate a global memory location for them and pass their cl_mem s.
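A sketch of that workaround: instead of `clSetKernelArg(..., sizeof(OpenCLImage), &h_input)`, copy the struct into a small device buffer and pass its `cl_mem`. This is illustrative only; `ctx` stands in for the program's actual context and error checking is elided:

```c
// Host side: put the header struct in global memory instead of passing it by value.
cl_int err;
cl_mem d_input_desc = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(OpenCLImage), &h_input, &err);
err = clSetKernelArg(handles->current_kernel, 0, sizeof(cl_mem), &d_input_desc);

// Kernel side: the argument becomes a pointer into global memory:
// __kernel void opencl_rgb_kernel(__global const OpenCLImage *input,
//                                 __global unsigned char *input_data, ...)
```

Since the struct is only four ints, an even simpler alternative is to pass `width`, `widthStep`, `height`, and `channels` as four separate scalar kernel arguments and drop the struct entirely.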


Kernel Call:

//Copy input to device
result = clEnqueueWriteBuffer(handles->queue, d_input_data, CL_TRUE, 0, h_input.widthStep*h_input.height, (void *)input->imageData, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to write to input buffer on device!")) return 0;

//Set kernel arguments
result = clSetKernelArg(handles->current_kernel, 0, sizeof(OpenCLImage), (void *)&h_input);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set input struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 1, sizeof(cl_mem), (void *)&d_input_data);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set input data.")) return 0;
result = clSetKernelArg(handles->current_kernel, 2, sizeof(OpenCLImage), (void *)&h_output);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set output struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 3, sizeof(cl_mem), (void *)&d_output_data);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set output data.")) return 0;

//Determine run parameters
global_work_size[0] = input->width;//(unsigned int)((input->width / (float)local_work_size[0]) + 0.5);
global_work_size[1] = input->height;//(unsigned int)((input->height/ (float)local_work_size[1]) + 0.5);

printf("Global Work Group Size: %zu %zu\n", global_work_size[0], global_work_size[1]);

//Call kernel
result = clEnqueueNDRangeKernel(handles->queue, handles->current_kernel, 2, 0, global_work_size, local_work_size, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to run kernel!")) return 0;

result = clFinish(handles->queue);
if(check_result(result, "opencl_rgb_to_gray", "Failed to finish!")) return 0;

//Copy output
result = clEnqueueReadBuffer(handles->queue, d_output_data, CL_TRUE, 0, h_output.widthStep*h_output.height, (void *)output->imageData, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to write to output buffer on device!")) return 0;

Kernel:

typedef struct OpenCLImage_t
{
    int width;
    int widthStep;
    int height;
    int channels;
} OpenCLImage;

__kernel void opencl_rgb_kernel(OpenCLImage input, __global unsigned char*  input_data, OpenCLImage output, __global unsigned char * output_data)
{
    int pixel_x = get_global_id(0);
    int pixel_y = get_global_id(1);
    unsigned char * cur_in_pixel, *cur_out_pixel;
    float avg = 0;

    cur_in_pixel = (unsigned char *)(input_data + pixel_y*input.widthStep + pixel_x * input.channels);
    cur_out_pixel = (unsigned char *)(output_data + pixel_y*output.widthStep + pixel_x * output.channels);

    avg += cur_in_pixel[0];
    avg += cur_in_pixel[1];
    avg += cur_in_pixel[2];
    avg /= 3.0f;

    if(avg > 255.0f)
        avg = 255.0f;
    else if(avg < 0.0f)
        avg = 0.0f;

    *cur_out_pixel = avg;
}

The overhead of copying the struct's value to every thread that gets created might be the reason for the time; with global memory, a reference is enough instead. Only the SDK implementer will be able to answer exactly... :)

You may want to try a local_work_size like [64, 1, 1] in order to coalesce your memory accesses. (Note that 64 is a divisor of 1280.)

As previously said, you have to use a profiler in order to get more information. Are you using an NVIDIA card? Then download CUDA 4 (not 5), as it contains an OpenCL profiler.
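Independently of any vendor profiler, OpenCL itself can time individual commands if the queue is created with profiling enabled. A minimal sketch (error handling omitted; `ctx`, `device`, and `kernel` are placeholders for the program's own handles):

```c
// Create the queue with profiling enabled (OpenCL 1.x API).
cl_command_queue queue =
    clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);

// Attach an event to the kernel launch and read its device-side timestamps.
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_work_size,
                       local_work_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("kernel time: %f ms\n", (end - start) * 1e-6); /* timestamps are in ns */
```

Timing the write, kernel, and read events separately this way shows directly whether the 200 ms is spent in the kernel or in the transfers.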

Your performance must be far from the optimum. Change the local work size and the global work size, and try to process two or four pixels per thread. Can you change the way pixels are stored prior to your processing? Then break your struct into three arrays in order to coalesce memory accesses more effectively.
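One way to apply the "several pixels per thread" suggestion (an untested sketch; it assumes the struct has already been replaced by scalar stride arguments, a global size of width/4 x height, and a width divisible by 4 — otherwise a bounds check is needed):

```c
// OpenCL C: each work-item converts four consecutive pixels of one row.
__kernel void rgb_to_gray_x4(int in_step, int out_step,
                             __global const unsigned char *in,
                             __global unsigned char *out)
{
    int x4 = get_global_id(0) * 4;   // first of this work-item's four pixels
    int y  = get_global_id(1);
    __global const unsigned char *p = in  + y * in_step  + x4 * 3;
    __global unsigned char       *q = out + y * out_step + x4;

    for (int i = 0; i < 4; ++i)
        q[i] = (unsigned char)((p[3*i] + p[3*i+1] + p[3*i+2]) / 3);
}
```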

You can hide your memory transfers behind GPU work; this will be easier to do with a profiler at hand.
