
Boosting parallel reduction in OpenCL

I have an algorithm that performs a two-stage parallel reduction on the GPU to find the smallest element in an array. I know there is a trick to make it run faster, but I don't know what it is. Any ideas on how I can tune this kernel to speed up my program? It is not necessary to change the algorithm itself; maybe there are other tricks. All ideas are welcome.

Thank you!

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            const int length,
            __global float* result) {
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Each work-item scans the buffer with a stride of the global size.
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }
    // Tree reduction in local memory within the work-group.
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Work-item 0 writes this group's minimum.
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}

accumulator = (accumulator < element) ? accumulator : element;

Use the fmin function: it does exactly what you need, and it may produce faster code (a call to a built-in instruction, if available, instead of costly branching).
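As a minimal sketch of this suggestion, the first loop of one work-item can be emulated on the host in C++. The function name `strided_min` and its `global_id` / `global_size` parameters are hypothetical stand-ins for `get_global_id(0)` and `get_global_size(0)`; inside the OpenCL kernel the equivalent statement would be `accumulator = fmin(accumulator, element);`. Whether this is actually faster depends on the device and compiler, so it is worth measuring.

```cpp
#include <cmath>

// Host-side emulation of one work-item's accumulation loop (hypothetical helper).
// global_id and global_size stand in for get_global_id(0) and get_global_size(0).
float strided_min(const float* buffer, int length, int global_id, int global_size) {
    float accumulator = INFINITY;
    for (int i = global_id; i < length; i += global_size) {
        // std::fmin replaces the ternary; in OpenCL C this is fmin(accumulator, buffer[i]).
        accumulator = std::fmin(accumulator, buffer[i]);
    }
    return accumulator;
}
```

One difference to keep in mind: if one argument is NaN, fmin returns the other argument, whereas the ternary comparison handles NaN differently.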

global_index += get_global_size(0);

What is your typical get_global_size(0)?

Although your access pattern is not bad (it is coalesced: 128-byte chunks per 32-thread warp), it is better to access memory sequentially whenever possible. For instance, sequential access may help memory prefetching (note that OpenCL code can run on any device, including a CPU).

Consider the following scheme: each thread processes the range

[ get_global_id(0)*delta ,  (get_global_id(0)+1)*delta )

This results in fully sequential access within each thread.
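The blocked scheme can be sketched on the host in C++ as follows. The names `Range`, `block_range` and `blocked_min` are hypothetical, and choosing `delta` as the length divided by the number of work-items, rounded up, is an assumption; the answer does not say how `delta` is picked. The upper bound is clamped to `length` so that the last work-items do not read past the end of the buffer.

```cpp
#include <algorithm>
#include <cmath>

struct Range { int begin; int end; };  // half-open interval [begin, end)

// Assumption: delta = ceil(length / global_size), so each element is covered exactly once.
Range block_range(int global_id, int global_size, int length) {
    int delta = (length + global_size - 1) / global_size;
    int begin = std::min(global_id * delta, length);
    int end = std::min((global_id + 1) * delta, length);
    return {begin, end};
}

// Emulates one work-item reducing its own contiguous block.
float blocked_min(const float* buffer, int length, int global_id, int global_size) {
    Range r = block_range(global_id, global_size, length);
    float accumulator = INFINITY;
    for (int i = r.begin; i < r.end; ++i) {
        accumulator = std::fmin(accumulator, buffer[i]);
    }
    return accumulator;
}
```

This layout tends to suit CPU devices, where each work-item streams through its own contiguous block. On a GPU, adjacent work-items then read addresses `delta` elements apart, so it is worth benchmarking against the strided loop before adopting it.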

