
Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here.

After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.

However, in doing so, a few questions have arisen as follows:

  • Why are multiple passes used? The most passes I have been able to make the reduction require is two, and the second pass only takes a very small number of elements, so it is very unsuitable for an OpenCL process (i.e. wouldn't it be better to stick to a single pass and then process its results on the CPU?).

  • When I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?

  • In the OpenCL kernels, can anyone explain what is being done here:

while (i < n) {
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}

as opposed to what is being done here? 而不是在这里做什么?

#define ACCUM_LOCAL_I1(s, i, j) \
 { \
    int x = ((__local int*)(s))[(size_t)(i)]; \
    int y = ((__local int*)(s))[(size_t)(j)]; \
    ((__local int*)(s))[(size_t)(i)] = (x + y); \
 }

Thanks! S

To answer the first two questions:

 why are multiple passes used? 

Reducing millions of elements to a few thousand can be done in parallel with device utilization of almost 100%, but the final step is quite tricky. So, instead of doing everything in one shot and leaving most threads idle, Apple's implementation does a first-pass reduction, then adapts the work items to the new, much smaller reduction problem, and finally completes it. This is a very OpenCL-specific optimization; it may not make sense in plain C++.

when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?

A float32 has a 23-bit mantissa, so integers above 2^24 can no longer all be represented exactly. Values around 24M = 1.43 x 2^24 (in float representation) have a rounding error in the range +/-(2^24/2^23)/2 ~= 1.

That means, if you do:

 float A = 24000000;
 float B = A + 1; // ~1 error here

The rounding error is on the order of the data being added, so you get big errors if you repeat that in a loop!

This will not happen on 64-bit CPUs, because the 32-bit float math there internally uses 48-bit precision, avoiding these errors. However, if the float gets close to 2^48, they will happen as well; but that is not the typical case for normal "counting" integers.

The problem is with the precision of 32-bit floats. You're not the first person to ask about this either: OpenCL reduction result wrong with large floats
