OpenCL undefined behavior in parallel reduction algorithm

Question

I am working on a simple parallel reduction algorithm to find the minimum value in an array, and I am coming across some interesting undefined behavior. I am running Intel's OpenCL 1.2 on Ubuntu 16.04.

The following kernel is what I am trying to run, and it is currently giving me the wrong answer:

__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){

    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;

    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}

When I put in the array {0,-1,-2,-3,-4,-5,-6,-7,-8}, it ends up returning -2.

Here is where the undefined behavior comes in. When I run the following kernel with a printf statement before the barrier, I get the right answer (-8):

__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){

    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;
    printf("setting scratch[%i] to %f\n", index, min_val);

    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}

The only thing I can think of is that I am using the barrier command incorrectly, and all the printf is doing is causing a delay in the kernel that somehow synchronizes the calls so that they all complete before the final reduction step. Without the printf, work-item 0 executes the final reduction before the other work-items are finished.

Does anyone have any suggestions or tips on how to debug this issue?

Thanks in advance!!

Answer 1

The problem was that the kernel was being launched with one thread per work-group, and barriers only work within a work-group. See this response to a similar question: "Open CL no synchronization despite barrier".
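For anyone hitting the same thing, here is a minimal sketch of one way to restructure the reduction so that every barrier stays inside a single work-group. It is not the original poster's final code. The kernel name `Find_Min_Partial`, the `partial_min` buffer, and the `local_min` scratch argument are names I made up for illustration, and the sketch assumes the work-group size is a power of two and that the device supports doubles.

```c
// May be required on some implementations for double support
// (assumption: the device reports cl_khr_fp64).
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Stage 1: each work-group reduces its slice of value_arr to one partial minimum.
// local_min is a hypothetical __local scratch buffer with one double per work-item.
__kernel void Find_Min_Partial(const int arraySize,
                               __global const double* value_arr,
                               __global double* partial_min,
                               __local double* local_min)
{
    const int gid   = get_global_id(0);
    const int lid   = get_local_id(0);
    const int lsize = get_local_size(0);

    // Work-items past the end of the array contribute +infinity,
    // which can never win a minimum.
    local_min[lid] = (gid < arraySize) ? value_arr[gid] : INFINITY;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory (assumes lsize is a power of two).
    // Every work-item in the group reaches each barrier the same number of times.
    for (int stride = lsize / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            local_min[lid] = fmin(local_min[lid], local_min[lid + stride]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One result per work-group; combining these needs a separate step.
    if (lid == 0)
        partial_min[get_group_id(0)] = local_min[0];
}
```

The per-group results in `partial_min` still have to be combined, either on the host or by a second launch that uses a single work-group. With an in-order command queue, the second launch only starts after the first has finished, which gives the cross-group synchronization that `barrier` cannot. For an input as small as the nine-element example, it should also be enough to pass a `local_work_size` equal to the `global_work_size` in `clEnqueueNDRangeKernel`, so that all work-items share one work-group, as long as that size does not exceed the device's `CL_DEVICE_MAX_WORK_GROUP_SIZE`.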
