
Best approach to FIFO implementation in an OpenCL kernel

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array with the temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)

I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).

Description of the picture:

One at a time, the values are passed into the array (temp array), which is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied element-wise with the coefficient array in parallel, and the values at each index are then accumulated into one single element. This continues until the input array reaches its final element.

[diagram: input values are shifted one at a time into a buffer the same size as the coefficient array; the buffer is multiplied element-wise with the coefficients and the products are summed into one output value]
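To pin down what the kernel has to compute, here is a minimal scalar sketch in plain C of the operation the diagram describes (a 58-tap FIR filter). The names fir_scalar, NUM_TAPS, input and output are illustrative assumptions, not code from the question.

#include <stddef.h>

#define NUM_TAPS 58

/* Scalar reference: for every output sample, multiply the last NUM_TAPS
   input samples with the coefficients and accumulate the products.
   Samples before the start of the input are treated as zero. */
void fir_scalar(const float *input, size_t inputLength,
                const float *coefficients, float *output)
{
    for (size_t n = 0; n < inputLength; n++) {
        float sum = 0.0f;
        for (size_t k = 0; k < NUM_TAPS; k++) {
            if (n >= k) {                    /* skip samples before index 0 */
                sum += coefficients[k] * input[n - k];
            }
        }
        output[n] = sum;
    }
}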

What happens with my code?

For 60 elements from the input it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a far better way to do what I am attempting. My code is below.

Here are some things that I know for sure are wrong with the code. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.

I tried to figure out why the FIFO function was taking so long, so I started commenting lines out. I began by commenting out everything except the first for loop of the FIFO function. That took about 50 ms. When I uncommented the next loop, it jumped to 8000 ms. So the delay has to do with the transfer of data.

Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)

float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {

//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array that's the same size as the coefficients and accumulate
//store into one output value of the output array
//repeat till input array has reached the end

int globalId = get_global_id(0); 

float output = 0.0f;

//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
    tempArrayForShift[i] = temp[i];
}

//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
    temp[i] = tempArrayForShift[i];
}

//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i <  58; i ++) {
//  output = multipliedResult[i] + output;
}

//Returned summed value of temp array
return output;
}


__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) { 

//Initialize the temporary array values to 0
for (int i = 0; i <  58; i ++) {
    temp[i] = 0;
    tempArrayForShift[i] = 0;
    multipliedResult[i] = 0;
}

//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i <  60; i ++) {
    Output[i] = fifo(Array[i], coefficients, 58);
}

}

I have had this problem with OpenCL for a long time. I am not sure how to implement parallel and sequential instructions together.

Another alternative I was thinking about

In the main cpp file, I was thinking of implementing the FIFO buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution, or would it just be completely inefficient?

To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are using just a single thread, and a GPU is very slow per thread but can be very fast if many threads are running at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered, so you can just grab these values from memory, multiply them with the coefficients and write back the result.

A simple implementation would be (launch with as many threads as output values):

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) 
{ 
    int globalId = get_global_id(0); 
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        //skip taps that would read before the start of Array
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}

This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, since the values are reused several times. The operation you are trying to perform is called a convolution (1D). NVidia has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You could adapt their convolutionRows kernel for a 1D convolution.
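In the same spirit, here is a sketch of what a locally tiled 1D version could look like. It is not taken from the NVIDIA sample; the kernel name, GROUP_SIZE, KERNEL_LENGTH, the length argument and the zero padding at the array boundaries are assumptions for illustration. Each work group stages the window it needs into __local memory once (launch with a local size of GROUP_SIZE), so the 58 values that every output reuses are read from fast local memory instead of global memory.

#define KERNEL_LENGTH 58
#define GROUP_SIZE 64

__kernel void lowpass_tiled(__global const float *Array,
                            __constant float *coefficients,
                            __global float *Output,
                            const int length)   //number of input/output samples
{
    //one element per work item plus the (KERNEL_LENGTH - 1) samples of left halo
    __local float tile[GROUP_SIZE + KERNEL_LENGTH - 1];

    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int groupStart = get_group_id(0) * GROUP_SIZE;

    //cooperatively stage the group's window into local memory,
    //padding with zeros before the start and after the end of Array
    for (int idx = lid; idx < GROUP_SIZE + KERNEL_LENGTH - 1; idx += GROUP_SIZE) {
        int src = groupStart + idx - (KERNEL_LENGTH - 1);
        tile[idx] = (src >= 0 && src < length) ? Array[src] : 0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < length) {
        float sum = 0.0f;
        //same multiply-accumulate as above, but every read hits local memory
        for (int i = 0; i < KERNEL_LENGTH; i++) {
            sum += tile[lid + i] * coefficients[KERNEL_LENGTH - 1 - i];
        }
        Output[gid] = sum;
    }
}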

Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.

the steps:

  1. init local values to 0
  2. copy coefficients to a local variable

looping over the output elements to compute:

  3. shift existing elements (work items > 0 only)
  4. copy the new element (work item 0 only)
  5. compute the dot product
    5a. multiplication - one per work item
    5b. reduction loop to compute the sum
  6. copy the dot product to the output (WI 0 only)
  7. final barrier

the code:

//outputSize: total number of output samples to produce
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums, const int outputSize){

    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1  init local values to 0
    localArray[localId] = 0.0f;

    //2  copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp = 0.0f;
    for(int i = 0; i < outputSize; i++)
    {
        //3  shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4  copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5  compute dot product
        //5a multiply + barrier
        //   (assumes the coefficients buffer holds at least localSize entries, zero-padded past the last tap)
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);
        //5b reduction loop + barrier
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            //the localId + j < localSize check keeps the reduction in bounds for non-power-of-two group sizes
            if ((localId & mask) == 0 && localId + j < localSize) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }

}
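For completeness, here is a hedged host-side sketch (plain C, standard OpenCL API) of how this kernel could be launched with the single 65-item work group described above. The variable and buffer names (queue, kernel, bufArray, bufCoefficients, bufOutput, outputSize) are placeholders and error checking is omitted.

#include <CL/cl.h>

/* Launch the lowpass kernel above with one 65-item work group.
   __local arguments are passed as a size in bytes with a NULL pointer. */
void launch_lowpass(cl_command_queue queue, cl_kernel kernel,
                    cl_mem bufArray, cl_mem bufCoefficients, cl_mem bufOutput,
                    cl_int outputSize)
{
    size_t localSize  = 65;   /* one work item per FIFO slot, as in the answer */
    size_t globalSize = 65;   /* a single work group */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufArray);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufCoefficients);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufOutput);
    clSetKernelArg(kernel, 3, localSize * sizeof(float), NULL);  /* localArray */
    clSetKernelArg(kernel, 4, localSize * sizeof(float), NULL);  /* localSums  */
    clSetKernelArg(kernel, 5, sizeof(cl_int), &outputSize);

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);
    clFinish(queue);
}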

What about more work groups?
This is slightly simplified to let a single 1x65 work group compute the entire 1.2M-sample Output. To allow multiple work groups, you could use (total output size) / get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:

for (int i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)

Step #1 must be changed as well to initialize localArray to the correct starting state, rather than all 0s.

    //1  init local values
    int groupId = get_group_id(0);
    if(groupId == 0){
        localArray[localId] = 0.0f;
    }else{
        //later groups start with the samples just before their block of outputs
        localArray[localSize - localId] = Array[workAmount * groupId - localId];
    }

These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level will be detrimental to the kernel when it's running.
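As a hedged sketch of how the host could pick those numbers: query the compute unit count with clGetDeviceInfo and derive the group count and workAmount from it. The factor of 4 groups per compute unit below is just an arbitrary starting point to tune, not a recommendation from the answer.

#include <CL/cl.h>

/* Query the number of compute units and derive a work-group count from it.
   groupsPerCU is a tunable starting point, not a fixed rule. */
void choose_work_sizes(cl_device_id device, int outputSize,
                       size_t *globalSize, size_t *localSize, int *workAmount)
{
    cl_uint computeUnits = 1;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);

    const cl_uint groupsPerCU = 4;              /* tune this */
    cl_uint numGroups = computeUnits * groupsPerCU;

    *localSize  = 65;                           /* per the answer's kernel */
    *globalSize = numGroups * *localSize;
    /* outputs each group is responsible for (rounded up) */
    *workAmount = (outputSize + (int)numGroups - 1) / (int)numGroups;
}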

Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.

Disadvantages
The barriers will slow the kernel down just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal because of the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2 or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel, which makes them all wait fairly regularly.
