
Optimize OpenCL buffer writes?

I am having a major bottleneck when writing to a buffer.

What I want to do is very simple. First of all, I am using two global ids (I am using an image2d). Each thread reads 9 pixel values: the pixel at position (x,y) and its 8 neighbors, basically a 3x3 square block. Then I calculate some values and want to write each thread's results to the output buffer.

Each thread produces 64 values, which I write to the output buffer, so the output buffer has size (rows*cols*64). I also wanted to support calculations that produce up to 640 values per thread, but having each thread write 640 values to a buffer is obviously impossible because of the VRAM required.

I must point out that the threads write to different positions; there are no overwrites. That is, there are 64 * number_of_threads = 64 * get_global_size(0) * get_global_size(1) = 64 * rows * cols values in total.

The writing of these 64 values is the major bottleneck in my code. I think it has to do with memory bandwidth, but I am not sure.

What can I do so that each thread can calculate and write 64 values to an output buffer efficiently? Is this not possible?

My GPU is an RX 480 4GB. I know that a buffer of size (rows*cols*64) may sometimes be too big to fit in VRAM, but even when it fits, the writing is slow. I thought memory bandwidth was very high on GPUs?

There are also two other output buffers, but they are much smaller, so we can ignore them.

In summary, this is what the kernel does:

1) Read a square block of 9 pixels, where the middle one is the current pixel.

2) Multiply each of the 8 neighbors by the current value, giving 8 values per pixel.

3) Write the 8 neighbor values to the neighb buffer.

4) Write 8*8 values to the Rx buffer. This buffer "simulates" the x * x^T result, that is, the (8x1)x(1x8) outer product of the neighbor values.

Note that I am writing to the output buffers in "transposed form": each thread at position (x,y) writes its 64 values consecutively at (y,x), (y+1,x), ..., (y+63,x). This is because:

1) It is the fastest method! The version where I write to (x,y), (x+1,y), ..., (x+63,y) is definitely slower.

2) I need it in this form because I am using the ArrayFire library, which will later load the buffer. ArrayFire consumes the buffer in row-major order and places the contents into its array in column-major order, so this way there is no need to transpose the array afterwards (which would require a lot of VRAM copies).
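Roughly, the kernel has the following structure (a simplified sketch with placeholder names and plain row-major indexing, not my exact code, which writes in the transposed order described above):

__kernel void neighbours_and_outer(__read_only image2d_t img,
                                   __global float *neighb,  // 8 values per pixel
                                   __global float *Rx)      // 64 values per pixel
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;

    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int cols = get_global_size(0);

    // 1) read the centre pixel and its 8 neighbours (3x3 block)
    float c = read_imagef(img, smp, (int2)(x, y)).x;
    float n[8];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                n[k++] = read_imagef(img, smp, (int2)(x + dx, y + dy)).x;

    // 2) multiply each neighbour by the centre value
    for (int i = 0; i < 8; ++i)
        n[i] *= c;

    const uint pixel = (uint)(y * cols + x);

    // 3) write the 8 values to the neighb buffer
    for (int i = 0; i < 8; ++i)
        neighb[pixel * 8 + i] = n[i];

    // 4) write the 8x8 outer product x * x^T to the Rx buffer
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            Rx[pixel * 64 + i * 8 + j] = n[i] * n[j];
}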

First, as you haven't explicitly mentioned it, I'll point out that if you can, use your GPU manufacturer's profiling tools to verify your bottlenecks. Even if something seems like a bottleneck, it might be a red herring.

However, it does sound like global memory writes might be a problem in your kernel. As you don't give any specifics, I can only point out a few general things to watch out for:

1. Memory layout

It seems each work-item in your setup writes 64 values consecutively in memory. This means that every work-item will be writing to a different cache line, which is almost certainly not optimal. If you can change the memory layout of your output, try to arrange it such that work-items write to adjacent memory locations at the same time.

For example, you might currently have:

uint output_index = 64 * (get_global_size(0) * get_global_id(1) + get_global_id(0));
for (unsigned i = 0; i < 64; ++i)
{
    output[output_index + i] = calculation(inputs, i);
}

Here, work-item (0, 0) would first write to item 0, then item 1, then item 2, while work-item (1, 0) would first write to item 64, then 65, etc.

It is usually faster if work-item (0, 0) writes to index 0 at the same time as work-item (1, 0) writes to index 1, and so on. So if you can, try laying out your output array so that the value dimension has a higher-order stride, so you can write:

uint stride = get_global_size(0) * get_global_size(1);
uint output_index = (get_global_size(0) * get_global_id(1) + get_global_id(0));
for (unsigned i = 0; i < 64; ++i)
{
    output[output_index] = calculation(inputs, i);
    output_index += stride;
}

2. Using local memory as an intermediate

If changing the layout of global memory is not an option, you can instead write your results to local memory in an order that would be inefficient for global memory, and then copy it from local memory to global memory efficiently. 如果不能更改全局内存的布局,则可以按对全局内存低效的顺序将结果写入本地内存,然后将其有效地从本地内存复制到全局内存。 By efficiently, I mean that adjacent work-items in your work-group should once again write to adjacent global memory locations. 所谓高效,是指您工作组中的相邻工作项应再次写入相邻的全局内存位置。 You can either do this explicitly or use the async_work_group_copy function in your kernel. 您可以显式地执行此操作,也可以在内核中使用async_work_group_copy函数。
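To illustrate, here is a sketch of the explicit version, assuming a 1-D work-group of 64 work-items, float results, the same hypothetical calculation() as in the snippets above, and the original "64 consecutive values per work-item" global layout:

#define VALUES_PER_ITEM 64
#define WG_SIZE 64   // assumed 1-D work-group size; the tile below uses 16 KiB of local memory

__kernel void staged_write(__global float *output /*, inputs... */)
{
    __local float tile[WG_SIZE * VALUES_PER_ITEM];

    const uint lid = get_local_id(0);

    // Write results to local memory in whatever order is convenient;
    // coalescing is not a concern here.
    for (uint i = 0; i < VALUES_PER_ITEM; ++i)
        tile[lid * VALUES_PER_ITEM + i] = calculation(inputs, i); // hypothetical, as above

    barrier(CLK_LOCAL_MEM_FENCE);

    // Copy the whole tile to global memory such that on every iteration
    // adjacent work-items write to adjacent addresses (coalesced writes).
    const uint group_base = get_group_id(0) * WG_SIZE * VALUES_PER_ITEM;
    for (uint i = lid; i < WG_SIZE * VALUES_PER_ITEM; i += WG_SIZE)
        output[group_base + i] = tile[i];

    // Or let the runtime do the copy:
    // event_t e = async_work_group_copy(&output[group_base], tile,
    //                                   WG_SIZE * VALUES_PER_ITEM, 0);
    // wait_group_events(1, &e);
}

Each work-item's 64 values still end up contiguous in global memory (starting at 64 times its global id), but the actual stores are now coalesced across the work-group.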

3. Compression

If there's some way to represent your 64 values in a way that's more space-efficient, that can help a lot, especially if you're subsequently sending the results back to the host CPU. For example, if precision is not so important and the range is constrained and you are currently using float, you could try using half (16-bit) floating-point values, or short/ushort 16-bit integer values for slightly higher precision but less range. Alternatively, if your values correlate in some way, you could use some other representation, such as a shared exponent.
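As one example of the half-precision option: if the values are float today and half precision is acceptable, the store loop from the strided snippet above could convert on the fly with the built-in vstore_half (this sketch assumes output is re-declared as __global half *):

// output declared as __global half *output
for (unsigned i = 0; i < 64; ++i)
{
    float value = calculation(inputs, i);       // same placeholder as above
    vstore_half(value, output_index, output);   // converts float -> half on store
    output_index += stride;
}

This halves the number of bytes written (and later transferred over PCIe), at the cost of roughly 3 significant decimal digits of precision and a maximum magnitude of 65504.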

4. Onward computation on GPU

If you are currently using the result from your computation on the host CPU, you will probably be bound by PCIe bandwidth, which is considerably lower than GPU-to-VRAM bandwidth. In this case, consider moving whatever further computation you are performing onto the GPU, even if the CPU implementation of this itself is not currently a bottleneck. Avoiding the copy from VRAM to system RAM may give you a bigger boost.

Better still, if you can avoid writing this result to global memory altogether, for example by performing the onward computation in the same kernel, possibly after storing the intermediate result in local memory to share it with the work-group, then you can avoid the memory bottleneck altogether.

5. Read the OpenCL optimisation guides

There might be other optimisations you can perform which are specific to your workload. As you haven't gone into any detail about what you're doing, we can't easily guess at those optimisations. The GPU manufacturers publish optimisation guides for OpenCL; make sure you read and understand them, and see if you can apply any of the advice to your task.

In my experience, I have not identified any bottleneck from writing to a buffer within a kernel, which I believe is what you are referring to.

Writing those 64 values in each thread should not be an issue; your bottleneck might be somewhere else. It might be something that is done before enqueueing the kernel, while preparing the buffer arguments.
