Explicitly prefetching non-contiguous data

Question

I do a lot of operations on sub-regions of images. For example, if I have a 100x100 image, I might want to iterate over it and process blocks of 10x10 pixels, something like:

for(each 10x10 block)
{
  for(each pixel in the block)
  {
    do something
  }
}

The problem with this is that the small blocks are not contiguous chunks of memory (i.e. the image pixels are stored in row-major order, so when I access a 10x10 block, the pixels in each row of the block are contiguous, but the rows of the block are not contiguous with each other). Is there anything that can be done to speed up access to the pixels in these blocks? Or is it just impossible to get fast access to a region of a data structure like this?

From a lot of reading I did, it sounded like first reading the pixels, as the only operation in a loop, might be useful:

// First read the pixels
vector<float> vals(numPixels);
for(pixels in first row)
{
  vals[i] = pixels[i];
}

// Then do the operations on the pixels
for(elements of vals)
{
  doSomething(vals[i]);
}

versus what I'm doing now, which is reading and operating on the pixels in the same loop:

// Read and operate on the pixels
for(pixels in first row)
{
  doSomething(pixels[i]);
}

but I was unable to find any actual code examples (versus theoretical explanations) of how to do this. Is there any truth to this?

Answer 1

GCC has a built-in function called __builtin_prefetch. You can pass an address to that function, and on targets that support it, GCC will emit a machine instruction that causes the cache line containing that address to be loaded into cache even though the data isn't used immediately.
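As a rough illustration of how the built-in could be applied to the 10x10 block case from the question, the sketch below requests the next row of a block while the current row is being processed. The function name sumBlock, its parameters, and the use of a std::vector<float> in row-major order are assumptions made for this example, not something from the question. Whether the hint helps at all depends on the target and on the hardware prefetcher, so it should be measured rather than assumed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums one blockSize x blockSize block of a row-major
// image whose rows are `width` pixels long, starting at pixel (x0, y0).
float sumBlock(const std::vector<float>& image, std::size_t width,
               std::size_t x0, std::size_t y0, std::size_t blockSize)
{
    float sum = 0.0f;
    for (std::size_t y = 0; y < blockSize; ++y) {
        // Start of the current block row inside the full image.
        const float* row = image.data() + (y0 + y) * width + x0;
#if defined(__GNUC__)
        if (y + 1 < blockSize) {
            // Ask for the start of the next block row, which lies `width`
            // floats further on. Arguments: address, 0 = read access,
            // 1 = low temporal locality. Only the cache line holding this
            // address is requested; a block row may span more than one line.
            __builtin_prefetch(row + width, 0, 1);
        }
#endif
        // The pixels within one block row are contiguous.
        for (std::size_t x = 0; x < blockSize; ++x) {
            sum += row[x];
        }
    }
    return sum;
}
```

For a 100x100 image, a call such as sumBlock(image, 100, 20, 30, 10) would process the block whose top-left pixel is at column 20, row 30. On compilers that do not define __GNUC__, the hint is simply compiled out and the loop behaves as before.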

Many modern image-processing applications store images in tiles, as opposed to the rows (aka scanlines) you describe. GIMP, for example, does that. So if you have control over the way the image is stored, a tiled layout will likely increase locality, and therefore reduce cache misses and improve performance.
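To make the tiled idea concrete, here is a minimal sketch of one possible tiled layout, in which all pixels of a tile sit next to each other in memory. The TiledImage type and its members are hypothetical names for this example; this is not GIMP's actual storage format, only an illustration of the indexing arithmetic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tiled image: each tileSize x tileSize tile is stored as one
// contiguous run of floats, and tiles are laid out left to right, top to
// bottom. Edge tiles are padded to full size.
struct TiledImage {
    std::size_t width, height, tileSize;
    std::size_t tilesPerRow;
    std::vector<float> data;

    TiledImage(std::size_t w, std::size_t h, std::size_t t)
        : width(w), height(h), tileSize(t),
          tilesPerRow((w + t - 1) / t),
          data(tilesPerRow * ((h + t - 1) / t) * t * t, 0.0f) {}

    // Map image coordinates (x, y) to an offset into `data`.
    std::size_t index(std::size_t x, std::size_t y) const {
        std::size_t tile = (y / tileSize) * tilesPerRow + (x / tileSize);
        std::size_t inTile = (y % tileSize) * tileSize + (x % tileSize);
        return tile * tileSize * tileSize + inTile;
    }

    float& at(std::size_t x, std::size_t y) { return data[index(x, y)]; }

    // Pointer to the tileSize * tileSize contiguous pixels of tile (tx, ty).
    const float* tile(std::size_t tx, std::size_t ty) const {
        return data.data() + (ty * tilesPerRow + tx) * tileSize * tileSize;
    }
};
```

With TiledImage img(100, 100, 10), the 100 pixels of any 10x10 tile occupy 400 consecutive bytes, so a per-tile loop walks memory linearly instead of jumping by a full image row after every 10 pixels. The trade-off is that per-pixel access through at() costs extra divisions, and blocks that straddle tile boundaries still touch several tiles.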

Answer 1
1 2012-10-20 18:41:21