
How do I vectorize a loop with functors?

I use some classes that store their data in containers; some of them hold multidimensional containers. These classes overload operator() to index the data. I use such objects a lot in loops and want to vectorize those loops. GCC is not able to vectorize them directly; it says "No SLP opportunities found in basic block" and gives up on vectorization.
How would I go about vectorizing my code?

I haven't checked with other compilers yet, as I want this to be vectorizable by a few of the prominent compilers in use.

First of all, I agree with the comment that says you must "manage your memory very close" if you intend to successfully vectorize your loop. In case you don't know about that, see the footnote at the end of this answer for a very brief and superficial introduction to memory alignment.

However, even if your memory is well aligned, there is another possibility that may hold you back. Georg Hager and Gerhard Wellein, authors of the respected book "Introduction to High Performance Computing for Scientists and Engineers", explicitly state that C++ operator overloading may prevent loop vectorization.

In their own words:

"(....) STL may define this operator in the following way (adapted from the GNU ISO C++ library source):

const T& operator[](size_t __n) const{ return *(this->_M_impl._M_start + __n); } 

Although this looks simple enough to be inlined efficiently, current compilers refuse to apply SIMD vectorization to the summation loop above. A single layer of abstraction, in this case an overloaded index operator, can thus prevent the creation of optimal loop code."

A good friend convinced me that this is not actually true for STL containers anymore, because compilers can eliminate the layer of indirection associated with operator[]. But it seems that you wrote your own container, so you must check whether the compiler can eliminate the layer of indirection associated with your own operator()! A good cross-check is to give yourself a way to work directly with the underlying array that your container holds (meaning: write a member function similar to std::vector::data() and use the C pointers as an "iterator" inside your loop).
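The cross-check described above could be sketched roughly as follows. The class Grid2D and the functions add_indexed and add_raw are hypothetical names invented for illustration; they stand in for the asker's container, whose real layout is not shown in the question. The sketch assumes row-major storage in a single std::vector<double>.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2-D container standing in for the asker's class.
// Assumption: the data lives in one contiguous, row-major buffer.
class Grid2D {
public:
    Grid2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), storage_(rows * cols, 0.0) {}

    // Overloaded call operator used for indexing.
    double& operator()(std::size_t i, std::size_t j) {
        return storage_[i * cols_ + j];
    }
    const double& operator()(std::size_t i, std::size_t j) const {
        return storage_[i * cols_ + j];
    }

    // Direct access to the underlying array, in the spirit of
    // std::vector::data().
    double* data() { return storage_.data(); }
    const double* data() const { return storage_.data(); }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    std::size_t size() const { return storage_.size(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> storage_;
};

// Version A: every access goes through operator().
void add_indexed(const Grid2D& a, const Grid2D& b, Grid2D& c) {
    for (std::size_t i = 0; i < c.rows(); ++i)
        for (std::size_t j = 0; j < c.cols(); ++j)
            c(i, j) = a(i, j) + b(i, j);
}

// Version B: the same sum over the raw buffers, bypassing operator().
// Assumes a, b and c all have the same dimensions.
void add_raw(const Grid2D& a, const Grid2D& b, Grid2D& c) {
    const double* pa = a.data();
    const double* pb = b.data();
    double* pc = c.data();
    const std::size_t n = c.size();
    for (std::size_t k = 0; k < n; ++k)
        pc[k] = pa[k] + pb[k];
}
```

Compiling both functions with optimization and a vectorization report enabled (with GCC, for example, -O3 together with -fopt-info-vec) should show whether the compiler treats the two loops differently. If only the raw-pointer version is vectorized, the operator() layer is the likely obstacle.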

Footnote about memory alignment:

Problem: assume you want to vectorize c[i] = a[i] + b[i].

First fact: sizeof(double) = 8 bytes = 64 bits.

Second fact: There is an assembly instruction that reads 2 doubles from memory and puts them in a 128-bit register. So with one instruction you can read a[0] and a[1], and with another you can read b[0] and b[1].

Third fact: When you apply the add instruction to those registers, you perform two 64-bit double additions at the same time.
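The three facts above can be made concrete with SSE2 intrinsics. This is only an illustrative sketch of what a vectorizing compiler might generate on x86, not something you would normally write by hand. The function name add_pairs is made up for this example, and the code assumes that a, b and c each point to 16-byte-aligned storage, which is exactly the alignment issue this footnote is about. On targets without SSE2 the sketch falls back to a plain scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: _mm_load_pd, _mm_add_pd, _mm_store_pd
#endif

// Computes c[i] = a[i] + b[i] for i in [0, n).
// Assumption: a, b and c are 16-byte aligned (required by the aligned
// load/store intrinsics used in the SSE2 path).
void add_pairs(const double* a, const double* b, double* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_load_pd(a + i);          // one load: a[i] and a[i+1]
        __m128d vb = _mm_load_pd(b + i);          // one load: b[i] and b[i+1]
        _mm_store_pd(c + i, _mm_add_pd(va, vb));  // two additions at once
    }
#endif
    // Scalar remainder (or the whole loop when SSE2 is unavailable).
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Because the loop advances two elements at a time from an aligned base, every address passed to _mm_load_pd stays a multiple of 16.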

The problem is that the aligned form of this load can read a[0] and a[1] together only if a[0] sits at a memory address that is a multiple of 16, and likewise for b[0] (if they are not, the compiler can test whether a[1] and b[1] are aligned, and so forth). That is why memory can be an issue that prevents vectorization. To fix that, you must write container allocators that guarantee that the first element of your container is placed at a memory address that is a multiple of 16.
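One possible shape for such an allocator is sketched below. It assumes a C++17 compiler, because it relies on the aligned forms of operator new and operator delete. AlignedAllocator and is_aligned are hypothetical names chosen for this example, and the sketch is deliberately minimal rather than a production-ready allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal allocator whose allocations start at a multiple of Alignment.
// Assumption: C++17 (aligned operator new / operator delete).
template <class T, std::size_t Alignment = 16>
struct AlignedAllocator {
    using value_type = T;

    static_assert((Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // An explicit rebind is needed because of the non-type template parameter.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_alloc();
        void* p = ::operator new(n * sizeof(T), std::align_val_t(Alignment));
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return false; }

// Helper to check the guarantee: is the address a multiple of alignment?
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With this in place, std::vector<double, AlignedAllocator<double, 16>> v(1000) should give a v.data() for which is_aligned(v.data(), 16) holds. A custom container can use the allocator the same way for its internal buffer.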

Update: This answer provides a detailed explanation about how to code an allocator that aligns your memory.

Update 2: another useful answer for learning how to code allocators

Update 3: alternative approach using posix_memalign/_aligned_malloc
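A rough sketch of the posix_memalign/_aligned_malloc approach follows. The wrapper names aligned_alloc_bytes and aligned_free_bytes are invented for this example. The code assumes a POSIX system for posix_memalign and the Microsoft C runtime for _aligned_malloc/_aligned_free, and the alignment passed in must be a power of two that is also a multiple of sizeof(void*).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign and free on POSIX systems
#if defined(_WIN32)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Returns a block of `bytes` bytes whose address is a multiple of
// `alignment`, or nullptr on failure.
void* aligned_alloc_bytes(std::size_t alignment, std::size_t bytes) {
#if defined(_WIN32)
    return _aligned_malloc(bytes, alignment);
#else
    void* p = nullptr;
    if (posix_memalign(&p, alignment, bytes) != 0)
        return nullptr;
    return p;
#endif
}

// Releases a block obtained from aligned_alloc_bytes.
void aligned_free_bytes(void* p) {
#if defined(_WIN32)
    _aligned_free(p);   // memory from _aligned_malloc needs _aligned_free
#else
    free(p);            // memory from posix_memalign is released with free
#endif
}
```

A container could call aligned_alloc_bytes(16, n * sizeof(double)) for its buffer and aligned_free_bytes in its destructor. Note that, unlike an allocator, these functions only hand out raw memory and do not construct or destroy objects.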
