What does vectorization mean?

Is it a good idea to vectorize code? What are good practices for deciding when to do it? What happens underneath?

Vectorization means that the compiler detects that independent instructions in your code can be executed as a single SIMD instruction. The usual example is a loop like this:

for(i=0; i<N; i++){
  a[i] = a[i] + b[i];
}

It will be vectorized as (using vector notation):

for (i=0; i<(N-N%VF); i+=VF){
  a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
}

Basically, the compiler picks an operation that can be applied to VF elements of the array at the same time and performs it N/VF times, instead of performing the single-element operation N times.
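The vector notation above covers only the first N - N%VF elements, so a real loop also needs a scalar remainder pass. A minimal sketch in plain C follows; the names add_arrays_chunked and VF = 4 are assumptions for illustration, and the inner loop merely stands in for one vector instruction.

```c
#include <stddef.h>

#define VF 4  /* hypothetical vector width, in elements */

/* Adds b into a, handling VF elements per outer iteration. */
void add_arrays_chunked(float *a, const float *b, size_t n)
{
    size_t i = 0;
    size_t main_end = n - n % VF;  /* largest multiple of VF not above n */

    /* Main loop: each pass stands in for one VF-wide vector add. */
    for (; i < main_end; i += VF) {
        for (size_t k = 0; k < VF; k++) {
            a[i + k] = a[i + k] + b[i + k];
        }
    }

    /* Remainder loop: the last n % VF elements, one at a time. */
    for (; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}
```

With n = 7 and VF = 4, the main loop handles elements 0 to 3 in one pass and the remainder loop handles elements 4 to 6.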

It increases performance, but places more requirements on the architecture, which must provide suitable SIMD instructions.

As mentioned above, vectorization is used to take advantage of SIMD instructions, which perform the same operation on different data items packed into large registers.

A general guideline for enabling a compiler to autovectorize a loop is to ensure that there are no flow dependencies or anti-dependencies between data elements in different iterations of the loop.

http://en.wikipedia.org/wiki/Data_dependency
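To make the two kinds of dependence concrete, here is a small sketch with three loops. The function names are made up for illustration. Whether a particular compiler vectorizes each loop is not claimed here; some compilers can still handle the anti-dependence case by loading before storing.

```c
#include <stddef.h>

/* No loop-carried dependence: iteration i touches only a[i] and b[i]. */
void add_independent(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}

/* Flow (read-after-write) dependence: iteration i reads a[i - 1],
   which iteration i - 1 has just written. */
void running_sum(float *a, const float *b, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + b[i];
    }
}

/* Anti (write-after-read) dependence: iteration i reads a[i + 1],
   which iteration i + 1 overwrites afterwards. */
void shift_add(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        a[i] = a[i + 1] + b[i];
    }
}
```

In running_sum the iterations cannot simply be executed side by side, because each one needs the value produced by the previous one.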

Some compilers, such as the Intel C++/Fortran compilers, can autovectorize code. When the Intel compiler is unable to vectorize a loop, it can report why. These reports can be used to modify the code so that it becomes vectorizable (assuming that is possible).
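GCC and Clang offer comparable vectorization reports. The flags in the comment below are not from the answer above, and the file name saxpy.c and the function saxpy are assumptions for illustration; treat this as a sketch of how one might ask for a report, not as the Intel workflow described here.

```c
/*
 * Hypothetical file name: saxpy.c. Possible ways to request a report:
 *
 *   gcc   -O3 -c saxpy.c -fopt-info-vec-optimized -fopt-info-vec-missed
 *   clang -O3 -c saxpy.c -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
 */
#include <stddef.h>

/* y[i] += alpha * x[i]; iterations are independent of one another. */
void saxpy(float alpha, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = y[i] + alpha * x[i];
    }
}
```

The "missed" variants of these flags are the ones that explain why a loop was left scalar.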

Dependencies are covered in depth in the book 'Optimizing Compilers for Modern Architectures: A Dependence-based Approach'.

It's SSE code generation.

You have a loop with float matrix code in it, such as matrix1[i][j] + matrix2[i][j], and the compiler generates SSE code.
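As a rough picture of what such generated code does, the same addition can be written by hand with SSE intrinsics. This is only a sketch: the function name add_matrices is an assumption, the matrices are assumed to be stored contiguously, and the actual output of a compiler will differ in detail. The code falls back to a scalar loop when SSE is not available.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* out[i] = m1[i] + m2[i] over n floats (matrices stored contiguously). */
void add_matrices(const float *m1, const float *m2, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per 128-bit register; unaligned loads and stores
       work for any pointer alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(m1 + i);
        __m128 y = _mm_loadu_ps(m2 + i);
        _mm_storeu_ps(out + i, _mm_add_ps(x, y));
    }
#endif
    /* Scalar loop: the tail, or everything when SSE is unavailable. */
    for (; i < n; i++) {
        out[i] = m1[i] + m2[i];
    }
}
```

For a float matrix1[R][C], a call could look like add_matrices(&matrix1[0][0], &matrix2[0][0], &result[0][0], R * C).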

Vectorization need not be limited to a single register that holds several data items, such as a 128-bit register holding 4 x 32-bit values. It depends on the architecture's limitations. Some architectures have multiple execution units, each with registers of its own. In that case, part of the data can be fed to each execution unit, and the result can be taken from the register corresponding to that unit.

For example, consider the case below.

for(i=0; i < N; i++)
{
  a[i] = a[i] + b[i];
}

If I am working on an architecture that has two execution units, then my vector size is two. The loop above will be rewritten as

for(i=0; i+1 < N; i+=2)
{
  a[i] = a[i] + b[i];
  a[i+1] = a[i+1] + b[i+1];
}
if (i < N)  /* odd N: one element is left over */
{
  a[i] = a[i] + b[i];
}

NOTE: The 2 in the for statement is derived from the vector size.

Because I have two execution units, the two statements inside the loop are fed to them, one statement each. Each unit accumulates its result separately, and the results from the two units are combined at the end.
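The idea of accumulating separately and combining at the end applies most directly to a reduction, such as summing an array. The following is a sketch under that assumption, not the loop from this answer; sum_two_accumulators is a made-up name, and the two partial sums stand in for the two execution units. Note that reordering floating-point additions this way can change rounding.

```c
#include <stddef.h>

/* Sums n floats using two independent partial sums. */
float sum_two_accumulators(const float *a, size_t n)
{
    float sum0 = 0.0f;  /* partial sum for the first unit  */
    float sum1 = 0.0f;  /* partial sum for the second unit */
    size_t i = 0;

    /* The two additions per iteration do not depend on each other. */
    for (; i + 1 < n; i += 2) {
        sum0 += a[i];
        sum1 += a[i + 1];
    }

    /* Odd n: one element is left over. */
    if (i < n) {
        sum0 += a[i];
    }

    /* Final step: combine the partial sums. */
    return sum0 + sum1;
}
```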

The good practices are:
1. Constraints such as dependencies (between different iterations of the loop) need to be checked before vectorizing the loop.
2. Function calls inside the loop should be avoided.
3. Pointer access can create aliasing, which needs to be prevented.
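For the aliasing point, one option in C99 and later is the restrict qualifier, which lets the programmer promise that the pointers do not overlap. A minimal sketch, with add_no_alias as a made-up name; the promise is the caller's responsibility, and passing overlapping arrays is undefined behavior.

```c
#include <stddef.h>

/* restrict: the caller promises that a and b never overlap, so the
   compiler does not have to assume that writing a[i] changes b. */
void add_no_alias(float *restrict a, const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}
```

Without restrict, a compiler may have to add a runtime overlap check or leave the loop scalar.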

Maybe also have a look at libSIMDx86 (source code).

A nice, well-explained example is:

Choosing to Avoid Branches: A Small Altivec Example
