How do the shuffle/permute intrinsics work for 256 bit pd?

I'm trying to wrap my head around how the _mm256_shuffle_pd and _mm256_permute_pd intrinsics work. I can't seem to predict what the result of one of these operations will be.

First, for _mm_shuffle_ps all is good. The results I get are the ones I expect. For example:

float b[4] = { 1.12, 2.22, 3.33, 4.44 };

__m128 a = _mm_load_ps(&b[0]);
a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 1, 2));
_mm_store_ps(&b[0], a);
// 3.33 2.22 1.12 4.44

So everything is right here. Now I wanted to try this with __m256d, which is what I'm currently using in my code. From what I've found, the _mm256_shuffle_ps/pd intrinsics work differently.

My understanding here is that the control mask is applied two times: first to the low 128 bits and then to the high 128 bits. The lowest two pairs of control bits choose from the first vector (and store the values in the first and second word and in the fifth and sixth word of the result vector), while the highest two bit pairs choose from the second vector. For example:

float b[8] = { 1.12, 2.22, 3.33, 4.44, 5.55, 6.66, 7.77, 8.88 };

__m256 a = _mm256_load_ps(&b[0]);
a = _mm256_shuffle_ps(a, a, 0b00000111);
_mm256_store_ps(&b[0], a);
// 4.44 2.22 1.12 1.12 8.88 6.66 5.55 5.55

Here the result I expect (and actually get) is { 4.44, 2.22, 1.12, 1.12, 8.88, 6.66, 5.55, 5.55 }.

This should work as follows:

[image: hand-drawn diagram, not preserved]

(Sorry, I'm bad at drawing.) The same is done for the second vector (in this case a again) using the highest two pairs (so 00 00), filling the remaining positions.

I thought that _mm256_shuffle_pd would work the same way. So if I wanted the first double, I would have to move the 00 space and the 01 space to construct it correctly.

For example:

__m256d a = _mm256_load_pd(&b[0]);
a = _mm256_shuffle_pd(a, a, 0b01000100);
_mm256_store_pd(&b[0], a);
// 1.12 1.12 4.44 3.33

I would have expected this to output { 1.12, 1.12, 3.33, 3.33 }. In my head, I'm taking 00 01 ( 1.12 ) and 00 01 ( 3.33 ) from the first vector, and the same from the second, since it is the same vector.

I've tried many combinations for the control mask and I just can't wrap my head around how it is used, nor was I able to find a place where it is explained in a way I would understand.

So my question is: How does _mm256_shuffle_pd work? And how would I get the same result as _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)) with four doubles and a shuffle (if at all possible)?

shufps needs all 8 bits of its immediate just for 4 elements with 4 possible sources each. So it has no room to grow for 256-bit, and the only option was to replicate the same shuffle in both lanes.

But 128-bit shufpd only has 2 elements with 2 sources each, thus 2 x 1 bit. So the AVX version uses 4 bits total, 2 for each lane. Within each lane, the first bit picks the low result element from the first source and the second bit picks the high result element from the second source; the upper 4 bits of the immediate are ignored. (It's not lane-crossing, so it's not as powerful as 128-bit shufps.)


http://felixcloutier.com/x86/SHUFPD.html has full docs with a diagram and detailed pseudocode. Intel's intrinsics guide for _mm256_shuffle_pd has the same pseudocode.
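That pseudocode boils down to one selector bit per result element. Here is a minimal scalar sketch of it in plain C, assuming the question's data is really four doubles { 1.12, 2.22, 3.33, 4.44 }. The names model_shuffle_pd and shuffle_pd_example are made up for illustration and are not part of any library:

```c
#include <assert.h>
#include <stdio.h>

/* Scalar model of _mm256_shuffle_pd(a, b, imm) (vshufpd on ymm registers).
   Hypothetical helper for illustration, not a library function.
   Only the low 4 bits of imm are used, one bit per result element. */
void model_shuffle_pd(double dst[4], const double a[4],
                      const double b[4], unsigned imm)
{
    double r[4];                       /* temporary, so dst may alias a or b */
    r[0] = (imm & 1) ? a[1] : a[0];    /* low lane:  first element from a  */
    r[1] = (imm & 2) ? b[1] : b[0];    /*            second element from b */
    r[2] = (imm & 4) ? a[3] : a[2];    /* high lane: first element from a  */
    r[3] = (imm & 8) ? b[3] : b[2];    /*            second element from b */
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Replays the question's call, then the mask that gives what was expected. */
void shuffle_pd_example(void)
{
    const double v[4] = { 1.12, 2.22, 3.33, 4.44 };
    double out[4];

    /* 0b01000100 from the question: bits 7:4 are ignored, so only
       bit 2 matters and the third element becomes v[3]. */
    model_shuffle_pd(out, v, v, 0x44);
    printf("%.2f %.2f %.2f %.2f\n", out[0], out[1], out[2], out[3]);
    assert(out[0] == 1.12 && out[1] == 1.12 &&
           out[2] == 4.44 && out[3] == 3.33);

    /* All four selector bits clear: low element of each lane, twice. */
    model_shuffle_pd(out, v, v, 0x0);
    assert(out[0] == 1.12 && out[1] == 1.12 &&
           out[2] == 3.33 && out[3] == 3.33);
}
```

Under this model, the { 1.12, 1.12, 3.33, 3.33 } result the question expected corresponds to an immediate of 0, not 0b01000100.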

AVX2 vpermpd ( http://felixcloutier.com/x86/VPERMPD.html , intrinsic _mm256_permute4x64_pd ) is lane-crossing, and uses its immediate exactly the way 128-bit shufps does: four 2-bit selectors. Note that _mm256_permute_pd is a different instruction: it maps to AVX1 vpermilpd, which is in-lane and uses one selector bit per element.
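For the second part of the question, the four 2-bit selectors can be modelled the same way. This is a scalar sketch only; model_permute4x64_pd, permute4x64_example and the SHUF4 macro are hypothetical names. On AVX2 hardware the corresponding call would be _mm256_permute4x64_pd(a, _MM_SHUFFLE(3, 0, 2, 1)):

```c
#include <assert.h>

/* Same bit layout as _MM_SHUFFLE from <xmmintrin.h>; renamed so this
   sketch compiles without the intrinsics headers. */
#define SHUF4(d3, d2, d1, d0) \
    (((d3) << 6) | ((d2) << 4) | ((d1) << 2) | (d0))

/* Scalar model of _mm256_permute4x64_pd(a, imm) (AVX2 vpermpd).
   Hypothetical helper: each 2-bit field of imm picks any of the four
   source elements, so data may cross the 128-bit lane boundary. */
void model_permute4x64_pd(double dst[4], const double a[4], unsigned imm)
{
    double r[4];                       /* temporary, so dst may alias a */
    for (int i = 0; i < 4; i++)
        r[i] = a[(imm >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

void permute4x64_example(void)
{
    const double v[4] = { 1.12, 2.22, 3.33, 4.44 };
    double out[4];

    /* Selectors read right to left: out[0] = v[1], out[1] = v[2],
       out[2] = v[0], out[3] = v[3]. */
    model_permute4x64_pd(out, v, SHUF4(3, 0, 2, 1));
    assert(out[0] == 2.22 && out[1] == 3.33 &&
           out[2] == 1.12 && out[3] == 4.44);
}
```

With a single source this matches what _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)) does for four floats.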


The only lane-crossing 2-source shuffle is vperm2f128 ( _mm256_permute2f128_pd ), until AVX512F introduces the finer-granularity vpermt2pd and vpermt2ps (and equivalent integer shuffles).

AVX1 doesn't have any lane-crossing shuffles with granularity smaller than 128-bit, not even 1-source versions. If you need one, you have to build it out of vinsertf128 or vperm2f128 + in-lane shuffles.
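As one illustration of building a lane-crossing shuffle from those pieces, reversing four doubles takes a lane swap followed by an in-lane swap. This is a scalar sketch with hypothetical helper names; with real intrinsics the two steps would be _mm256_permute2f128_pd(a, a, 0x01) and then _mm256_permute_pd(t, 0x5):

```c
#include <assert.h>

/* Scalar model of _mm256_permute2f128_pd(a, b, imm) (vperm2f128).
   Each result lane takes a whole 128-bit lane: selector 0 = a.low,
   1 = a.high, 2 = b.low, 3 = b.high. Bit 3 of each 4-bit field zeroes
   that result lane. Hypothetical helper, not a library function. */
void model_permute2f128_pd(double dst[4], const double a[4],
                           const double b[4], unsigned imm)
{
    double r[4];
    for (int lane = 0; lane < 2; lane++) {
        unsigned ctl = (imm >> (4 * lane)) & 0xF;
        const double *src = (ctl & 2) ? b : a;
        unsigned half = (ctl & 1) * 2;          /* 0 = low lane, 2 = high */
        r[2 * lane]     = (ctl & 8) ? 0.0 : src[half];
        r[2 * lane + 1] = (ctl & 8) ? 0.0 : src[half + 1];
    }
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Scalar model of _mm256_permute_pd(a, imm) (vpermilpd): in-lane only,
   bit i of imm picks the low or high element of element i's own lane. */
void model_permute_pd(double dst[4], const double a[4], unsigned imm)
{
    double r[4];
    for (int i = 0; i < 4; i++) {
        int base = i & ~1;                      /* first element of the lane */
        r[i] = a[base + ((imm >> i) & 1)];
    }
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

void reverse4_example(void)
{
    const double v[4] = { 1.12, 2.22, 3.33, 4.44 };
    double t[4], out[4];

    model_permute2f128_pd(t, v, v, 0x01);  /* swap lanes: 3.33 4.44 1.12 2.22 */
    model_permute_pd(out, t, 0x5);         /* swap the pair inside each lane  */
    assert(out[0] == 4.44 && out[1] == 3.33 &&
           out[2] == 2.22 && out[3] == 1.12);
}
```

Two shuffles for one reversal is the cost of not having a lane-crossing permute with 64-bit granularity before AVX2.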


Thus, keeping 3D vectors in SIMD vectors is even worse with AVX than it is for float with 128-bit vectors. http://fastcpp.blogspot.com/2011/04/vector-cross-product-using-sse-code.html might be faster than scalar, but it's much worse than what you can do if you design your data layout for SIMD.

Use separate arrays of contiguous x[], y[], and z[] so you can do 4x cross products in parallel with no shuffling, and take advantage of FMA instructions. Use SIMD to do multiple vectors in parallel, not to speed up single vectors.
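A minimal sketch of that layout in plain C, assuming the vectors are doubles held in separate x/y/z arrays. The names cross_soa and cross_soa_example are hypothetical. Each iteration touches only element i of every array, so four iterations map directly onto one 256-bit vector with no shuffles:

```c
#include <assert.h>
#include <stddef.h>

/* Cross product c = a x b for n 3D vectors stored as separate x, y and z
   arrays (structure of arrays). Hypothetical helper for illustration.
   The output arrays are assumed not to overlap the inputs. */
void cross_soa(size_t n,
               const double *ax, const double *ay, const double *az,
               const double *bx, const double *by, const double *bz,
               double *cx, double *cy, double *cz)
{
    for (size_t i = 0; i < n; i++) {
        cx[i] = ay[i] * bz[i] - az[i] * by[i];
        cy[i] = az[i] * bx[i] - ax[i] * bz[i];
        cz[i] = ax[i] * by[i] - ay[i] * bx[i];
    }
}

void cross_soa_example(void)
{
    /* Four copies of x-hat cross y-hat, which should give z-hat. */
    const double ax[4] = { 1, 1, 1, 1 }, ay[4] = { 0 }, az[4] = { 0 };
    const double bx[4] = { 0 }, by[4] = { 1, 1, 1, 1 }, bz[4] = { 0 };
    double cx[4], cy[4], cz[4];

    cross_soa(4, ax, ay, az, bx, by, bz, cx, cy, cz);
    for (int i = 0; i < 4; i++)
        assert(cx[i] == 0.0 && cy[i] == 0.0 && cz[i] == 1.0);
}
```

A compiler may auto-vectorize this loop, possibly only after the pointers are marked restrict. A hand-written version would load four doubles from each array with _mm256_loadu_pd and use multiplies plus _mm256_fmsub_pd where FMA is available.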

See links in https://stackoverflow.com/tags/sse/info , especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ which explains the data-layout issue quite well, and which level of a loop to vectorize with SIMD.
