
Best way to load/store from/to general purpose registers to/from xmm/ymm register

What is the best way to load and store general purpose registers to/from SIMD registers? So far I have been using the stack as a temporary. For example,

mov [rsp + 0x00], r8
mov [rsp + 0x08], r9
mov [rsp + 0x10], r10
mov [rsp + 0x18], r11
vmovdqa ymm0, [rsp] ; stack is properly aligned first.

I don't think there's any instruction that can do this directly (or in the other direction), since it would mean an instruction with five operands. However, the code above seems silly to me. Is there a better way to do it? I can only think of one alternative: use pinsrd and related instructions. But that does not seem any better.

The motivation is that sometimes it is faster to do some things with AVX2 and others with general purpose registers. For example, within a small piece of code there are four 64-bit unsigned integers; I will need four xor and two mulx from BMI2. It would be faster to do the xor with vpxor, however, mulx does not have an AVX2 equivalent. Any performance gain of vpxor over 4 xor is lost to the packing and unpacking.

Is your bottleneck latency, throughput, or fused-domain uops? If it's latency, then store/reload is horrible, because of the store-forwarding stall from narrow stores to a wide load.

For throughput and fused-domain uops, it's not horrible: just 5 fused-domain uops, bottlenecking on the store port. If the surrounding code is mostly ALU uops, it's worth considering.


For the use-case you propose:

Spending a lot of instructions/uops on moving data between integer and vector regs is usually a bad idea. PMULUDQ does give you the equivalent of a 32-bit mulx, but you're right that 64-bit multiplies aren't available directly in AVX2. (AVX512 has them.)

You can do a 64-bit vector multiply using the usual extended-precision techniques with PMULUDQ. My answer on Fastest way to multiply an array of int64_t? found that vectorizing 64 x 64 => 64b multiplies was worth it with AVX2 256b vectors, but not with 128b vectors. But that was with data in memory, not with data starting and ending in vector regs.
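For reference, a minimal sketch of that extended-precision trick (register choices here are arbitrary; it assumes the four pairs of 64-bit operands are already packed in ymm0 and ymm1, and keeps only the low 64 bits of each product):

vpsrlq    ymm2, ymm0, 32      # a_hi for each 64-bit lane
vpsrlq    ymm3, ymm1, 32      # b_hi for each 64-bit lane
vpmuludq  ymm2, ymm2, ymm1    # a_hi * b_lo
vpmuludq  ymm3, ymm3, ymm0    # b_hi * a_lo
vpaddq    ymm2, ymm2, ymm3    # sum of cross products
vpsllq    ymm2, ymm2, 32      # shift cross products into the high half
vpmuludq  ymm0, ymm0, ymm1    # a_lo * b_lo (full 64-bit product)
vpaddq    ymm0, ymm0, ymm2    # low 64 bits of each a*b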

In this case, it might be worth building a 64x64 => 128b full multiply out of multiple 32x32 => 64-bit vector multiplies, but it might take so many instructions that it's just not worth it. If you do need the upper-half results, unpacking to scalar (or doing your whole thing scalar) might be best.

Integer XOR is extremely cheap, with excellent ILP (latency=1, throughput=4 per clock). It's definitely not worth moving your data into vector regs just to XOR it, if you don't have anything else vector-friendly to do there. See the x86 tag wiki for performance links.


Probably the best way for latency is:

vmovq   xmm0, r8                   # 1 uop for p5 (SKL), 1c latency
vmovq   xmm1, r10                  # 1 uop for p5 (SKL), 1c latency
vpinsrq xmm0, xmm0, r9, 1          # 2 uops for p5 (SKL), 3c latency
vpinsrq xmm1, xmm1, r11, 1         # 2 uops for p5 (SKL), 3c latency
vinserti128 ymm0, ymm0, xmm1, 1    # 1 uop for p5 (SKL), 3c latency

Total: 7 uops for p5, with enough ILP to run them almost all back-to-back. Since presumably r8 will be ready a cycle or two sooner than r10 anyway, you're not losing much.
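For the other direction (getting the four qwords from ymm0 back into integer registers without going through memory), an analogous ALU-only sequence would be something like this (register choices are again arbitrary):

vextracti128 xmm1, ymm0, 1    # high 128-bit lane -> xmm1
vmovq   r8,  xmm0             # element 0
vpextrq r9,  xmm0, 1          # element 1
vmovq   r10, xmm1             # element 2
vpextrq r11, xmm1, 1          # element 3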


Also worth considering: whatever you were doing to produce r8..r11, do it with vector-integer instructions so your data is already in XMM regs. Then you still need to shuffle them together, though, with 2x PUNPCKLQDQ and VINSERTI128.
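Something like this, assuming the four qwords end up in the low elements of xmm2..xmm5 (those registers are just placeholders):

vpunpcklqdq xmm0, xmm2, xmm3        # low qwords of xmm2, xmm3 -> elements 0, 1
vpunpcklqdq xmm1, xmm4, xmm5        # low qwords of xmm4, xmm5 -> elements 2, 3
vinserti128 ymm0, ymm0, xmm1, 1     # combine the two 128-bit halves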
