SIMD versions of SHLD/SHRD instructions

Question

SHLD/SHRD instructions are assembly instructions used to implement multiprecision shifts.

Consider the following problem:

uint64_t array[4] = {/*something*/};
left_shift(array, 172);
right_shift(array, 172);

What is the most efficient way to implement left_shift and right_shift, two functions that shift an array of four 64-bit unsigned integers as if it were one big 256-bit unsigned integer?

Is the most efficient way of doing that to use the SHLD/SHRD instructions, or are there better instructions (such as SIMD versions) on modern architectures?

Answer 1

In this answer I'm only going to talk about x64.
x86 has been outdated for 15 years now; if you're coding in 2016 it hardly makes sense to be stuck in 2000.
All timings are taken from Agner Fog's instruction tables.

Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel Skylake they have a latency of 4 cycles and use 4 uops, which ties up a lot of execution units; on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means a

SHLD RAX,RDX,cl        4 uops, 4 cycle latency.  -> 1/16 per bit
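
For readers who think more easily in C than in assembly, here is a minimal sketch of what the 64-bit SHLD and SHRD compute. The function names shld64 and shrd64 are made up for this illustration. The hardware masks the count to 6 bits; this sketch instead assumes the caller passes a count in the range 0..63.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of SHLD dst, src, cl: shift dst left and fill the
   vacated low bits with the top bits of src. Hypothetical helper. */
static uint64_t shld64(uint64_t dst, uint64_t src, unsigned count)
{
    assert(count < 64);          /* assumption: caller already masked the count */
    if (count == 0)
        return dst;              /* a 64-bit shift by 64 is undefined in C */
    return (dst << count) | (src >> (64 - count));
}

/* Scalar model of SHRD dst, src, cl: shift dst right and fill the
   vacated high bits with the low bits of src. Hypothetical helper. */
static uint64_t shrd64(uint64_t dst, uint64_t src, unsigned count)
{
    assert(count < 64);
    if (count == 0)
        return dst;
    return (dst >> count) | (src << (64 - count));
}
```

Only dst changes; src is read but left alone, which is why a 256-bit shift needs one such step per word.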

Using 2 shifts + add you can do this too, but it turns out slower rather than faster.

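@Init:
MOV R15,-1
SHL R15,cl        //ones in every bit from position cl upwards.
NOT R15           //mask for later use: only the low cl bits are set.
@Work:
SHL RAX,cl        //3 uops, 2 cycle latency
ROL RDX,cl        //3 uops, 2 cycle latency
AND RDX,R15       //1 uop, 0.25 latency; keeps the bits rotated out of the top of RDX
OR RAX,RDX        //1 uop, 0.25 latency
//Still needs unrolling to achieve least amount of slowness.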

Note that the SHLD above only shifts 64 bits, because RDX is not affected.
So you're trying to beat 4 cycles per 64 bits.
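
As a baseline to measure against, the two functions from the question can be sketched in portable C. This is one possible reference implementation, not something the answer above prescribes. It assumes array[0] holds the least significant 64 bits (the question does not say) and that the shift amount is below 256.

```c
#include <assert.h>
#include <stdint.h>

/* Shift a 256-bit value left by n bits. Assumption: a[0] is the least
   significant word. Walks from the top word down so it can work in place. */
static void left_shift(uint64_t a[4], unsigned n)
{
    assert(n < 256);
    unsigned words = n / 64, bits = n % 64;
    for (int i = 3; i >= 0; --i) {
        int src = i - (int)words;                  /* word that lands in a[i] */
        uint64_t hi = src >= 0 ? a[src] : 0;
        uint64_t lo = src >= 1 ? a[src - 1] : 0;   /* supplies the carried-in bits */
        a[i] = bits ? (hi << bits) | (lo >> (64 - bits)) : hi;
    }
}

/* Shift a 256-bit value right by n bits, same layout assumption.
   Walks from the bottom word up so it can work in place. */
static void right_shift(uint64_t a[4], unsigned n)
{
    assert(n < 256);
    unsigned words = n / 64, bits = n % 64;
    for (unsigned i = 0; i < 4; ++i) {
        unsigned src = i + words;
        uint64_t lo = src < 4 ? a[src] : 0;
        uint64_t hi = src + 1 < 4 ? a[src + 1] : 0;
        a[i] = bits ? (lo >> bits) | (hi << (64 - bits)) : lo;
    }
}
```

With the question's shift of 172, the split is 2 whole words plus 44 bits, so a value of 1 ends up as bit 44 of array[2] after left_shift, and right_shift by 172 brings it back.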

//4*64 bits parallel shift.  
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3    1uop, 0.5 cycle latency.

However, if you want it to do exactly what SHLD does, you'll need an extra VPSRLVQ and an OR to combine the two results.

VPSLLVQ YMM1, YMM2, YMM3    1uop, 0.5 cycle latency.  
VPSRLVQ YMM5, YMM2, YMM4    1uop, 0.5 cycle latency.   
VPOR    YMM1, YMM1, YMM5    1uop, 0.33 cycle latency.

You'll need to interleave 4 sets of these, costing you (3*4)+2=14 YMM registers.
Doing so, I doubt you'll profit from the low 0.33 latency of VPOR, so I'll assume a 0.5 latency instead.
That makes 3 uops, 1.5 cycle latency for 256 bits = 1/171 cycle per bit = 0.37 cycle per QWord = 10x faster, not bad.
If you are able to get 1.33 cycles per 256 bits, that is 1/192 cycle per bit = 0.33 cycle per QWord = 12x faster.
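
The shift/shift/OR sequence can be sketched with AVX2 intrinsics as follows. This is an illustration under several assumptions, not the answer's exact code: the name left_shift_small is made up, array[0] is taken to be the least significant word, and the shift is limited to fewer than 64 bits. It also adds a lane permute (VPERMQ) so that each word receives the bits carried out of its lower neighbour, a step the instruction listing above does not show. When the compiler is not targeting AVX2, a scalar fallback with the same behaviour is used.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Shift a 256-bit value left by n bits, 0 <= n < 64.
   Assumption: a[0] is the least significant word. Hypothetical helper. */
static void left_shift_small(uint64_t a[4], unsigned n)
{
    assert(n < 64);
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256((const __m256i *)a);
    /* Lanes become [a0, a0, a1, a2]; then lane 0 is cleared so every
       lane holds the word just below it (nothing below the lowest). */
    __m256i below = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2, 1, 0, 0));
    below = _mm256_blend_epi32(below, _mm256_setzero_si256(), 0x03);
    /* VPSLLVQ: each word shifted left by n. */
    __m256i hi = _mm256_sllv_epi64(v, _mm256_set1_epi64x((long long)n));
    /* VPSRLVQ: carried-in bits; a count of 64 (when n == 0) yields zero. */
    __m256i lo = _mm256_srlv_epi64(below, _mm256_set1_epi64x(64 - (long long)n));
    /* VPOR: combine both halves. */
    _mm256_storeu_si256((__m256i *)a, _mm256_or_si256(hi, lo));
#else
    /* Scalar fallback with identical results. */
    for (int i = 3; i >= 0; --i) {
        uint64_t lo = (i > 0 && n != 0) ? a[i - 1] >> (64 - n) : 0;
        a[i] = (a[i] << n) | lo;
    }
#endif
}
```

Shifts of 64 bits or more would additionally need whole-word moves, which a different immediate for the permute could provide; that part is left out to keep the sketch short.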

'It's the Memory, Stupid!'
Obviously I've not added in loop overhead or the loads and stores from and to memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles¹.
It is in clever management of memory that the major gains will be made.
The possible 12x speed-up from using AVX256 is small potatoes in comparison.

I'm not counting the setup of the shift counter in CL / (YMM3/YMM4), because I'm assuming you'll reuse that value over many iterations.

You're not going to beat that with AVX512 instructions, because consumer-grade CPUs with AVX512 instructions are not yet available.
The only processor that currently supports AVX512 is Knights Landing.

*) All these timings are best-case values, and should be taken as indications, not as hard values.
¹) Cost of a cache miss on Skylake: 42 cycles + 52 ns = 42 + (52 * 4.6 GHz) = 281 cycles.

Answer 1: score 5, accepted, 2016-09-01 17:36:38
