
Move float from high xmm quadword to low xmm quadword

MOVHPD extracts the high quadword of an xmm register into memory.

PEXTRQ extracts the high quadword of an xmm register and places it into an integer register (integers only).

SHUFPD shuffles.

VPSLLDQ causes the high quadword to be zeroed out.

Is there an instruction to move a floating-point value from the high quadword of an xmm register into the low quadword of the same xmm register or another xmm register? Or do I always have to go through memory (adding extra cycles)?

UPDATE: Based on comments below by @fuz and @Peter Cordes, here's what I did. This calls a rounding function for the low and high quadwords of xmm0 individually. Due to special rounding parameters, the function must be called for each qword individually, so it can't be a SIMD instruction. The goal is to round each of the qwords in xmm0 and put the result in xmm11.

movapd xmm2,xmm0 ;preserve both qwords of xmm0
call Round
movsd [scratch_register+0],xmm0 ; write low qword to memory
movhlps xmm0,xmm2
call Round
movsd [scratch_register+8],xmm0 ; write low qword to memory
movupd xmm11,[scratch_register]

UPDATE #2: @Peter Cordes showed how to do this without memory:

movhlps  xmm2, xmm0   ; extract high qword for later
call Round            ; round the low qword
movaps   xmm3, xmm0   ; save the result
movaps   xmm0, xmm2   ; set up the arg
call Round            ; round the high qword
movlhps  xmm3, xmm0   ; re-combine into xmm3

See Agner Fog's asm optimization guide. His chapter on SIMD has a table of shuffle instructions for different kinds of data movement. That gives you a small number of instructions to think about (or look up in Intel's manuals if you don't remember exactly what they do) to see if they're what you want.


The cheapest way to broadcast the high qword of a register to both elements is movhlps xmm0,xmm0. (Or for integer data, if your code might run on Nehalem, use punpckhqdq xmm0,xmm0 to avoid FP<->vec-int bypass delays.)

Without AVX, movhlps is nice because it does a slightly different shuffle than unpckhpd.

  • movhlps xmm3, xmm4 does xmm3[0] = xmm4[1], leaving xmm3[1] unchanged.
  • unpckhpd xmm3, xmm4 takes the high qwords from xmm3 and xmm4 and puts them in xmm3 in that order. So in the destination, the high qword moves to low, then the high qword from the src is copied over: xmm3[0] = xmm3[1]; xmm3[1] = xmm4[1]
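The two data movements in the list above can be checked from C with SSE intrinsics. This is only a sketch: the function names movhlps_model and unpckhpd_model are made up for illustration, and a compiler is free to pick a different instruction than the one each intrinsic is named after.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 header) */

/* Models `movhlps dst, src`: result[0] = src[1], result[1] = dst[1].
   (Hypothetical helper name; the casts only reinterpret bits.) */
static inline __m128d movhlps_model(__m128d dst, __m128d src)
{
    return _mm_castps_pd(_mm_movehl_ps(_mm_castpd_ps(dst), _mm_castpd_ps(src)));
}

/* Models `unpckhpd dst, src`: result[0] = dst[1], result[1] = src[1].
   (Hypothetical helper name.) */
static inline __m128d unpckhpd_model(__m128d dst, __m128d src)
{
    return _mm_unpackhi_pd(dst, src);
}
```

Calling movhlps_model(x, x) gives the broadcast of the high qword mentioned earlier, since both result elements then come from x[1].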

But unpcklpd is useless: it's 1 byte longer and does the same thing as SSE1 movlhps (copy the low qword from the src to the high qword of the destination, leaving the low qword of the destination unmodified). Same for movapd: always use movaps instead.

Also re: code-size: it costs a REX prefix to use xmm8..15, so choose your register allocation to use xmm8..15 in as few instructions as possible (or in ones that already need a REX prefix, e.g. for a pointer in r8..15). Code-size isn't usually a big deal, but all else being equal, smaller is normally best. Smaller instructions normally pack better into the uop cache.


With AVX, you can use vunpckhpd with either order of source operands, with the first src's high qword going to the low qword of the destination. There's no code-size advantage (or other perf advantage) for vmovhlps: they can both use a 2-byte VEX prefix for a minimum instruction size of 4 bytes.

e.g. vunpckhpd xmm0, xmm1, xmm0 is like vmovhlps xmm0, xmm0, xmm1.


You could use shufpd or vpshufd for the problem you're trying to solve. It's a waste of code size because it needs an immediate, but apparently you didn't realize that you can use shufpd xmm0, xmm0, 0b11 to take (in this order):

  • the low qword of the result from xmm0[1] (first src operand, selected by the low bit of the immediate)
  • the high qword of the result from xmm0[1] (second src operand, selected by the high bit of the immediate).

The shuffle control can read the same input element multiple times.
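As a rough illustration of how the two immediate bits select elements, here is the same shuffle written with the _mm_shuffle_pd intrinsic. The helper names broadcast_high and swap_halves are invented for this sketch, and the compiler may emit a different instruction than shufpd for them.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Models `shufpd x, x, 0b11`: both result elements come from x[1].
   Immediate bit 0 picks the element of the first source for result[0];
   bit 1 picks the element of the second source for result[1]. */
static inline __m128d broadcast_high(__m128d x)
{
    return _mm_shuffle_pd(x, x, 0x3);
}

/* Models `shufpd x, x, 0b01`: result[0] = x[1], result[1] = x[0],
   i.e. the two qwords swap places. */
static inline __m128d swap_halves(__m128d x)
{
    return _mm_shuffle_pd(x, x, 0x1);
}
```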


Interestingly, the NASM compiler will compile VUNPCKHPD with only two operands

NASM allows you to write instructions like vaddps xmm0, xmm0, xmm1 as vaddps xmm0, xmm1, omitting the separate destination operand when it's the same as the first source.

I'm puzzled because these values are double precision, not single, but it works.

Everything is just bits/bytes to be copied around. Unless you're using an FP computation instruction (e.g. addpd / addps), the "type" doesn't matter. You can tell by the presence or absence of a "SIMD Floating-Point Exceptions" section in the manual entry whether an instruction cares about the meaning of the bits as an FP bit pattern or not, e.g. addps: https://www.felixcloutier.com/x86/addps#simd-floating-point-exceptions . But there aren't any surprises: the only instructions that do care do so for very obvious reasons, like doing FP computation or type conversion, not just copying data around.

No real CPUs care about PS vs. PD instructions for performance, but some care about vec-int vs. vec-FP, so unfortunately it's not always a win to use pshufd to copy-and-shuffle FP data, or to use shufps as a 2-source integer shuffle.

Unfortunately before AVX512 there aren't general-purpose 2-source "integer" shuffles, only palignr and punpck instructions. And before AVX, there aren't FP copy-and-shuffle instructions. (And ironically, vpermilps with an immediate is redundant vs. vshufps dst, same, same, imm8 except for a memory-source load+shuffle, and should be avoided for code-size reasons. See: What's the point of the VPERMILPS instruction (_mm_permute_ps)?)


  movapd xmm2,xmm0 ;preserve both qwords of xmm0
  call Round
  movsd [scratch_register+0],xmm0 ; write low qword to memory
  movhlps xmm0,xmm2
  call Round

This is efficient shuffling, but unfortunately it creates a false dependency between the output of the first Round and the input to the second: movhlps merges into xmm0, which still holds the first result. So the two calls can't work in parallel. Instead, shuffle as you copy before the first call, preferably into a register you know has been "dead" for a while, or one that was part of the dependency chain for the value in xmm0 and so must be ready before it.

  movhlps  xmm2, xmm0   ; extract high qword for later
  call Round            ; round the low qword
  movaps   xmm3, xmm0   ; save the result
  movaps   xmm0, xmm2   ; set up the arg
  call Round            ; round the high qword
  movlhps  xmm3, xmm0   ; re-combine into xmm3

Unless you're running low on registers that your hand-written Round function doesn't touch, you don't particularly need memory and it's not more efficient.

As a bonus, all of those movaps and movhlps instructions are only 3 bytes long, and there's the same number of them as there are instructions in your version.
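The register-only sequence above corresponds roughly to the following C sketch. round_low is a hypothetical stand-in for the asker's Round (whose real rounding parameters are not shown in the question); here it uses the scalar double-to-int32 conversion, which rounds according to the current MXCSR mode (nearest-even by default), so it is only meaningful for values that fit in a 32-bit integer. round_both is likewise an invented name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical placeholder for Round: rounds the low element only, like a
   function that takes its argument and returns its result in xmm0.
   Assumption: the value fits in int32. */
static __m128d round_low(__m128d x)
{
    return _mm_cvtsi32_sd(x, _mm_cvtsd_si32(x));
}

/* Split, round each half separately, recombine, all in registers. */
static __m128d round_both(__m128d v)
{
    __m128d hi   = _mm_unpackhi_pd(v, v);  /* extract high qword before the first call */
    __m128d lo_r = round_low(v);           /* round the low qword */
    __m128d hi_r = round_low(hi);          /* round the high qword; no dependency on lo_r */
    return _mm_unpacklo_pd(lo_r, hi_r);    /* recombine: same data movement as movlhps */
}
```

Because hi is extracted before the first call, the second call's input does not depend on the first call's output, which is the point made about the false dependency.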

Another option (especially if your input was in a different register to start with) would be to Round the high half first; then you could put the high half back into xmm0 with movlhps.

And BTW, if you have SSE4.1, roundpd can round both qwords to integer values in one instruction, with the rounding mode set to nearest, towards +-Inf (ceil/floor), or towards 0 (truncation).
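If one of those standard rounding modes is enough (the question says Round has special parameters, so this may not apply), the roundpd suggestion looks roughly like this with intrinsics. round_both_nearest is an invented name, and the target attribute is a GCC/Clang extension used here as an assumption so the file compiles without -msse4.1; the code still needs an SSE4.1-capable CPU at run time.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Rounds both qwords to the nearest integer value (ties to even) in one
   roundpd, suppressing the precision exception. Hypothetical helper name. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))
#endif
static __m128d round_both_nearest(__m128d v)
{
    return _mm_round_pd(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}
```

Swapping the mode flag for _MM_FROUND_TO_NEG_INF, _MM_FROUND_TO_POS_INF, or _MM_FROUND_TO_ZERO gives floor, ceil, or truncation respectively.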


movsd [scratch_register+8],xmm0 ; write low qword to memory
movupd xmm11,[scratch_register]

Never do this: a narrow store followed by a wide reload is a guaranteed store-forwarding stall (~10 cycles extra latency).

Use a 16-byte aligned storage location (e.g. on the stack at [rsp+8] or something), and unpckhpd xmm0, [scratch_register] to load+shuffle.

Unfortunately Intel designed memory-source unpck instructions badly, so they require a 16-byte memory source, not just the 8 bytes they actually load/use. There are several cases where the
