
Moving 2 QWORDs from general purpose registers into an XMM register as high/low

Working with MASM (ml64), I'm trying to move two unsigned qwords from r9 and r10 into xmm0 as a single unsigned 128-bit integer.

So far I came up with this:

mov r9, 111             ;low qword for test
mov r10, 222            ;high qword for test

movq xmm0, r9           ;move low to xmm0 lower bits
movq xmm1, r10          ;move high to xmm1 lower bits
pslldq xmm1, 4          ;shift xmm1 lower half to higher half   
por xmm0, xmm1          ;or the 2 halves together

I think it works because

movq rax, xmm0

returns the correct low value

psrldq xmm0, 4
movq rax, xmm0

returns the correct high value

The question, though: is there a better way to do it? I'm browsing the Intel intrinsics guide, but I'm not very good at guessing the names of whatever instructions it may have.

Your byte-shift/OR is broken because you only shifted by 4 bytes not 8; it happens to work when your 8-byte qword test values don't have any bits set in their upper half.
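To make the 4-versus-8 point concrete, here is a small C sketch using SSE2 intrinsics (x86-64 only). The function names are made up for illustration, and the intrinsics mirror the movq / pslldq / por sequence from the question rather than being a recommended way to do it:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Same sequence as in the question: the high value is shifted by only 4 bytes. */
static __m128i combine_shift4(uint64_t lo, uint64_t hi)
{
    __m128i l = _mm_cvtsi64_si128((long long)lo);  /* movq xmm0, r9  */
    __m128i h = _mm_cvtsi64_si128((long long)hi);  /* movq xmm1, r10 */
    h = _mm_slli_si128(h, 4);                      /* pslldq xmm1, 4 (too short) */
    return _mm_or_si128(l, h);                     /* por xmm0, xmm1 */
}

/* Corrected: a qword is 8 bytes, so shift by 8. */
static __m128i combine_shift8(uint64_t lo, uint64_t hi)
{
    __m128i l = _mm_cvtsi64_si128((long long)lo);
    __m128i h = _mm_cvtsi64_si128((long long)hi);
    h = _mm_slli_si128(h, 8);                      /* pslldq xmm1, 8 */
    return _mm_or_si128(l, h);
}

/* Helpers to read back each 64-bit half. */
static uint64_t low_qword(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(v);                      /* movq rax, xmm0 */
}

static uint64_t high_qword(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(v, 8));   /* psrldq xmm0, 8 */
}
```

With inputs that use all 64 bits, the 4-byte version ORs part of the high value into the low qword, while the 8-byte version keeps the two halves separate.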


The SSE/AVX SIMD instruction sets include an unpack instruction you can use for this:

mov r9, 111         ; test input: low half
mov r10, 222        ; test input: high half

vmovq xmm0, r9      ; move 64 bit wide general purpose register into lower xmm half
vmovq xmm1, r10     ; ditto

vpunpcklqdq xmm0, xmm0, xmm1    ; i.e. xmm0 = low(xmm1) : low(xmm0)  (high qword : low qword)

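That is, vpunpcklqdq unpacks (interleaves) the low quadword (64 bits) of each source into one double quadword (the full 128-bit XMM register width): the low qword of the first source becomes the low half, and the low qword of the second source becomes the high half.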

In comparison with your original snippet you save one instruction.

(I've used the VEX-encoded AVX mnemonics. If you want to target SSE2, remove the v prefix and use the two-operand form, e.g. punpcklqdq xmm0, xmm1.)
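Since the question mentions the intrinsics guide: the unpack approach looks roughly like this in C with SSE2 intrinsics. This is only a sketch for x86-64; the function names are mine, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Build a 128-bit value from two 64-bit halves using an unpack. */
static __m128i combine_unpack(uint64_t lo, uint64_t hi)
{
    __m128i l = _mm_cvtsi64_si128((long long)lo);  /* movq xmm0, r9  */
    __m128i h = _mm_cvtsi64_si128((long long)hi);  /* movq xmm1, r10 */
    return _mm_unpacklo_epi64(l, h);               /* punpcklqdq xmm0, xmm1 */
}

/* Alternative: let the compiler choose the instruction sequence.
   Note the argument order: high half first, low half second. */
static __m128i combine_set(uint64_t lo, uint64_t hi)
{
    return _mm_set_epi64x((long long)hi, (long long)lo);
}

/* Helpers to read back each 64-bit half. */
static uint64_t unpack_lo_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(v);
}

static uint64_t unpack_hi_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
}
```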


Alternatively, you can use an insert instruction to move the second value into the upper half:

mov r9, 111         ; test input
mov r10, 222        ; test input

vmovq xmm0, r9      ; move 64 bit wide general purpose register into lower xmm half

vpinsrq xmm0, xmm0, r10, 1    ; i.e. xmm0 = r10 : low(xmm0)  (high qword : low qword)

Execution-wise, at the micro-op level, this doesn't make much of a difference: vpinsrq is about as 'expensive' as vmovq + vpunpcklqdq, but it encodes into shorter code.

The non-AVX version of this requires SSE4.1 for pinsrq.
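The insert variant can be sketched with the SSE4.1 intrinsic _mm_insert_epi64. The TARGET_SSE41 macro and the function names below are my own additions: with GCC or Clang the macro enables SSE4.1 for just that one function so no extra compiler flag is needed, and it expands to nothing elsewhere. Treat it as an illustration for x86-64, not a drop-in.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics (pulls in SSE2 as well) */
#include <stdint.h>

/* Hypothetical helper macro: per-function SSE4.1 on GCC/Clang. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Low half via movq, high half via pinsrq with index 1. */
TARGET_SSE41
static __m128i combine_insert(uint64_t lo, uint64_t hi)
{
    __m128i v = _mm_cvtsi64_si128((long long)lo);   /* movq xmm0, r9        */
    return _mm_insert_epi64(v, (long long)hi, 1);   /* pinsrq xmm0, r10, 1  */
}

/* Helpers to read back each 64-bit half (SSE2 only). */
static uint64_t insert_lo_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(v);
}

static uint64_t insert_hi_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(v, 8));
}
```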

With a little help from your stack:

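    push   r10                          ; high qword ends up at [rsp+8]
    push   r9                           ; low qword ends up at [rsp]
ifdef ALIGNED
    movdqa xmm0, xmmword ptr [rsp]      ; requires rsp to be 16-byte aligned here
else
    movdqu xmm0, xmmword ptr [rsp]      ; no alignment requirement
endif
    add    rsp, 16                      ; release the two pushed qwords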

If your __uint128 already lives on the stack, just strip the superfluous instructions (the two pushes and the stack adjustment) and load it directly.
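The go-through-memory idea looks like this in C with SSE2 intrinsics. It is a sketch with a made-up function name; a local two-element array stands in for the pushed stack slots, and the unaligned load is used so no alignment assumption is needed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Store both halves to memory (low first), then load all 16 bytes at once. */
static __m128i combine_via_memory(uint64_t lo, uint64_t hi)
{
    uint64_t pair[2];
    pair[0] = lo;   /* lower address = low qword  (like push r9, done last) */
    pair[1] = hi;   /* higher address = high qword (like push r10)          */
    return _mm_loadu_si128((const __m128i *)pair);   /* movdqu xmm0, [mem] */
}

/* Helpers to read back each 64-bit half. */
static uint64_t mem_lo_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(v);
}

static uint64_t mem_hi_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(v, 8));
}
```

If the 128-bit value is already in memory, only the load line is needed.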
