
Possible to mul r1,r1?

If I have

movmr x,r1

Is it possible to do?

mul r1,r1 

As in (x*x). I'm trying to do this efficiently to save bytes. This is the best solution I can think of so far, but I can't find out whether it's allowed.

The whole equation is (x+y)(x-y), and so I reduced it to x^2 - y^2.

Additionally, in case you were wondering, f+d / exe is counted per byte.

OPC = 8bits, x/y = 20bits, reg = 3bits. So movmr x,r1 is 4f+d and 4 exe

Edit: We're using a Linux-based system.

OPC|DST,SRC,xx| <= |1byte|1byte|

Most ISAs don't have this kind of restriction, and any that do will document it.

Normally instructions read all their input operands before writing any of their output operands, so it's fine if they overlap. Any restrictions will always be documented in ISA manuals / instruction-set references.
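As a rough illustration of why overlap is usually harmless, here is a toy C model of a two-operand mul dst, src. The register file `regs` and the helper `toy_mul` are hypothetical names made up for this sketch, not part of any real ISA. The only point is that both source values are read before the destination is written.

```c
#include <stdint.h>

/* Hypothetical 8-entry register file for a toy machine (assumption). */
static uint32_t regs[8];

/* Toy model of `mul dst, src`: dst = dst * src.
   Both operands are read into temporaries before the write-back,
   so dst == src (e.g. mul r1,r1) simply squares the register. */
static void toy_mul(unsigned dst, unsigned src)
{
    uint32_t a = regs[dst];   /* read first operand  */
    uint32_t b = regs[src];   /* read second operand */
    regs[dst] = a * b;        /* write the result last */
}
```

With regs[1] holding 7, toy_mul(1, 1) leaves 49 in regs[1]. Whether your particular ISA behaves like this model is still something only its manual can confirm.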

You usually only find restrictions with instructions that write more than one register, in which case unpredictable behaviour or an illegal instruction exception is normal when you give the same register for two outputs. For example, AVX512 vpgatherqq:

The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX.

The AVX2 version doesn't mention this in the ISA ref manual, but I forget if there's a rule against it anywhere else.


One case where it is illegal is ARM: MUL Rd, Rm, Rs does Rd := Rm × Rs

In early ARM versions (?), the behaviour is unpredictable if Rd and Rm are the same register (ARM wiki, and some version of official ARM docs). Perhaps early microarchitectures did some kind of multi-step micro-coded calculation and accumulated the result in the destination register.

MUL     r1,r1,r6    ; incorrect: Rd cannot be the same as Rm
MUL     r1,r6,r1    ; correct:  r1 *= r6

A later version of ARM documentation doesn't mention this restriction, so I guess it doesn't apply to later architectures? Or Google isn't finding good ISA docs; these seem to be docs for ARM's assembler. It's certainly likely that later ARM architecture versions don't have the restriction, but IDK why later docs don't mention when the restriction was removed.

davespace says that it's Rs and Rm (the two source operands) that can't be the same. That doesn't match what any other docs say, and makes less sense microarchitecturally, so I think it's wrong.


There's also a restriction on ARM's 32x32 => 64 bit full-multiply umull RdLo, RdHi, Rm, Rs: RdLo, RdHi, and Rm all have to be different registers.

UMULL  r1, r0, r0, r0     ; unpredictable, RdHi and Rm are the same.
UMULL  r1, r2, r0, r0     ; r2:r1  =  r0*r0  (RdLo = r1, RdHi = r2)
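For reference, the arithmetic that a 32x32 => 64 bit full multiply performs can be sketched in C as below. The function name `umull_model` and its argument order are my own choices for illustration. The sketch models only the result, not the register-overlap restriction.

```c
#include <stdint.h>

/* Model of a 32x32 => 64-bit unsigned multiply:
   the full product is split into a high and a low 32-bit half. */
static void umull_model(uint32_t *hi, uint32_t *lo, uint32_t rm, uint32_t rs)
{
    uint64_t product = (uint64_t)rm * rs;  /* widen before multiplying */
    *hi = (uint32_t)(product >> 32);       /* upper 32 bits */
    *lo = (uint32_t)product;               /* lower 32 bits */
}
```

Because the instruction produces two output registers, an implementation that writes one half early could clobber a source it still needs, which is one plausible reason for the "all different registers" rule.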

The whole equation is (x+y)(x-y), and so I reduced it to x^2 - y^2.

That transformation makes it more expensive, not less, in the absence of any surrounding code. add/sub are cheaper than multiply: better throughput and lower latency. On x86, given x and y in registers, you'd do

; x=eax
; y=edx

lea  ecx, [rax + rdx]     ; x+y
sub  eax, edx             ; x-y
imul ecx, eax             ; (x+y) * (x-y)

This has 4-cycle latency on Intel SnB-family (3-cycle imul, and lea/sub can run in parallel; see http://agner.org/optimize/), vs.

imul  eax, eax
imul  edx, edx
sub   eax, edx

This has 5-cycle latency if eax and edx are ready at the same time. No existing x86 CPUs have more than 1 scalar multiply execution unit, so there's a resource conflict: the 2nd imul has to wait a cycle before it can execute. Depending on the surrounding code, port1 might not be a throughput bottleneck, and maybe one or the other of the inputs is ready a cycle earlier anyway.
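If you want to convince yourself that the two instruction sequences compute the same value, including when the 32-bit arithmetic wraps around, here is a small C sketch. The function names are made up for this example, and unsigned types are used so that overflow is well-defined wrapping like the x86 instructions.

```c
#include <stdint.h>

/* (x+y)*(x-y): one multiply; the add and sub are independent. */
static uint32_t diff_squares_factored(uint32_t x, uint32_t y)
{
    return (x + y) * (x - y);
}

/* x*x - y*y: two multiplies followed by one subtract. */
static uint32_t diff_squares_expanded(uint32_t x, uint32_t y)
{
    return x * x - y * y;
}
```

Both return the same result for every input, because the identity holds in arithmetic modulo 2^32. The difference is only in cost, not in the value produced.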

However, if x or y is invariant, the x^2 - y^2 form lets you compute a new (x+y) * (x-y) more cheaply, with just 2 instructions, by CSEing the square that doesn't change.

The imul/imul/sub version destroys both inputs, so if you need x or y afterwards you need a mov. The lea/sub/imul version preserves y (in edx) and leaves x-y in a register (eax).
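For the invariant case mentioned above, a minimal C sketch might look like the following, assuming y stays fixed while x changes on each iteration. The function name and the array interface are hypothetical, chosen only to make the hoisting visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes out[i] = (x[i]+y)*(x[i]-y) for a loop-invariant y,
   using the expanded form so that y*y is computed only once. */
static void diff_squares_fixed_y(const uint32_t *x, uint32_t y,
                                 uint32_t *out, size_t n)
{
    uint32_t y_squared = y * y;            /* hoisted out of the loop */
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] * x[i] - y_squared;  /* one multiply + one subtract */
}
```

Per element this is one multiply and one subtract, versus add, subtract and multiply for the factored form. A compiler will usually do this hoisting for you if you write x*x - y*y directly in the loop.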
