If I have
movmr x,r1
is it possible to then do
mul r1,r1
to compute x*x? I'm trying to do this efficiently to save bytes. This is the best solution I can think of so far, but I can't find out whether it's allowed.
The whole equation is (x+y)(x-y), and so I reduced it to x^2 - y^2.
Additionally, in case you were wondering, the f+d / exe cost is counted per byte.
OPC = 8 bits, x/y = 20 bits, reg = 3 bits, so movmr x,r1 costs 4 f+d and 4 exe.
Edit: We're using a Linux-based system.
OPC|DST,SRC,xx| <= |1byte|1byte|
Most ISAs don't have this kind of restriction, and any that do will document it.
Normally instructions read all their input operands before writing any of their output operands, so it's fine if they overlap. Any restrictions will always be documented in ISA manuals / instruction-set references.
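As a rough illustration of the read-before-write behaviour described above, here is a small Python sketch of a two-operand multiply on a toy register file. It does not model any real ISA: the register names, the 32-bit width, and the mul helper are all assumptions made up for the example, and the question's machine may behave differently if its documentation says so.

```python
MASK32 = 0xFFFFFFFF  # assumed register width (hypothetical, for illustration)

def mul(regs, dst, src):
    """Toy two-operand multiply: dst = dst * src.

    Both source values are read before the destination is written,
    so using the same register for dst and src simply squares it.
    """
    a = regs[dst]                    # read first input (also the destination)
    b = regs[src]                    # read second input
    regs[dst] = (a * b) & MASK32     # write the result only after both reads

regs = {"r1": 7, "r2": 3}
mul(regs, "r1", "r1")                # like "mul r1,r1"
print(regs["r1"])                    # 49
```

If an ISA instead wrote partial results into the destination while still reading the sources, overlapping operands would break, and that is the kind of case a manual has to call out.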
You usually only find restrictions with instructions that write more than one register, in which case unpredictable behaviour or an illegal instruction exception is normal when you give the same register for two outputs. For example, the documentation for AVX512 vpgatherqq says:
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX.
The AVX2 version has a similar rule: as far as I can tell from Intel's manual, it #UDs if any pair of the index, mask, and destination registers are the same.
One case where it is illegal is ARM: MUL Rd, Rm, Rs
does Rd := Rm × Rs
In early ARM versions(?), the behaviour is unpredictable if Rd and Rm are the same register (according to an ARM wiki and some version of the official ARM docs). Perhaps early microarchitectures did some kind of multi-step micro-coded calculation and accumulated the result in the destination register.
MUL r1,r1,r6 ; incorrect: Rd cannot be the same as Rm
MUL r1,r6,r1 ; correct: r1 *= r6
A later version of the ARM documentation doesn't mention this restriction, so I guess it doesn't apply to later architectures. Or maybe Google just isn't finding good ISA docs: the ones I found seem to be docs for ARM's assembler. It's certainly likely that later ARM architecture versions don't have the restriction, but I don't know why later docs don't mention when it was removed.
davespace says that it's Rs and Rm (the two source operands) that can't be the same. That doesn't match what any other docs say, and makes less sense microarchitecturally, so I think it's wrong.
There's also a restriction on ARM's 32x32 => 64-bit full multiply umull RdLo, RdHi, Rm, Rs: RdLo, RdHi, and Rm all have to be different registers.
UMULL r1, r0, r0, r0 ; unpredictable: RdHi and Rm are the same
UMULL r2, r1, r0, r0 ; r1:r2 = r0*r0 (RdLo = r2, RdHi = r1)
"The whole equation is (x+y)(x-y) and so I reduced it to x^2 - y^2."
That transformation makes it more expensive, not less, in the absence of any surrounding code. add/sub are cheaper than multiply: better throughput and lower latency. On x86, given x and y in registers, you'd do
; x=eax
; y=edx
lea ecx, [rax + rdx] ; x+y
sub eax, edx ; x-y
imul ecx, eax ; (x+y) * (x-y)
This has 4-cycle latency on Intel SnB-family: imul is 3 cycles, and lea/sub can run in parallel in the cycle before it (see http://agner.org/optimize/). Compare that with:
imul eax, eax
imul edx, edx
sub eax, edx
This has 5-cycle latency if eax and edx are ready at the same time. No existing x86 CPUs have more than 1 scalar multiply execution unit, so there's a resource conflict: the 2nd imul has to wait a cycle before it can execute. Depending on the surrounding code, port1 might not be a throughput bottleneck, and maybe one or the other of the inputs is ready a cycle earlier anyway.
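Both instruction sequences produce the same 32-bit result, even when intermediate values wrap around, because the identity holds in arithmetic modulo 2^32. A minimal Python sketch to check that, modelling 32-bit registers with a mask (the function names are invented for this example):

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers such as eax / edx

def product_form(x, y):
    """(x + y) * (x - y): one add, one sub, one multiply."""
    s = (x + y) & MASK32          # lea ecx, [rax + rdx]
    d = (x - y) & MASK32          # sub eax, edx
    return (s * d) & MASK32       # imul ecx, eax

def squares_form(x, y):
    """x*x - y*y: two multiplies, one sub."""
    x_sq = (x * x) & MASK32       # imul eax, eax
    y_sq = (y * y) & MASK32       # imul edx, edx
    return (x_sq - y_sq) & MASK32 # sub eax, edx

print(product_form(5, 3), squares_form(5, 3))  # 16 16
```

This only checks that the results agree. It says nothing about latency or code size, which is where the two versions differ.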
However, if x or y is invariant (for example across loop iterations), the x^2 - y^2 form lets you compute each new (x+y) * (x-y) more cheaply, with just 2 instructions (one imul and one sub), by CSEing the square that doesn't change.
The imul/imul/sub version destroys both inputs, so if you need x or y afterwards you need an extra mov. The lea/sub/imul version preserves y (in edx) and leaves x-y in a register (eax).
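The invariant case can be sketched in Python like this, assuming y stays fixed while x changes. The function name and the 32-bit mask are choices made for the example, not something from the question:

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers

def diff_of_squares_many(xs, y):
    """Compute x*x - y*y for each x, with y*y hoisted out of the loop."""
    y_sq = (y * y) & MASK32       # computed once: the invariant square
    results = []
    for x in xs:
        # Per element: one multiply and one subtract.
        results.append(((x * x) - y_sq) & MASK32)
    return results

print(diff_of_squares_many([5, 4, 3], 3))  # [16, 7, 0]
```

With the (x+y) * (x-y) form, each new x would still need an add, a subtract, and a multiply, so hoisting the square saves one instruction per element.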