
Possible to mul r1,r1?

If I have

movmr x,r1

Is it possible to do?

mul r1,r1 

As in (x*x). I'm trying to do this efficiently to save bytes, but this is the best solution I can think of so far, and I can't find out whether it's allowed.

The whole equation is (x+y)(x-y), so I reduced it to x^2 - y^2.

Additionally, if you were wondering, f+d / exe is counted per byte.

OPC = 8 bits, x/y = 20 bits, reg = 3 bits. So movmr x,r1 is 4 f+d and 4 exe.

Edit: We're using a Linux-based system.

OPC|DST,SRC,xx| <= |1byte|1byte|

Most ISAs don't have this kind of restriction, and any that do will document it.

Normally instructions read all their input operands before writing any of their output operands, so it's fine if they overlap. Any restrictions will always be documented in ISA manuals / instruction-set references.
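To make the read-before-write point concrete, here is a minimal C sketch of a made-up register machine. The register count, the `mul_rr` name, and the two-operand `mul dst, src` semantics are assumptions for illustration, not taken from the asker's ISA or any real one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-register machine; names are illustrative only. */
enum { NUM_REGS = 8 };

/* Model of "mul dst, src": read both inputs first, then write the result. */
static void mul_rr(uint32_t regs[NUM_REGS], unsigned dst, unsigned src)
{
    assert(dst < NUM_REGS && src < NUM_REGS);
    uint32_t a = regs[dst];   /* read phase: first input  */
    uint32_t b = regs[src];   /* read phase: second input */
    regs[dst] = a * b;        /* write phase, after both reads are done */
}
```

With r1 holding x, calling `mul_rr(regs, 1, 1)` leaves x*x in r1, because both reads complete before the single write.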

You usually only find restrictions with instructions that write more than one register, in which case unpredictable behaviour or an illegal instruction exception is normal when you give the same register for two outputs. For example, AVX512 vpgatherqq:

The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX.

The AVX2 version doesn't mention this in the ISA ref manual, but I forget if there's a rule against it anywhere else.


One case where it is illegal is ARM: MUL Rd, Rm, Rs does Rd := Rm × Rs.

In early ARM versions(?), the behaviour is unpredictable if Rd and Rm are the same register (ARM wiki, and some version of the official ARM docs). Perhaps early microarchitectures did some kind of multi-step micro-coded calculation and accumulated the result in the destination register.

MUL     r1,r1,r6    ; incorrect: Rd cannot be the same as Rm
MUL     r1,r6,r1    ; correct:  r1 *= r6
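As a purely speculative illustration of that guess, the C sketch below models a shift-and-add multiply that latches Rs once, clears Rd, and then accumulates shifted copies of Rm into Rd. This is an assumed toy model (the `mul_accumulate_model` name and the algorithm are made up here), not a description of how any real ARM core works; it only shows why aliasing Rd with Rm could break while aliasing Rd with Rs would not.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_REGS = 16 };

/* Toy model of "MUL Rd, Rm, Rs" that accumulates directly in Rd.
 * Assumption: Rs is copied into an internal latch up front, while Rm is
 * re-read from the register file on every step. */
static void mul_accumulate_model(uint32_t regs[NUM_REGS],
                                 unsigned rd, unsigned rm, unsigned rs)
{
    assert(rd < NUM_REGS && rm < NUM_REGS && rs < NUM_REGS);
    uint32_t s = regs[rs];        /* latch the multiplier once */
    regs[rd] = 0;                 /* clear the accumulator: clobbers Rm if rd == rm */
    for (unsigned i = 0; s != 0; ++i, s >>= 1) {
        if (s & 1u)
            regs[rd] += regs[rm] << i;   /* add the shifted multiplicand */
    }
}
```

In this model, Rd = r1, Rm = r6, Rs = r1 still produces r1 * r6, but Rd = r1, Rm = r1, Rs = r6 produces 0, since the multiplicand is wiped before it is ever read.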

A later version of ARM documentation doesn't mention this restriction, so I guess it doesn't apply to later architectures? Or Google isn't finding good ISA docs; these seem to be docs for ARM's assembler. It's certainly likely that later ARM architecture versions don't have the restriction, but I don't know why later docs don't mention when the restriction was removed.

davespace says that it's Rs and Rm (the two source operands) that can't be the same. That doesn't match what any other docs say, and makes less sense microarchitecturally, so I think it's wrong.


There's also a restriction on ARM's 32x32 => 64-bit full multiply umull RdLo, RdHi, Rm, Rs: RdLo, RdHi, and Rm all have to be different registers.

UMULL  r1, r0, r0, r0     ; unpredictable: RdHi and Rm are the same
UMULL  r2, r1, r0, r0     ; r1:r2  =  r0*r0

The whole equation is (x+y)(x-y) and so i reduced it to x^2 - y^2.

That transformation makes it more expensive, not less, in the absence of any surrounding code. add/sub are cheaper than multiply: better throughput and lower latency. On x86, given x and y in registers, you'd do

; x=eax
; y=edx

lea  ecx, [rax + rdx]     ; x+y
sub  eax, edx             ; x-y
imul ecx, eax             ; (x+y) * (x-y)

4 cycle latency on Intel SnB-family (3-cycle imul, and lea/sub can run in parallel; see http://agner.org/optimize/), vs.

imul  eax, eax
imul  edx, edx
sub   eax, edx

This has 5 cycle latency if eax and edx are ready at the same time. No existing x86 CPUs have more than 1 scalar multiply execution unit, so there's a resource conflict: the 2nd imul has to wait a cycle before it can execute. Depending on the surrounding code, port1 might not be a throughput bottleneck, and maybe one or the other of the inputs is ready a cycle earlier anyway.

However, if x or y is invariant, you can compute a new (x+y) * (x-y) more cheaply this way with just 2 instructions, CSEing the square that doesn't change.

The imul/imul/sub version destroys both inputs, so if you need x or y after this you need a mov. The other version preserves y (in edx) and leaves x-y in a register.
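For reference, here is a minimal C sketch of the two forms and of the invariant-y case. The function names are made up for illustration, and unsigned 32-bit arithmetic is assumed so that wraparound is well-defined; a compiler is free to turn one form into the other, so this shows the algebra rather than guaranteed code generation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* (x+y)*(x-y): one multiply, plus an add and a sub that don't depend on each other. */
static uint32_t diff_sq_factored(uint32_t x, uint32_t y)
{
    return (x + y) * (x - y);
}

/* x*x - y*y: two multiplies and one sub. */
static uint32_t diff_sq_expanded(uint32_t x, uint32_t y)
{
    return x * x - y * y;
}

/* Loop-invariant y: square it once, leaving one multiply and one sub per element. */
static void diff_sq_many(const uint32_t *xs, size_t n, uint32_t y, uint32_t *out)
{
    assert(n == 0 || (xs != NULL && out != NULL));
    const uint32_t y_sq = y * y;            /* computed once, outside the loop */
    for (size_t i = 0; i < n; ++i)
        out[i] = xs[i] * xs[i] - y_sq;      /* imul + sub per element */
}
```

Both forms agree modulo 2^32 even when the intermediate sums or squares wrap, which is why the choice between them is purely about cost.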
