
Test whether a register is zero with CMP reg,0 vs OR reg,reg?

Is there any execution speed difference using the following code:

cmp al, 0
je done

and the following:

or al, al
jz done

I know that the JE and JZ instructions are the same, and also that using OR gives a size improvement of one byte. However, I am also concerned with code speed. It seems that logical operators will be faster than a SUB or a CMP, but I just wanted to make sure. This might be a trade-off between size and speed, or a win-win (of course the code will be more opaque).

Yes, there is a difference in performance.

The best choice for comparing a register with zero is test reg,reg. It sets FLAGS the same way cmp reg,0 would, and is at least as fast (see footnote 1) as any other way, with smaller code-size.

(Even better is when ZF is already set appropriately by the instruction that set reg, so you can just branch, setcc, or cmovcc directly. For example, the bottom of a normal loop often looks like dec ecx / jnz .loop_top. Most x86 integer instructions "set flags according to the result", including ZF=1 if the output was 0.)
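As a minimal sketch of branching on flags that an earlier instruction already set, the loop below counts down with dec and branches on its ZF result. The routine name zero_bytes and its register arguments are made up for illustration, and the count is assumed to be greater than zero.

```nasm
; NASM syntax, x86-64. Hypothetical routine:
;   rdi = buffer pointer, ecx = byte count (assumed > 0)
bits 64
section .text
global zero_bytes

zero_bytes:
.loop_top:
    mov     byte [rdi], 0       ; loop body: store one zero byte
    inc     rdi                 ; advance the pointer (its flag results are overwritten by dec)
    dec     ecx                 ; count down; ZF=1 when ecx reaches 0
    jnz     .loop_top           ; branch on dec's ZF directly, no test/cmp needed
    ret
```

No separate test ecx,ecx is needed because dec already left ZF describing the new value of ecx.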

or reg,reg can't macro-fuse with a JCC into a single uop on any existing x86 CPU, and adds latency for anything that later reads reg because it rewrites the value into the register. cmp's downside is usually just code-size.

Footnote 1: There is a possible exception, but only on obsolete P6-family CPUs (Intel up to Nehalem, replaced by Sandybridge-family in 2011). See below about avoiding register-read stalls by rewriting the same value into a register. Other microarchitecture families don't have such stalls, and there's never any upside to using or instead of test.


The FLAGS results of test reg,reg / and reg,reg / or reg,reg are identical to cmp reg,0 in all cases (except for AF) because:

  • CF = OF = 0 because test / and / or always do that, and for cmp because subtracting zero can't overflow or carry.
  • ZF, SF, PF are set according to the result (i.e. reg): reg&reg for test, or reg - 0 for cmp.

(AF is undefined after test, but set according to the result for cmp. I'm ignoring it because it's really obscure: the only instructions that read AF are the BCD-adjust instructions like AAS and DAS, and lahf / pushf.)

You can of course check conditions other than reg == 0 (ZF), e.g. test for negative signed integers by looking at SF. But fun fact: jl, the signed less-than condition, is more efficient than js on some CPUs after a cmp. They're equivalent after a compare against zero because OF=0, so the l condition (SF!=OF) is equivalent to SF.

Every CPU that can macro-fuse TEST/JL can also macro-fuse TEST/JS, even Core 2. But after CMP byte [mem], 0, always use JL not JS to branch on the sign bit because Core 2 can't macro-fuse that. (At least in 32-bit mode; Core 2 can't macro-fuse at all in 64-bit mode.)

The signed-compare conditions also let you do stuff like jle or jg, looking at ZF as well as SF!=OF.
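A small sketch of branching on several conditions from a single test reg,reg, relying on the flag rules described above (CF = OF = 0, so jg reduces to "ZF=0 and SF=0"). The routine name classify_sign and its calling convention are assumptions for illustration only.

```nasm
; NASM syntax, x86-64. Hypothetical routine:
;   edi = signed input; returns eax = -1, 0, or 1
bits 64
section .text
global classify_sign

classify_sign:
    xor     eax, eax            ; result 0; done before test because xor writes FLAGS
    test    edi, edi            ; ZF, SF, PF from edi; CF = OF = 0
    jz      .done               ; edi == 0
    mov     eax, 1              ; mov leaves FLAGS untouched
    jg      .done               ; ZF=0 and SF==OF: edi > 0
    mov     eax, -1             ; otherwise SF=1: edi < 0 (js or jl would both detect this)
.done:
    ret
```

Both branches consume the flags from the one test instruction, since nothing between them writes FLAGS.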


test is shorter to encode than cmp with immediate 0, in all cases except the cmp al, imm8 special case, which is still two bytes.

Even then, test is preferable for macro-fusion reasons (with jle and similar on Core2), and because having no immediate at all can possibly help uop-cache density by leaving a slot that another instruction can borrow if it needs more space (SnB-family).
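For reference, here are the register-against-zero forms from this discussion with the machine-code bytes an assembler typically emits for them (assuming it picks the short sign-extended imm8 form of cmp, as NASM does by default).

```nasm
; NASM syntax, x86-64. Encodings shown in the comments.
bits 64
section .text
    cmp     eax, 0      ; 83 F8 00   3 bytes: opcode, ModRM, imm8
    test    eax, eax    ; 85 C0      2 bytes
    or      eax, eax    ; 09 C0      2 bytes, but writes eax
    cmp     al, 0       ; 3C 00      2 bytes: short AL, imm8 form
    test    al, al      ; 84 C0      2 bytes
    or      al, al      ; 08 C0      2 bytes, but writes al
```

So for AL specifically, or saves nothing over cmp, and test matches or in size everywhere.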


Macro-fusion of test/jcc into a single uop in the decoders

The decoders in Intel and AMD CPUs can internally macro-fuse test and cmp with some conditional branch instructions into a single compare-and-branch operation. This gives you a max throughput of 5 instructions per cycle when macro-fusion happens, vs. 4 without macro-fusion. (For Intel CPUs since Core2.)

Recent Intel CPUs can macro-fuse some instructions (like and and add / sub) as well as test and cmp, but or is not one of them. AMD CPUs can only merge test and cmp with a JCC. See "x86_64 - Assembly - loop conditions and out of order", or just refer directly to Agner Fog's microarch docs for the details of which CPU can macro-fuse what. test can macro-fuse in some cases where cmp can't, e.g. with js.

Almost all simple ALU ops (bitwise boolean, add/sub, etc.) run in a single cycle. They all have the same "cost" in tracking them through the out-of-order execution pipeline. Intel and AMD spend the transistors to make fast execution units to add/sub/whatever in a single cycle. Yes, bitwise OR or AND is simpler, and probably uses slightly less power, but still can't run any faster than one clock cycle.


or reg,reg adds another cycle of latency to the dependency chain for following instructions that need to read the register. It's an x |= x in the chain of operations that lead to the value you want.
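A sketch of where that extra cycle shows up: a pointer-chasing loop in which the register being tested is also the loop-carried value. The node layout (next pointer in the first qword), the routine names, and the non-null, null-terminated list are all assumptions made for this example.

```nasm
; NASM syntax, x86-64. Hypothetical routines:
;   rdi = first node (assumed non-null), next pointer at [node + 0]
;   returns eax = number of nodes
bits 64
section .text
global count_nodes_or
global count_nodes_test

count_nodes_or:
    xor     eax, eax
.next:
    inc     eax             ; count this node (not part of the pointer chain)
    mov     rdi, [rdi]      ; load the next pointer: the loop-carried dependency
    or      rdi, rdi        ; rewrites rdi, so the next load also waits for this OR
    jnz     .next
    ret

count_nodes_test:
    xor     eax, eax
.next:
    inc     eax
    mov     rdi, [rdi]      ; the next load depends only on this load
    test    rdi, rdi        ; reads rdi, writes only FLAGS
    jnz     .next
    ret
```

In the or version each iteration's load address comes out of the OR, so the chain is load latency plus one cycle; in the test version it is load latency alone.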


You might think that extra register write would also need an extra physical register-file (PRF) entry vs. test, but that's probably not the case. (See https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about PRF capacity impact on out-of-order exec.)

test has to produce its FLAGS output somewhere. On Intel Sandybridge-family CPUs at least, when an instruction produces a register and a FLAGS result, both of them are stored together in the same PRF entry. (Source: an Intel patent I think. This is from memory but seems like an obviously sane design.)

An instruction like cmp or test that only produces a FLAGS result also needs a PRF entry for its output. Presumably this is slightly worse: the old physical register is still "alive", referenced as the holder of the value of the architectural register written by some older instruction. And now architectural EFLAGS (or more specifically, both the separately-renamed CF and SPAZO flag groups) point to this new physical register in the RAT (register allocation table) updated by the renamer. Of course, the next FLAGS-writing instruction will overwrite that, allowing that PR to be freed once all its readers have read it and executed. This is not something I think about when optimizing, and I don't think it tends to matter in practice.


P6-family register-read stalls: possible upside to or reg,reg

P6-family CPUs (PPro / PII to Nehalem) have a limited number of register-read ports for the issue/rename stage to read "cold" values (not forwarded from an in-flight instruction) from the permanent register file, but recently-written values are available directly from the ROB. Rewriting a register unnecessarily can make it live in the forwarding network again to help avoid register-read stalls. (See Agner Fog's microarch pdf.)

Re-writing a register with the same value on purpose to keep it "hot" can actually be an optimization for some cases of surrounding code, on P6. Early P6-family CPUs couldn't do macro-fusion at all, so you aren't even missing out on that by using and reg,reg instead of test. But Core 2 (in 32-bit mode) and Nehalem (in any mode) can macro-fuse test/jcc, so you're missing out on that.

(and is equivalent to or for this purpose on P6-family, but less bad if your code ever runs on a Sandybridge-family CPU: it can macro-fuse and / jcc but not or / jcc. The extra cycle of latency in the dep-chain for the register is still a disadvantage on P6, especially if the critical path involving it is the main bottleneck.)

P6 family is very much obsolete these days (Sandybridge replaced it in 2011), and CPUs before Core 2 (Core, Pentium M, PIII, PII, PPro) are very obsolete and getting into retrocomputing territory, especially for anything where performance matters. You can ignore P6-family when optimizing unless you have a specific target machine in mind (e.g. if you have a crusty old Nehalem Xeon machine) or you're tuning a compiler's -mtune=nehalem settings for the few users still left.

If you're tuning something to be fast on Core 2 / Nehalem, use test unless profiling shows that register-read stalls are a big problem in a specific case, and using and actually fixes it.

On earlier P6-family, and reg,reg might be ok as your default code-gen choice when the value isn't part of a problematic loop-carried dep chain, but is read later. Or if it is, but there's also a specific register-read stall that you can fix with and reg,reg.

If you only want to test the low 8 bits of a full register, test al,al avoids writing a partial register, which on P6-family is renamed separately from the full EAX/RAX. or al,al is much worse if you later read EAX or AX: partial-register stall on P6-family. (See "Why doesn't GCC use partial registers?")
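A minimal sketch of the low-byte case, with the or al,al alternative left as a comment for contrast. The routine name double_if_low_byte_set and its behaviour are invented for this example.

```nasm
; NASM syntax, x86-64. Hypothetical routine:
;   edi = input; returns eax = 2*input if the low byte is non-zero, else 0
bits 64
section .text
global double_if_low_byte_set

double_if_low_byte_set:
    mov     eax, edi
    test    al, al          ; reads AL, writes only FLAGS
    ; or    al, al          ; would write AL: the full-EAX read below would then
    ;                       ; need a partial-register merge on P6-family
    jz      .zero
    add     eax, eax        ; reads the full EAX
    ret
.zero:
    xor     eax, eax
    ret
```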


History of the unfortunate or reg,reg idiom

The or reg,reg idiom may have come from 8080 ORA A, as pointed out in a comment.

8080's instruction set doesn't have a test instruction, so your choices for setting flags according to a value included ORA A and ANA A. (Notice that the A register destination is baked in to the mnemonic for both those instructions, and there aren't instructions to OR into different registers: it's a 1-address machine except for mov, while 8086 is a 2-address machine for most instructions.)

8080 ORA A was the usual go-to way to do it, so presumably that habit carried over into 8086 assembly programming as people ported their asm sources. (Or used automatic tools; 8086 was intentionally designed for easy / automatic asm-source porting from 8080 code.)

This bad idiom continues to be blindly used by beginners, presumably taught by people who learned it back in the day and passed it on without thinking about the obvious critical-path latency downside for out-of-order execution. (Or the other more subtle problems like no macro-fusion.)


Delphi's compiler reportedly uses or eax,eax, which was maybe a reasonable choice at the time (before Core 2), assuming that register-read stalls were more important than lengthening the dep chain for whatever reads it next. I don't know if that's true or if they were just using the ancient idiom without thinking about it.

Unfortunately, compiler-writers at the time didn't know the future, because and eax,eax performs exactly equivalently to or eax,eax on Intel P6-family, but is less bad on other uarches because and can macro-fuse on Sandybridge-family. (See the P6 section above.)


Value in memory: maybe use cmp or load it into a reg.

To test a value in memory, you can cmp dword [mem], 0, but Intel CPUs can't macro-fuse flag-setting instructions that have both an immediate and a memory operand. If you're going to use the value after the compare in one side of the branch, you should mov eax, [mem] / test eax,eax or something. If not, either way is 2 front-end uops, but it's a tradeoff between code-size and back-end uop count.

Although note that some addressing modes won't micro-fuse either on SnB-family: RIP-relative + immediate won't micro-fuse in the decoders, or an indexed addressing mode will un-laminate after the uop-cache. Either way leading to 3 fused-domain uops for cmp dword [rsi + rcx*4], 0 / jne or [rel some_static_location].

On i7-6700k Skylake (tested with perf events uops_issued.any and uops_executed.thread):

  • mov reg, [mem] (or movzx) + test reg,reg / jnz: 2 uops in both fused and unfused domains, regardless of addressing mode, or movzx instead of mov. Nothing to micro-fuse; does macro-fuse.
  • cmp byte [rip+static_var], 0 + jne: 3 fused, 3 unfused (front and back ends). The RIP-relative + immediate combination prevents micro-fusion. It also doesn't macro-fuse. Smaller code-size but less efficient.
  • cmp byte [rsi + rdi], 0 (indexed addr mode) / jne: 3 fused, 3 unfused. Micro-fuses in the decoders, but un-laminates at issue/rename. Doesn't macro-fuse.
  • cmp byte [rdi + 16], 0 + jne: 2 fused, 3 unfused uops. Micro-fusion of cmp load+ALU did happen because of the simple addressing mode, but the immediate prevents macro-fusion. About as good as load + test + jnz: smaller code-size but 1 extra back-end uop.

If you have a 0 in a register (or a 1 if you want to compare a bool), you can cmp [mem], reg / jne for even fewer uops, as low as 1 fused-domain, 2 unfused. But RIP-relative addressing modes still don't macro-fuse.
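The three ways of testing a value in memory discussed here can be sketched side by side. The routine names and the convention (rdi points at a dword, eax returns 1 if it is non-zero, else 0) are assumptions for illustration; the uop counts in the comments restate the Skylake measurements above for a simple [reg] addressing mode.

```nasm
; NASM syntax, x86-64. Hypothetical routines:
;   rdi = pointer to a dword; returns eax = 1 if it is non-zero, else 0
bits 64
section .text
global nonzero_load_test
global nonzero_cmp_imm
global nonzero_cmp_reg

nonzero_load_test:
    xor     eax, eax
    mov     edx, [rdi]          ; separate load; edx stays usable after the branch
    test    edx, edx            ; macro-fuses with jnz: 2 uops for load + test/jnz
    jnz     .yes
    ret
.yes:
    mov     eax, 1
    ret

nonzero_cmp_imm:
    xor     eax, eax
    cmp     dword [rdi], 0      ; load+cmp micro-fuse, but the immediate blocks macro-fusion
    jne     .yes
    ret
.yes:
    mov     eax, 1
    ret

nonzero_cmp_reg:
    xor     eax, eax            ; the zero doubles as the comparison operand
    cmp     [rdi], eax          ; no immediate: can micro-fuse and macro-fuse with jne
    jne     .yes
    ret
.yes:
    mov     eax, 1
    ret
```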

Compilers tend to use load + test/jcc even when the value isn't used later.

You could also test a value in memory with test dword [mem], -1, but don't. Since test r/m16/32/64, sign-extended-imm8 isn't available, it's worse code-size than cmp for anything larger than bytes. (I think the design idea was that if you only want to test the low bit of a register, just test cl, 1 instead of test ecx, 1, and use cases like test ecx, 0xfffffff0 are rare enough that it wasn't worth spending an opcode. Especially since that decision was made for 8086 with 16-bit code, where it was only the difference between an imm8 and imm16, not imm32.)

(I wrote -1 rather than 0xFFFFFFFF so it would be the same with byte or qword. ~0 would be another way to write it.)
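To illustrate the code-size point, here are the memory forms with the bytes an assembler typically emits for a plain [rdi] operand (assuming the short imm8 form of cmp is chosen, as NASM does by default).

```nasm
; NASM syntax, x86-64. Encodings shown in the comments.
bits 64
section .text
    cmp     dword [rdi], 0      ; 83 3F 00             3 bytes: sign-extended imm8
    test    dword [rdi], -1     ; F7 07 FF FF FF FF    6 bytes: full imm32 required
    cmp     byte  [rdi], 0      ; 80 3F 00             3 bytes
    test    byte  [rdi], -1     ; F6 07 FF             3 bytes: same size only for bytes
```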


It depends on the exact code sequence, which specific CPU it is, and other factors.

The main problem with or al,al is that it "modifies" EAX, which means that a subsequent instruction that uses EAX in some way may stall until this instruction completes. Note that the conditional branch (jz) also depends on the instruction, but CPU manufacturers do a lot of work (branch prediction and speculative execution) to mitigate that. Also note that in theory it would be possible for a CPU manufacturer to design a CPU that recognises EAX isn't changed in this specific case, but there are hundreds of these special cases and the benefits of recognising most of them are too small.

The main problem with cmp al,0 is that it's slightly larger, which might mean slower instruction fetch/more cache pressure, and (if it is a loop) might mean that the code no longer fits in some CPU's "loop buffer".

As Jester pointed out in comments, test al,al avoids both problems: it's smaller than cmp al,0 and doesn't modify EAX.

Of course (depending on the specific sequence) the value in AL must've come from somewhere, and if it came from an instruction that set flags appropriately it might be possible to modify the code to avoid using another instruction to set flags again later.
