
How do (x86) virtual machines generally handle flags?

For a side project, I am attempting to write a semi-programmable x86 virtual machine.

I understand the instruction formats, so most of the design is relatively simple, but executing an instruction on its operands often changes flags. It would be very inefficient to check each potential bit, so I was thinking of popping the flag register into the VM, ANDing it, and then setting the VM's flag register. However, this is still a lot of overhead.

This is borderline opinion-based, but is there something I am missing?

If you want your emulator to simulate the processor as it is, then yes, you need to emulate the flags exactly.

This means clearing bits that need to be cleared (with AND), setting bits that need to be set (with OR), and copying or calculating bits when required (i.e. the Z flag requires testing whether the result is zero, the carry flag requires knowing whether the result overflowed, etc.).
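As a rough illustration of that clear-with-AND, set-with-OR approach, here is a minimal sketch of an eagerly evaluated 8-bit ADD. The flag bit positions are the architectural EFLAGS positions; the `vm_t` struct and the `vm_add8` name are made up for this example and are not taken from any particular emulator.

```c
#include <stdint.h>

/* EFLAGS bit positions of the six arithmetic flags. */
#define FLAG_CF (1u << 0)
#define FLAG_PF (1u << 2)
#define FLAG_AF (1u << 4)
#define FLAG_ZF (1u << 6)
#define FLAG_SF (1u << 7)
#define FLAG_OF (1u << 11)
#define ARITH_FLAGS (FLAG_CF | FLAG_PF | FLAG_AF | FLAG_ZF | FLAG_SF | FLAG_OF)

/* Hypothetical VM state: only the flags register is modelled here. */
typedef struct { uint32_t eflags; } vm_t;

/* Emulate an 8-bit ADD and recompute every arithmetic flag right away. */
static inline uint8_t vm_add8(vm_t *vm, uint8_t a, uint8_t b)
{
    unsigned wide = (unsigned)a + b;
    uint8_t res = (uint8_t)wide;
    uint32_t f = 0;

    if (wide > 0xFF)                  f |= FLAG_CF; /* carry out of bit 7 */
    if (res == 0)                     f |= FLAG_ZF;
    if (res & 0x80)                   f |= FLAG_SF;
    if ((a ^ b ^ res) & 0x10)         f |= FLAG_AF; /* carry out of bit 3 */
    if ((a ^ res) & (b ^ res) & 0x80) f |= FLAG_OF; /* signed overflow */

    /* PF is set when the low 8 bits contain an even number of 1 bits. */
    uint8_t p = res;
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    if (!(p & 1))                     f |= FLAG_PF;

    /* AND clears the old arithmetic flags, OR installs the new ones. */
    vm->eflags = (vm->eflags & ~ARITH_FLAGS) | f;
    return res;
}
```

Calling `vm_add8(&vm, 0xFF, 0x01)` returns 0 and leaves CF, PF, AF and ZF set, while any non-arithmetic bits already in `eflags` are left alone.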

There is no way around it.

This is just like decoding the ModR/M byte. There is no way around loading that byte, checking the mode to determine whether the operand is a register or a memory access, and acting accordingly...
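For comparison, the unavoidable part of ModR/M decoding is only a few shifts and masks. This is a sketch, not a full decoder: the field layout (mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0) is architectural, while the `modrm_t` type and the function names are invented here, and SIB and displacement handling is left out.

```c
#include <stdint.h>
#include <stdbool.h>

/* The three fields of an x86 ModR/M byte. */
typedef struct {
    uint8_t mod; /* bits 7..6: addressing mode */
    uint8_t reg; /* bits 5..3: register or opcode extension */
    uint8_t rm;  /* bits 2..0: register or base of the memory operand */
} modrm_t;

/* Split a raw ModR/M byte into its fields. */
static inline modrm_t decode_modrm(uint8_t byte)
{
    modrm_t m;
    m.mod = (uint8_t)(byte >> 6);
    m.reg = (uint8_t)((byte >> 3) & 7);
    m.rm  = (uint8_t)(byte & 7);
    return m;
}

/* mod == 3 selects a register operand; every other mode is a memory access. */
static inline bool modrm_is_register(modrm_t m)
{
    return m.mod == 3;
}
```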

And in effect, that means your emulator will be "much slower" (unless you are emulating an old 10 MHz processor on a modern 3 GHz processor, in which case you have about 300 host cycles for every emulated cycle... so you should be just fine).

If you're interested, I wrote a 6502 emulator and got it tested with an Apple 2 ROM. I had to add sleeps to keep it from running at 100 MHz or more... (that processor originally ran at 1 MHz...)

You appear to be asking about emulating x86, not virtualizing it. Since modern x86 hardware supports virtualization, where the CPU runs guest code natively and only traps to the hypervisor for some privileged instructions, that's what the term "virtualization" normally means.


Lazy flag evaluation is typical. Instead of actually calculating all the flags, just save the operands from the last instruction that set flags. Then if something actually reads the flags, figure out what the flag values need to be.
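A minimal sketch of that idea, assuming only three kinds of 32-bit operations: each emulated instruction records what it did, and a flag is worked out only when something asks for it. All type and function names here (`lazy_flags_t`, `lz_add32`, `lazy_cf` and so on) are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { LAZY_ADD32, LAZY_SUB32, LAZY_LOGIC32 } lazy_op_t;

/* Everything needed to reconstruct the flags later on. */
typedef struct {
    lazy_op_t op;
    uint32_t a, b, res;
} lazy_flags_t;

/* Each operation just records its inputs and result: no flag work yet. */
static inline uint32_t lz_add32(lazy_flags_t *lf, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    *lf = (lazy_flags_t){ LAZY_ADD32, a, b, r };
    return r;
}

static inline uint32_t lz_sub32(lazy_flags_t *lf, uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    *lf = (lazy_flags_t){ LAZY_SUB32, a, b, r };
    return r;
}

static inline uint32_t lz_and32(lazy_flags_t *lf, uint32_t a, uint32_t b)
{
    uint32_t r = a & b;
    *lf = (lazy_flags_t){ LAZY_LOGIC32, a, b, r };
    return r;
}

/* Flags are derived on demand, e.g. when a Jcc or PUSHF is emulated. */
static inline bool lazy_zf(const lazy_flags_t *lf) { return lf->res == 0; }
static inline bool lazy_sf(const lazy_flags_t *lf) { return (lf->res >> 31) != 0; }

static inline bool lazy_cf(const lazy_flags_t *lf)
{
    switch (lf->op) {
    case LAZY_ADD32: return lf->res < lf->a; /* unsigned wrap-around */
    case LAZY_SUB32: return lf->a < lf->b;   /* borrow */
    default:         return false;           /* AND/OR/XOR clear CF */
    }
}

static inline bool lazy_of(const lazy_flags_t *lf)
{
    switch (lf->op) {
    case LAZY_ADD32: return (((lf->a ^ lf->res) & (lf->b ^ lf->res)) >> 31) != 0;
    case LAZY_SUB32: return (((lf->a ^ lf->b) & (lf->a ^ lf->res)) >> 31) != 0;
    default:         return false;           /* AND/OR/XOR clear OF */
    }
}
```

The cost of a flag-setting instruction drops to a few stores, and a run of arithmetic instructions whose flags are never read pays nothing more than that.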

This means you don't actually have to calculate PF and AF every time they're written (almost every instruction), only every time they're read (mostly only PUSHF or interrupts; hardly any code ever reads PF, except for FP branches where it means NaN). Computing PF after every integer instruction is expensive in pure C, since it requires a popcount on the low 8 bits of the result. (And I think C compilers generally don't manage to recognize that pattern and use setp themselves, let alone a pushf or lahf to store multiple flags, when compiling an x86 emulator to run on an x86 host. They do sometimes recognize population-count patterns and emit popcnt instructions, though, when targeting host CPUs that have that feature, e.g. -march=nehalem.)
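To give a feel for the per-instruction cost being avoided, here is one way to compute PF in portable C without a popcount instruction. It folds the low byte down to a nibble and looks the answer up in a 16-bit constant; `parity_flag` is a made-up name, and this is only one of several possible formulations.

```c
#include <stdint.h>
#include <stdbool.h>

/* PF is set when the low 8 bits of the result have an even number of 1 bits. */
static inline bool parity_flag(uint32_t result)
{
    uint32_t b = result & 0xFF;
    b ^= b >> 4; /* fold the high nibble into the low nibble */
    /* 0x9669 is a 16-entry table: bit i is 1 when i has even parity. */
    return ((0x9669u >> (b & 0xF)) & 1u) != 0;
}
```

Even in this compact form it is a handful of dependent operations per instruction, which is why deferring it until PF is actually read pays off.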

BOCHS uses this technique, and describes the implementation in some detail in the Lazy Flags section of this short PDF: How Bochs Works Under the Hood, 2nd edition. They save the result, from which they can derive ZF, SF, and PF, plus the carry-outs from the high 2 bits (for CF and OF) and from bit 3 (for AF). With this, they never need to replay an instruction to compute its flag results.
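The scheme described above can be sketched roughly as follows. This is written in the spirit of that description, not copied from Bochs: the idea is to store the result plus a "carry-out vector" in which bit i holds the carry (or borrow) out of bit position i. The struct and function names are assumptions made for this example.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t result;
    uint32_t cout; /* bit i = carry (or borrow) out of bit i */
} cv_flags_t;

/* Carry out of each bit of a + b, reconstructed from operands and result. */
static inline uint32_t add_cout_vec(uint32_t a, uint32_t b, uint32_t r)
{
    return (a & b) | ((a | b) & ~r);
}

/* Borrow out of each bit of a - b. */
static inline uint32_t sub_cout_vec(uint32_t a, uint32_t b, uint32_t r)
{
    return (~a & b) | ((~a ^ b) & r);
}

static inline uint32_t cv_add32(cv_flags_t *s, uint32_t a, uint32_t b)
{
    s->result = a + b;
    s->cout = add_cout_vec(a, b, s->result);
    return s->result;
}

static inline uint32_t cv_sub32(cv_flags_t *s, uint32_t a, uint32_t b)
{
    s->result = a - b;
    s->cout = sub_cout_vec(a, b, s->result);
    return s->result;
}

/* CF: carry out of the top bit. OF: carries out of the top two bits differ. */
static inline bool cv_cf(const cv_flags_t *s) { return ((s->cout >> 31) & 1u) != 0; }
static inline bool cv_of(const cv_flags_t *s) { return (((s->cout >> 31) ^ (s->cout >> 30)) & 1u) != 0; }
/* AF: carry out of bit 3. */
static inline bool cv_af(const cv_flags_t *s) { return ((s->cout >> 3) & 1u) != 0; }
/* ZF and SF come straight from the saved result. */
static inline bool cv_zf(const cv_flags_t *s) { return s->result == 0; }
static inline bool cv_sf(const cv_flags_t *s) { return (s->result >> 31) != 0; }
```

Because the carry-out vector is computed once from the operands, the readers no longer need to know which instruction produced it, which is what removes the need to replay the instruction.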

There are extra complications from some instructions not writing all the flags (i.e. partial-flag updates), and presumably from instructions like BSF that set ZF based on the input, not the output.
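One possible way to cope with a partial-flag update, sketched here for INC (which writes OF, SF, ZF, AF and PF but leaves CF untouched): compute the old CF from the previous lazy state before overwriting it, and carry it along. The `pu_` names and the struct layout are hypothetical, and a real emulator may well handle this differently.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { PU_ADD32, PU_INC32 } pu_op_t;

typedef struct {
    pu_op_t op;
    uint32_t a, b, res;
    bool saved_cf; /* only meaningful when op == PU_INC32 */
} pu_flags_t;

static inline bool pu_cf(const pu_flags_t *f)
{
    if (f->op == PU_INC32)
        return f->saved_cf; /* INC does not write CF */
    return f->res < f->a;   /* ADD: unsigned wrap-around */
}

static inline bool pu_zf(const pu_flags_t *f) { return f->res == 0; }

/* The ADD overflow formula also covers INC, since INC is an add of 1. */
static inline bool pu_of(const pu_flags_t *f)
{
    return (((f->a ^ f->res) & (f->b ^ f->res)) >> 31) != 0;
}

static inline uint32_t pu_add32(pu_flags_t *f, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    *f = (pu_flags_t){ PU_ADD32, a, b, r, false };
    return r;
}

/* Assumes *f already holds a valid state from an earlier instruction. */
static inline uint32_t pu_inc32(pu_flags_t *f, uint32_t a)
{
    bool cf = pu_cf(f); /* resolve the old CF before the state is replaced */
    uint32_t r = a + 1;
    *f = (pu_flags_t){ PU_INC32, a, 1, r, cf };
    return r;
}
```

The price is that INC has to resolve one flag eagerly, so a partial update is somewhat more expensive than a full one in this scheme.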


Further reading:

This paper on emulators.com gives a lot of detail on how to efficiently save enough state to reconstruct flags. It has a section titled "2.1 Lazy Arithmetic Flags for CPU Emulation".

One of the authors is Darek Mihocka (a long-time emulator writer, now apparently working at Intel). He has written a lot of interesting material about making non-JIT emulators run fast, and about CPU performance in general, much of it posted on his site, http://www.emulators.com/. E.g. this article about avoiding branch mispredictions in an emulator's interpreter loop, which dispatches to functions that implement each opcode, is quite interesting. Darek is also a co-author of the article about BOCHS internals that I linked earlier.

A Google hit for lazy flag eval may also be relevant: https://silviocesare.wordpress.com/2009/03/08/lazy-eflags-evaluation-and-other-emulator-optimisations/

The last time emulation of x86-like flags came up, the discussion in the comments on my lazy-flags answer had some interesting points: e.g. @Raymond Chen suggested that link to the Mihocka & Troeger paper, and @amdn pointed out that JIT dynamic translation can produce faster emulation than interpretation.

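Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original source. For any questions, contact yoyou2525@163.com.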

 