

Why does the Mac ABI require 16-byte stack alignment for x86-32?

Original post: 2009-03-04 21:12:41. Tags: macos, memory-alignment, callstack, calling-convention, abi

I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4-byte boundaries only. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then the callee should ensure the alignment is correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance because every call site must manage this requirement. Am I missing something?

Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:

Consistency between the PPC, x86, and x64 versions of the OS.
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention, where the caller cleans up the stack.
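The sub esp,xxx pattern from the list above can be sketched as plain arithmetic. This is only an illustration under stated assumptions: a frame-pointer prologue (push ebp; mov ebp,esp), a stack that is 16-byte aligned at the point of each call instruction, and a hypothetical helper name frame_adjust that does not come from any compiler.

```c
#include <assert.h>

/* Assumed model: when "sub esp, N" executes, the return address (4 bytes)
   and the saved ebp (4 bytes) have already been pushed below a point
   that was 16-byte aligned in the caller. */
enum { STACK_ALIGN = 16, RET_AND_EBP = 8 };

/* Hypothetical helper: smallest N >= needed such that RET_AND_EBP + N
   is a multiple of STACK_ALIGN, so esp is aligned again at the next call.
   "needed" is the byte count for locals plus outgoing arguments. */
static unsigned frame_adjust(unsigned needed)
{
    unsigned total = needed + RET_AND_EBP;
    unsigned rounded = (total + STACK_ALIGN - 1) / STACK_ALIGN * STACK_ALIGN;
    assert(rounded % STACK_ALIGN == 0);
    return rounded - RET_AND_EBP;
}
```

Under this model, a function needing 12 bytes for locals and outgoing arguments would use sub esp,24: the padding is folded into a constant that the prologue needs anyway, and the arguments are then stored with mov into the reserved area.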

The issue I have with the last item is that for calling conventions that rely on the callee cleaning the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e. any code that isn't intended to be called from other languages or sources)? This stack-alignment requirement could negate some of the performance gains achieved by passing some parameters in registers.

Update: So far the only real answers have been consistency, but to me that's a bit too easy an answer. I have well over 20 years of experience with the x86 architecture, and if consistency, not performance or something else concrete, is really the reason, then I respectfully suggest that it is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tools vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.

I'll give this topic another day or so, then close it...

Related

It's my stack frame, I don't care about your stack frame!

10 answers

From "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:

"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."

From Appendix D:

"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."

http://www.intel.com/Assets/PDF/manual/248966.pdf

I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movaps vs movups), and every x86 machine that runs Mac OS X has at least SSE2. Alignment can be taken care of by the application itself, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc.
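To make the aligned versus unaligned distinction concrete: movaps faults on a memory operand that is not 16-byte aligned, while movups accepts any address, so code that cannot rely on alignment has to test the address first. A minimal portable sketch of that test follows; the names is_aligned16, pick_load, and samples are made up for illustration, and no SSE instruction is actually issued here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Which kind of load a caller could use for a given address. */
enum load_kind { LOAD_ALIGNED, LOAD_UNALIGNED };

/* Returns 1 if p sits on a 16-byte boundary (low four address bits zero). */
static int is_aligned16(const void *p)
{
    assert(p != 0);
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical dispatcher: an aligned load is only safe when the test passes. */
static enum load_kind pick_load(const float *p)
{
    return is_aligned16(p) ? LOAD_ALIGNED : LOAD_UNALIGNED;
}

/* Illustrative buffer; alignas(16) (C11) forces a 16-byte boundary. */
static alignas(16) float samples[8];
```

With the buffer above, samples and samples + 4 are 16-byte aligned, while samples + 1 is only 4-byte aligned and would need the unaligned path. If the ABI guarantees the stack alignment, the compiler can skip this kind of test for stack-allocated vectors.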

I believe it's to keep it in line with the x86-64 ABI.

First, note that the 16-byte alignment is an exception introduced by Apple to the System V IA-32 ABI.

The stack alignment is only needed when calling system functions, because many system libraries use SSE or AltiVec extensions, which require 16-byte alignment. I found an explicit reference in the libgmalloc man page.

You are free to handle your own stack frame however you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.

Edit: For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.

This is an efficiency issue.

Making each function that uses the SSE instructions ensure for itself that the stack is 16-byte aligned adds a lot of overhead to using those instructions, effectively reducing performance.
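As a rough sketch of what that per-function work looks like: without an ABI guarantee, a callee typically realigns with something like and esp,-16 in its prologue and has to keep the old stack pointer around to undo it. The C below only models that arithmetic on a 32-bit stack pointer value; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models "and esp, -16": clears the low four bits, which moves the
   stack pointer down (the x86 stack grows toward lower addresses). */
static uint32_t realign_down16(uint32_t sp)
{
    uint32_t aligned = sp & ~(uint32_t)15;
    assert(aligned <= sp);
    return aligned;
}

/* Bytes skipped by the realignment. The amount is only known at run
   time, so the callee must save the original sp (typically in ebp)
   to restore it in the epilogue. */
static uint32_t realign_slack(uint32_t sp)
{
    return sp - realign_down16(sp);
}
```

For an incoming stack pointer of 0xbffff7ac, this gives 0xbffff7a0 with 12 bytes of slack. The extra instruction and the forced frame pointer are the overhead that a fixed ABI-level alignment avoids.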

On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (measured in instructions, at least). It only involves changing a constant in the prologue of the function.

Wasting stack space is cheap; the stack is probably the hottest part of the cache.

My guess is that Apple believes everyone just uses Xcode (gcc), which aligns the stack for you. So requiring the stack to be aligned, so that the kernel doesn't have to align it, is just a micro-optimization.

Hmm, didn't the OS X ABI also do funny RISC-like things, such as passing small structs in registers?

So that points to the theory of consistency with other platforms.

Come to think of it, the FreeBSD syscall API also aligns 64-bit values (e.g. for lseek and mmap).

While I cannot really answer your question of WHY, you may find the manuals at the following site useful:

http://www.agner.org/optimize/

Regarding the ABI, have a look especially at:

http://www.agner.org/optimize/calling_conventions.pdf

Hope that's useful.

In order to maintain consistency in the kernel. This allows the same kernel to be booted on multiple architectures without modification.

Not sure why no one has considered the possibility of easy portability from the legacy PowerPC-based platform.

Read this:

http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20

Then look under "32-bit PowerPC Function Calling Conventions", and finally at this:

"These are the embedding alignment modes available in the 32-bit PowerPC environment:

Power alignment mode is derived from the alignment rules used by the IBM XLC compiler for the AIX operating system. It is the default alignment mode for the PowerPC-architecture version of GCC used on AIX and Mac OS X. Because this mode is most likely to be compatible between PowerPC-architecture compilers from different vendors, it's typically used with data structures that are shared between different programs."

In view of the legacy PowerPC-based background of OS X, portability is a major consideration: it dictates following the convention all the way back to AIX's XLC compiler. When you consider the need to make sure all the tools and applications work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.

That gives the philosophy. Reading further, the rule is mentioned explicitly ("Prolog and Epilog"):

"The called function is responsible for allocating its own stack frame, making sure to preserve 16-byte alignment in the stack. This operation is accomplished by a section of code called the prolog, which the compiler places before the body of the subroutine. After the body of the subroutine, the compiler places an epilog to restore the processor to the state it was prior to the subroutine call."