
Memory alignment on a 32-bit Intel processor

Intel's 32-bit processors such as the Pentium have a 64-bit-wide data bus and therefore fetch 8 bytes per access. Based on this, I'm assuming that the physical addresses these processors emit on the address bus are always multiples of 8.

Firstly, is this conclusion correct?

Secondly, if it is correct, then one should align data structure members on an 8-byte boundary. But I've seen people using a 4-byte alignment instead on these processors.

How can they be justified in doing so?

The usual rule of thumb (straight from Intel's and AMD's optimization manuals) is that every data type should be aligned by its own size. An int32 should be aligned on a 32-bit boundary, an int64 on a 64-bit boundary, and so on. A char will fit just fine anywhere.

Another rule of thumb is, of course, "the compiler has been told about alignment requirements". You don't need to worry about it, because the compiler knows to add the right padding and offsets to allow efficient access to data.
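To make the padding point concrete, here is a minimal C11 sketch. The struct `sample` and the helper `sample_layout_is_aligned` are made up for illustration; the actual offsets depend on the compiler and ABI, so the code only checks that each member lands on a multiple of its own alignment rather than assuming specific numbers.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct, used only to illustrate compiler-inserted padding. */
struct sample {
    char    tag;    /* 1 byte; padding may follow so that count is aligned */
    int32_t count;  /* offset is a multiple of alignof(int32_t) */
    char    flag;   /* 1 byte; padding may follow so that total is aligned */
    int64_t total;  /* offset is a multiple of alignof(int64_t) */
};

/* Returns 1 if every checked member offset is a multiple of that member's
   alignment and the struct size is a multiple of the struct's alignment. */
int sample_layout_is_aligned(void)
{
    return offsetof(struct sample, count) % alignof(int32_t) == 0
        && offsetof(struct sample, total) % alignof(int64_t) == 0
        && sizeof(struct sample) % alignof(struct sample) == 0;
}
```

Printing `offsetof` and `sizeof` for such a struct on your own target is the quickest way to see what the compiler actually chose.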

The only exception is when working with SIMD instructions, where on most compilers you have to ensure alignment manually.
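As a rough sketch of what "manually" can look like in C11, `alignas` can request the stricter alignment explicitly. The 16-byte figure assumes 128-bit SSE-style vectors, and `lanes`, `simd_lanes` and `is_aligned` are illustrative names, not part of any library.

```c
#include <stdalign.h>
#include <stdint.h>

/* Four floats fill one 128-bit vector; alignas(16) requests 16-byte
   alignment (assumption: the target uses 128-bit SSE-style vectors). */
static alignas(16) float lanes[4] = {1.0f, 2.0f, 3.0f, 4.0f};

/* Returns 1 if p is a multiple of alignment (alignment must be a power of two). */
int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Accessor for the aligned buffer. */
const float *simd_lanes(void)
{
    return lanes;
}
```

For heap memory, C11 `aligned_alloc` plays the same role where it is available.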

"Secondly, if it is correct, then one should align data structure members on an 8-byte boundary. But I've seen people using a 4-byte alignment instead on these processors."

I don't see how that makes a difference. The CPU can simply issue a read for the 64-bit block that contains those 4 bytes. That means it gets 4 extra bytes either before or after the requested data. But in both cases, it only takes a single read. 32-bit alignment of 32-bit-wide data ensures that it won't cross a 64-bit boundary.
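The boundary argument above can be checked with a few lines of C. This is only an arithmetic sketch on plain integers; `crosses_boundary` and `aligned_words_never_cross` are hypothetical helpers, and `size` is assumed to be at least 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the access [addr, addr + size) touches more than one
   naturally aligned block of `block` bytes. Assumes size >= 1. */
int crosses_boundary(uintptr_t addr, size_t size, size_t block)
{
    uintptr_t first = addr / block;               /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) / block;  /* block holding the last byte */
    return first != last;
}

/* Returns 1 if no 4-byte access at a 4-byte-aligned address below `limit`
   crosses an 8-byte boundary. */
int aligned_words_never_cross(uintptr_t limit)
{
    for (uintptr_t addr = 0; addr < limit; addr += 4) {
        if (crosses_boundary(addr, 4, 8)) {
            return 0;
        }
    }
    return 1;
}
```

A 4-byte value at address 6, by contrast, occupies bytes 6 through 9 and so needs two 8-byte reads.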

The physical bus is 64 bits wide, so are the addresses multiples of 8? Yes.

HOWEVER, there are two more factors to consider:

  1. Some x86 instructions are byte-addressed. Some are 32-bit aligned (that's why you see the 4-byte alignment). But no (core) instructions are 64-bit aligned. The CPU can handle misaligned data access.
  2. If you care about performance, you should think about the cache line, not main memory. Cache lines are much wider.

They are justified in doing so because changing to 8-byte alignment would constitute an ABI change, and the marginal performance improvement is not worth the trouble.

As someone else already said, cache lines matter. All accesses on the actual memory bus are in terms of cache lines (64 bytes on x86, IIRC). See the "What Every Programmer Should Know About Memory" doc that was mentioned already. So the actual memory traffic is 64-byte aligned.
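If cache lines are the unit that matters, one way to keep a small hot structure inside a single line is to align it to the line size. The sketch below assumes a 64-byte line, as recalled above; `CACHE_LINE`, `struct counters`, `stats` and both helpers are illustrative names only.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; check it for the target CPU */

/* Hypothetical pair of counters that are usually touched together. */
struct counters {
    alignas(CACHE_LINE) uint64_t hits;  /* struct starts on a line boundary */
    uint64_t misses;
};

static struct counters stats;

/* Index of the cache line that contains the given address. */
uintptr_t line_index(uintptr_t addr)
{
    return addr / CACHE_LINE;
}

/* Returns 1 if the first byte of hits and the last byte of misses
   fall in the same cache line. */
int counters_share_one_line(void)
{
    uintptr_t first = (uintptr_t)&stats.hits;
    uintptr_t last  = (uintptr_t)&stats.misses + sizeof stats.misses - 1;
    return line_index(first) == line_index(last);
}
```

The trade-off is size: the alignment pads the struct up to a full line, which only pays off for data that is accessed often.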

For random access, and as long as the data is not misaligned (e.g. crossing a boundary), I don't think it matters much; the correct address and offset in the data can be found with a simple AND construct in hardware. It gets slow when one read access is not sufficient to get one value. That's also why compilers usually put small values (bytes etc.) together, because they don't have to be at a specific offset; shorts should be on even addresses, 32-bit values on 4-byte addresses and 64-bit values on 8-byte addresses.
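The "simple AND construct" can be written out in C as two masks. This is a software sketch of the idea, not a description of any particular chip; `BUS_BYTES` assumes the 64-bit bus from the question, and the function names are made up.

```c
#include <stdint.h>

#define BUS_BYTES 8u  /* assumed 64-bit data bus, i.e. 8 bytes per access */

/* Address of the 8-byte block that contains addr: clear the low 3 bits. */
uintptr_t block_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(BUS_BYTES - 1);
}

/* Byte offset of addr inside its 8-byte block: keep only the low 3 bits. */
unsigned block_offset(uintptr_t addr)
{
    return (unsigned)(addr & (BUS_BYTES - 1));
}
```

For example, address 0x1004 lies in the block starting at 0x1000, at offset 4.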

Note that if you have caching involved and linear data access, things will be different.

The 64-bit bus you refer to feeds the caches. The CPU always reads and writes entire cache lines. The size of a cache line is always a multiple of 8, and its physical address is indeed aligned on an 8-byte boundary.

Cache-to-register transfers do not use the external data bus, so the width of that bus is irrelevant.
