
How does the CPU retrieve multiple bytes from memory?

Hi, I'm just a newbie to assembly programming. I'm confused about how the CPU retrieves multiple bytes (e.g. 32 bits on a 32-bit machine) from memory. Say we have an integer i which occupies 4 bytes in memory, starting at address 0x100. In IA-32 assembly we just write something like:

movl 8(%esp), %eax

where esp is the current stack pointer and 8 is the offset from the stack pointer to the variable i. When this IA-32 instruction executes, does the CPU just retrieve the byte at 0x100? What about the rest of the bytes, at 0x101, 0x102, and 0x103? How does the CPU retrieve all 32 bits at once?

Edited: new questions. I think I was fundamentally wrong in my understanding of word size. But I'm still confused: how does a 32-bit machine retrieve a long integer that is 8 bytes (64 bits)? Maybe by using movq, but then what about accessing an object that is 256 bytes? Does the CPU just issue movq 4 times? How does the CPU know in advance how many mov instructions it needs to issue to retrieve a large object?

how does a 32-bit machine retrieve a long integer which is 8 bytes (64 bits)?

If you're doing it in integer registers, the compiler has to use multiple instructions, because the architecture doesn't provide an instruction to load two 32-bit registers at once. So the CPU just sees two separate load instructions.

Consider these functions, compiled by gcc7.3 -O3 -m32 for 32-bit x86, with args passed on the stack and 64-bit integers returned in edx:eax (high half in EDX, low half in EAX), i.e. the i386 System V ABI.

int64_t foo(int64_t a) {
    return a + 2;
}
    movl    4(%esp), %eax
    movl    8(%esp), %edx
    addl    $2, %eax
    adcl    $0, %edx                   # add-with-carry
    ret


int64_t bar(int64_t a, int64_t b) {
    return a + b;
}

    movl    12(%esp), %eax      # low half of b
    addl    4(%esp), %eax       # add low half of a
    movl    16(%esp), %edx
    adcl    8(%esp), %edx       # carry-in from low-half add
    ret

The CPU itself provides instructions that programmers / compilers can use when working with data larger than a register. The CPU only supports the widths that are part of the instruction set, not arbitrary widths. This is why we have software.

On x86, the compiler could instead have chosen to use movq into an XMM or MMX register and use paddq, especially if this was part of a larger function that could store the 64-bit result somewhere in memory instead of needing it in integer registers. But this only works up to the limit of what you can do with vector registers, and they only support elements up to 64 bits wide. There's no 128-bit addition instruction.

how does the cpu know in advance how many mov instructions it needs to issue to retrieve a large object?

The CPU only has to execute every instruction exactly once, in program order. (Or do whatever it wants internally to give the illusion of doing this.)

An x86 CPU has to know how to decode any possible x86 instruction into the right internal operations. If the CPU can only load 128 bits at a time, it has to decode a 256-bit vector load like vmovups (%edi), %ymm0 into multiple load operations internally (like AMD does). See David Kanter's write-up on the Bulldozer microarchitecture.

Or it could decode it to a special load operation that takes two cycles in the load port (like Sandybridge), so 256-bit loads/stores don't cost extra front-end bandwidth, only extra time in the load / store ports.

Or, if its internal data path from L1d cache to the execution units is wide enough (Haswell and later), it can decode to a single simple load uop that is handled internally by the cache / load port, very much like mov (%edi), %eax, or especially vmovd (%edi), %xmm0 (a 32-bit zero-extending load into a vector register).

256 bytes is 32 qwords; no current x86 CPU can load that much in a single operation.

256 bits is 4 qwords, or one AVX ymm register. Modern Intel CPUs (Haswell and later) have internal data paths that wide, and really can transfer 256 bits at once from cache to a vector load execution unit, executing vmovups ymm0, [rdi] as a single uop. See How can cache be that fast? for more details about how wide loads from cache give extremely high throughput / bandwidth from L1d cache.

In general, CPUs can load multiple bytes from memory because they are designed to do so and their ISA supports it.

For example, their registers, internal buses, cache design and memory subsystem are designed to do so. Physically, a processor capable of loading 64-bit values may have 64 parallel wires in various places to move 64 bits (8 bytes) around the CPU, but other designs are possible, such as a narrower 16-bit bus that transfers two bytes at a time, or even a bit-serial point-to-point connection that transmits one bit at a time. Different parts of the same CPU may use different designs and different physical widths. For example, reading N bits from DRAM may be implemented as reading M bits in parallel from each of C chips, with the results merged at the memory controller, so each chip needs to support a lesser degree of parallelism than other parts of the core-to-memory path.

The width inherently supported by the ISA may differ from the natural width implemented by the hardware. For example, when Intel added the AVX ISA extension, which was the first to support 256-bit (32-byte) loads and stores, the underlying hardware initially implemented them as a pair of 128-bit operations. Later CPU architectures (Haswell) finally implemented them as full 256-bit-wide operations. Even today, lower-cost x86 chips may split large load/store operations into smaller units.

Ultimately, these are all internal details of the CPU. What you can rely on is the documented behavior, such as what sizes of values can be loaded atomically, or, for CPUs that document it, how long it takes to load values of various types. How it is implemented internally is more of an electrical engineering / CPU design question, and there are many ways to do it.
