
How does kernel know physical memory base address?

I'm trying to understand 2 closely related issues.

  1. Kernel code that runs after the bootloader and before the MMU is enabled operates on physical (identity-mapped) memory. How is this code made portable between CPUs that may have DRAM in different physical address ranges?

  2. For the kernel to manage page tables, it needs some awareness of the available physical memory resources, including the physical memory base address and the amount of physical memory, so that it doesn't hand out physical addresses that fall outside the DRAM range.

I imagine this is somewhat implementation-dependent, but references to how different architectures handle this problem would be appreciated. Some ideas I have so far:

  1. The DRAM physical address range, or at least its base address, is baked in at kernel compile time. This implies that recompilation is needed for different CPUs, even ones with the same ISA. The idea is inspired by this answer here, which, if I'm understanding correctly, describes the same solution for the kernel base address. Since the base address is known at compile time, kernel code references literal addresses rather than offsets from the DRAM/kernel base address.

  2. DRAM information is read from the device tree along with the rest of the physical memory map (see the example node below). This is my impression for at least Xilinx Zynq SoCs, based on forum posts like this one. Code that manages the page table references offsets from the DRAM base address and is therefore portable, without recompilation, across CPUs with different DRAM physical address ranges. While this solution offers more flexibility and requires recompiling only the bootloader rather than the whole kernel when porting to a new CPU, it does leave me wondering how my x86 personal machine can detect at run time how much DRAM I've installed.
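
For reference, a devicetree memory node typically looks like the following (the addresses are illustrative, for a board with 1 GB of DRAM based at 0x80000000):

/ {
    memory@80000000 {
        device_type = "memory";
        reg = <0x80000000 0x40000000>;  /* base address, size */
    };
};

The kernel's early boot code parses this node to learn which physical address ranges are backed by DRAM, so only the device tree blob has to change between boards.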

The physical memory DIMMs available at boot time may not be, and typically are not, mapped to a single contiguous range of the physical memory address space, so there is no single "base address." On a hard reset, after the CPU firmware completes execution, the platform firmware, which is typically either a legacy BIOS or UEFI, is executed. A given motherboard is only compatible with a limited set of CPUs, which typically share the same method for discovering physical memory, including the DIMMs and the platform firmware memory device. An implementation of the platform firmware uses this method to build a table of memory description entries, where each entry describes a physical memory address range. For more information on what this process looks like, see: How Does BIOS initialize DRAM?. This table is stored at an address in main memory (the DIMMs) that is known to be reserved for this purpose and is supposed to be backed by actual memory (though a system may be booted without any DIMMs installed).

Most implementations of the x86 PC BIOS since the mid-1990s offer the real-mode INT 15h E820h function (15h is the interrupt number and E820h is a function code passed in the EAX register). This is a vendor-specific BIOS function first introduced in PhoenixBIOS v4.0 (1992-1994; I'm unable to pin down the exact year) and later adopted by other BIOS vendors. The interface was extended by the ACPI 1.0 specification released in 1996, and later revisions of PhoenixBIOS supported ACPI. The corresponding UEFI interface is GetMemoryMap(), a UEFI boot-time service (meaning that, as defined in the UEFI specification, it can only be called at boot time). The kernel can use one of these interfaces to get the address map describing memory on all NUMA nodes. Other (older) methods on x86 platforms are discussed at Detecting Memory (x86). Later revisions of both the ACPI and UEFI specifications added memory range types for NVDIMMs in addition to DRAM DIMMs.

Consider, for example, how an ACPI-compatible Linux kernel determines which physical address ranges are available (i.e., backed by actual memory) and usable (i.e., free) on an x86 platform with an ACPI-capable BIOS. The BIOS firmware loads the bootloader from the specified bootable storage device to a memory location dedicated for this purpose. After the firmware completes execution, it jumps to the bootloader, which finds the kernel image on the storage media, loads it into memory, and transfers control to the kernel. The bootloader itself needs to know the current memory map in order to allocate some memory for its own operation. It tries to obtain the memory map by calling the E820h function and, if that's not supported, resorts to older PC BIOS interfaces. The kernel boot protocol defines which memory ranges can be used by the bootloader and which memory ranges must be left available for the kernel.

The bootloader itself doesn't modify the memory map or provide the map to the kernel. Instead, when the kernel starts executing, it calls the E820h function and passes to it a 20-bit pointer (in ES:DI) to a buffer that the kernel knows to be free on x86 platforms according to the boot protocol. Each call returns one memory range descriptor of at least 20 bytes; a sketch of this calling convention is shown below. For more information, refer to the latest version of the ACPI specification. Most BIOS implementations support ACPI.
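
Here is a minimal sketch of the descriptor layout and retrieval loop in C. The descriptor layout is defined by the ACPI specification; bios_int15_e820() is a hypothetical helper that would issue the real-mode INT 15h call (EAX = E820h, EBX = continuation value, ES:DI = buffer, ECX = buffer size, EDX = 'SMAP') and return the continuation value left in EBX, or a negative value when the carry flag reports an error:

#include <stdint.h>

struct e820_entry {            /* at least 20 bytes per descriptor */
    uint64_t base;             /* first physical address of the range */
    uint64_t length;           /* size of the range in bytes */
    uint32_t type;             /* 1 = usable, 2 = reserved,
                                  3 = ACPI data, 4 = ACPI NVS, ... */
} __attribute__((packed));

/* Hypothetical wrapper around the real-mode BIOS call. */
extern int32_t bios_int15_e820(struct e820_entry *buf, uint32_t size,
                               uint32_t continuation);

static struct e820_entry map[128];

static int read_e820_map(void)
{
    uint32_t cont = 0;              /* must be 0 on the first call */
    int n = 0;

    do {
        int32_t next = bios_int15_e820(&map[n], sizeof(map[n]), cont);
        if (next < 0)
            break;                  /* carry set: error or unsupported */
        cont = (uint32_t)next;
        n++;
    } while (cont != 0 && n < 128); /* EBX == 0 marks the last entry */

    return n;                       /* number of descriptors retrieved */
}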

Assuming a Linux kernel with upstream-default boot parameters, you can use the command dmesg | grep 'BIOS-provided\|e820' to see the memory range descriptor table that was returned. On my system, it looks like this:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000000917ff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000091800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000d2982fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000d2983000-0x00000000d2989fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000d298a000-0x00000000d2db9fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000d2dba000-0x00000000d323cfff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000d323d000-0x00000000d7eeafff] usable
[    0.000000] BIOS-e820: [mem 0x00000000d7eeb000-0x00000000d7ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000d8000000-0x00000000d875ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000d8760000-0x00000000d87fffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000d8800000-0x00000000d8fadfff] usable
[    0.000000] BIOS-e820: [mem 0x00000000d8fae000-0x00000000d8ffffff] ACPI data
[    0.000000] BIOS-e820: [mem 0x00000000d9000000-0x00000000da718fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000da719000-0x00000000da7fffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000da800000-0x00000000dbe11fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dbe12000-0x00000000dbffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000dd000000-0x00000000df1fffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000041edfffff] usable
[    0.002320] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.002321] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.002937] e820: update [mem 0xdd000000-0xffffffff] usable ==> reserved
[    0.169287] e820: reserve RAM buffer [mem 0x00091800-0x0009ffff]
[    0.169288] e820: reserve RAM buffer [mem 0xd2983000-0xd3ffffff]
[    0.169289] e820: reserve RAM buffer [mem 0xd2dba000-0xd3ffffff]
[    0.169289] e820: reserve RAM buffer [mem 0xd7eeb000-0xd7ffffff]
[    0.169289] e820: reserve RAM buffer [mem 0xd8760000-0xdbffffff]
[    0.169290] e820: reserve RAM buffer [mem 0xd8fae000-0xdbffffff]
[    0.169291] e820: reserve RAM buffer [mem 0xda719000-0xdbffffff]
[    0.169291] e820: reserve RAM buffer [mem 0xdbe12000-0xdbffffff]
[    0.169292] e820: reserve RAM buffer [mem 0x41ee00000-0x41fffffff]

The memory ranges that start with "BIOS-e820" are the ones described in that table; the first line tells you the source of this information. The exact format of this information depends on the Linux kernel version, but in any case you'll see a range and a type in each entry. The rows that start with "e820" (without the "BIOS-" part) are changes that the kernel itself has made to the table. The implementation of the E820h function may be buggy, or there could be overlaps between the ranges in different entries; the kernel performs the necessary checks and makes changes accordingly (the sketch below illustrates the idea). The ranges marked "usable" are mostly free for use by the kernel, with exceptions that are discussed in the ACPI specification and of which the kernel is aware. The vast majority of PC BIOS implementations return at most 128 memory range descriptors. Older versions of the Linux kernel could only handle up to 128 memory ranges, so any entries returned from E820h beyond the 128th were ignored. In later versions this limitation was relaxed; for more information, see the series of kernel patches titled "x86 boot: pass E820 memory map entries more than 128 via linked list of setup data."
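
As an illustration of the kind of sanitization involved (this is not the kernel's actual code, which additionally resolves overlaps between ranges of conflicting types), the following C sketch sorts the raw descriptors and merges same-type ranges that overlap or touch:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct range { uint64_t base, length; uint32_t type; };

static int by_base(const void *a, const void *b)
{
    const struct range *x = a, *y = b;
    return (x->base > y->base) - (x->base < y->base);
}

/* Returns the number of ranges left after merging. */
size_t sanitize(struct range *r, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    qsort(r, n, sizeof(*r), by_base);
    for (size_t i = 1; i < n; i++) {
        uint64_t end = r[out].base + r[out].length;
        if (r[i].type == r[out].type && r[i].base <= end) {
            /* Same type and overlapping or adjacent: extend. */
            uint64_t new_end = r[i].base + r[i].length;
            if (new_end > end)
                r[out].length = new_end - r[out].base;
        } else {
            r[++out] = r[i];    /* start a new output range */
        }
    }
    return out + 1;
}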

Ranges of type usable and ACPI data are backed by DRAM DIMMs. Ranges of type reserved are either backed by DRAM DIMMs or carved out for MMIO by the CPU or platform firmware. Ranges of type ACPI NVS are backed by firmware memory. All other ranges are not backed by actual memory as far as the firmware can tell. Note that the firmware may choose not to map all of the installed DRAM DIMMs or NVDIMMs. This may happen if the physical memory configuration is not supported as-is, or if the firmware is unable to obtain information from an installed DIMM due to an issue in the DIMM.

You can calculate how much of the installed DRAM DIMM and NVDIMM memory is made available by the firmware to the kernel. On my system, I've installed 16 GB of DRAM DIMMs. So unless some of the DIMMs are installed improperly, are not functioning correctly, are not supported by the platform or processor, or the firmware has a bug, there should be a little less than 16 GB made available to the kernel.

All the usable ranges add up to 0x3FA42B800 bytes. Note that the last address of a range is inclusive, meaning that it points to a byte location that is part of the range. The total amount of physically installed memory is 16 GB, or 0x400000000 bytes. So the total amount of installed memory that was not made available to the kernel is 0x400000000 - 0x3FA42B800, or about 92 MB of the total 16 GB; the short program below checks this arithmetic. This memory is covered by some of the reserved ranges and all of the ACPI data ranges. If certain locations in a DRAM DIMM or NVDIMM were determined by the platform firmware to be unreliable, they are also carved out as reserved.
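
This small C program sums the usable ranges copied from the dmesg output above (range ends are inclusive, hence the +1); it prints 0x3fa42b800 usable bytes and a 0x5bd4800-byte (about 92 MB) shortfall:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* {start, end} pairs of every range marked "usable" above. */
    static const uint64_t usable[][2] = {
        {0x0000000000000000, 0x00000000000917ff},
        {0x0000000000100000, 0x00000000d2982fff},
        {0x00000000d298a000, 0x00000000d2db9fff},
        {0x00000000d323d000, 0x00000000d7eeafff},
        {0x00000000d8000000, 0x00000000d875ffff},
        {0x00000000d8800000, 0x00000000d8fadfff},
        {0x00000000d9000000, 0x00000000da718fff},
        {0x00000000da800000, 0x00000000dbe11fff},
        {0x0000000100000000, 0x000000041edfffff},
    };
    uint64_t total = 0;

    for (size_t i = 0; i < sizeof(usable) / sizeof(usable[0]); i++)
        total += usable[i][1] - usable[i][0] + 1;   /* inclusive end */
    printf("usable:  %#llx bytes\n", (unsigned long long)total);
    printf("missing: %#llx bytes\n",
           (unsigned long long)(0x400000000ULL - total));
    return 0;
}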

Note that the range 0x000a0000-0x000fffff is not described in the E820 memory map, as per the ACPI specification. This is the 640KB-1MB upper memory area. The kernel prints a message saying that it has removed this range from the usable memory area to maintain compatibility with ancient systems.

At this point, memory to be used as MMIO for most of the PCIe devices has not yet been allocated. My processor supports a 39-bit physical address space, which means that addresses between 0 and 2^39 - 1 are available for mapping. So far, only most of the bottom 16.5 GB of this space has been mapped to something, and there are still unmapped gaps in that range. The kernel can use these gaps (a few hundred MBs) and the rest of the physical address space (about 495.5 GB) to allocate address ranges for I/O devices. The kernel will eventually discover the PCIe devices and, for each device, try to load a compatible driver if one is available. The driver then determines how much memory the device needs and any restrictions the device imposes on the memory addresses, and requests that the kernel allocate memory for the device and configure it as MMIO memory owned by the device. You can see the final memory map using the command sudo cat /proc/iomem.

There are situations where you'd want to manually change the memory type of an existing memory range (e.g., for testing), create a new range (e.g., for emulating persistent memory on DRAM, or if the firmware is unable to discover all of the available memory for whatever reason), reduce the amount of memory usable by the kernel (e.g., to prevent a bare-metal hypervisor from using memory beyond a limit and make the rest available for guests), or even completely override the entire table returned from E820h. The mem and memmap kernel parameters can be used for such purposes (see the examples below). When one or more of these parameters are specified with valid values, the kernel first reads the BIOS-provided memory map and then makes changes accordingly. The kernel prints the final memory map as a "user-defined physical RAM map" in the kernel message ring buffer. You can view these messages with dmesg | grep user: (each memory range row starts with "user:"). These messages are printed after the "BIOS-e820" messages.
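
For illustration, here are some forms of these parameters documented in the kernel's admin guide (the sizes and addresses are arbitrary examples, and some bootloaders require escaping the $ character):

mem=4G                  limit the memory usable by the kernel to 4 GB
memmap=64M$0x100000000  mark 64 MB starting at 0x100000000 as reserved
memmap=4G!0x100000000   mark 4 GB at 0x100000000 as persistent memory
                        (useful for emulating an NVDIMM on DRAM)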

On an x86 platform booted with UEFI firmware that supports the Compatibility Support Module (refer to the CSM specification, which is separate from UEFI, for more information), the legacy real-mode E820h interface is supported, and the Linux kernel by default still uses it. If the kernel is running on an x86 platform with UEFI firmware that doesn't support the CSM, the E820h interface may not provide all or any of the memory ranges. It may be necessary to use the add_efi_memmap kernel parameter on such platforms; an example can be found at UEFI Memory V E820 Memory. When one or more memory ranges are provided through GetMemoryMap(), the kernel merges these ranges with those from the E820h interface. The resulting memory map can be viewed using dmesg | grep 'efi:'. Another UEFI-related kernel parameter that affects the memory map is efi_fake_mem.
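
For reference, this is the usual GetMemoryMap() calling pattern at UEFI boot time, sketched in C against the types the UEFI specification defines (headers from gnu-efi or EDK II provide them; calling-convention details are glossed over here). The first call with a zero-sized buffer fails with EFI_BUFFER_TOO_SMALL and reports the size needed:

#include <efi.h>

EFI_STATUS read_uefi_memory_map(EFI_BOOT_SERVICES *bs)
{
    UINTN size = 0, map_key, desc_size;
    UINT32 desc_version;
    EFI_MEMORY_DESCRIPTOR *map = NULL;
    EFI_STATUS status;

    /* First call: learn how large the buffer must be. */
    status = bs->GetMemoryMap(&size, map, &map_key, &desc_size,
                              &desc_version);
    if (status != EFI_BUFFER_TOO_SMALL)
        return status;

    /* Allocate a little extra: the allocation itself may grow the map. */
    size += 2 * desc_size;
    status = bs->AllocatePool(EfiLoaderData, size, (VOID **)&map);
    if (EFI_ERROR(status))
        return status;

    status = bs->GetMemoryMap(&size, map, &map_key, &desc_size,
                              &desc_version);
    /* On success, the buffer holds size / desc_size descriptors, each
     * with Type, PhysicalStart, VirtualStart, NumberOfPages and
     * Attribute fields. */
    return status;
}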

The ACPI specification (Section 6.3) provides notification mechanisms to inform the kernel when an I/O or DIMM device has been inserted into or removed from the system in any S-state. (I don't know whether there are any motherboards that support removing DIMMs in any S-state, though; this is usually only possible in the G3 state and maybe S4 and/or S5.) When such an event occurs, either the kernel or the firmware makes the appropriate changes to the memory map. These changes are reflected in sudo cat /proc/iomem.

PC-relative addressing refers to a programming technique whereby your program can operate at any address. Since relocation registers (e.g., segments) have become passé, most PC-relative programming is performed explicitly. Here is an example in a generic sort of machine code:

.text
entry:
    call reloc  /* call is pc relative */
reloc:
    pop %r0     /* r0 now contains physical address of reloc */
    sub $reloc, %r0, %r14  /* r14 = physical address minus link address of reloc */
/* At this point, r14 is a relocation register.  A virtual address + r14 == the corresponding physical address. */
    add $proot, %r14, %r0  /* physical address of page table root */
    add $entry, %r14, %r1  /* entry is where we were loaded into ram */
    test $0xfff, %r1   /* someone is being funny and not page aligning us */
    jnz bad_alignment
    or   $0x7, %r1     /* put mythical page protection bits in r1 */
    mov $1024, %r2     /* number of pages in r2 */
loop:
    store %r1, (%r0)   /* store a page table entry */
    add $0x1000, %r1   /* setup next one 4096 bytes farther */
    add $4, %r0        /* point to next page table entry */
    sub $1, %r2        /* are we done? */
    cmp $0, %r2
    jne loop           /* nope, setup next entry */
    add $proot, %r14, %r0
    loadsysreg %r0, page_table_base_register
    mov $1, %r0
    mov $v_entry, %r1
    loadsysreg %r0, page_table_enabled
    jmp %r1
v_entry:
        /* now we are virtually addressed */
    call main
1:  jmp 1b   /* main shouldn't return. */
bad_alignment:
    jmp bad_alignment  /* loader gave us an unaligned image; hang */


.data
.align 12   /* 4096 byte pages */
proot:
.zero 4096
.text

This mythical machine is very simple, with a single flat page table, and the kernel is linked at address 0 but could be run from anywhere in the first 4M (1024 * 4096 bytes). Real machines are just more detailed versions of this. In general, you cannot trust even systems languages like C until the initial address space is set up. Once it is, code running in it can construct much more intricate page tables, and query databases like the device tree, or even monstrosities like ACPI/UEFI, for more information about the RAM layout, etc.

In forward-mapped page table architectures where the interior nodes use a format compatible with the leaf nodes (classic x86, for example), you can use a single page table recursively to allow a more flexible link address. For example, if you pointed the last entry of proot (that is, proot[1023]) back at proot, then you could link your OS at 0xffffc000 and this code would just work (once translated to x86); a sketch of the trick follows.
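
A minimal sketch of that recursive trick on classic two-level x86 paging, assuming 4 KB pages and 1024-entry tables (names are illustrative):

#include <stdint.h>

#define PAGE_PRESENT 0x1u
#define PAGE_RW      0x2u

/* Point the page directory's last slot at the directory itself. */
void install_recursive_entry(uint32_t *proot,      /* directory, virtual */
                             uint32_t proot_phys)  /* its physical address */
{
    proot[1023] = (proot_phys & ~0xFFFu) | PAGE_PRESENT | PAGE_RW;
    /* Afterwards, virtual 0xFFC00000 + (i << 12) reaches the page table
     * installed in proot[i], and virtual 0xFFFFF000 reaches proot itself,
     * because both lookup levels resolve through proot. */
}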
