QEMU for AArch64: why does execution get stuck at "ldr q1, [x0]"?

Question

I have this simple C code:

#include "uart.h"
#include <string.h>
char x[32];
__attribute__((noinline))
void foo(void)
{
    strcpy(x, "xxxxxxxxxxxxxxxxxxxxxxxx");
}
int main(void)
{
  uart_puts("xxx\n");
  foo();
  uart_puts("yyy\n");
}

compiled as:

$ aarch64-none-elf-gcc t78.c -mcpu=cortex-a57 -Wall -Wextra -g -O2 -c -std=c11 \
&& aarch64-none-elf-ld -T linker.ld t78.o boot.o uart.o -o kernel.elf

and executed as:

$ qemu-system-aarch64.exe -machine virt -cpu cortex-a57 -nographic -kernel kernel.elf

prints:

xxx

Why is yyy not printed?

By reducing the issue I've found that:

for strcpy, GCC generated inline code rather than a call to strcpy (see below);
ldr q1, [x0] is what causes yyy not to be printed.

Here is the generated code for foo:

foo:
.LFB0:
        .file 1 "t78.c"
        .loc 1 6 1 view -0
        .cfi_startproc
        .loc 1 7 5 view .LVU1
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        adrp    x1, .LANCHOR0
        add     x2, x1, :lo12:.LANCHOR0
        ldr     q1, [x0]                     <<== root cause
        ldr     q0, [x0, 9]
        str     q1, [x1, #:lo12:.LANCHOR0]
        str     q0, [x2, 9]
        .loc 1 8 1 is_stmt 0 view .LVU2
        ret

If I put ret before ldr q1, [x0], yyy is printed (as expected).

The question: why does ldr q1, [x0] cause yyy not to be printed?

Tool versions:

$ aarch64-none-elf-gcc --version
aarch64-none-elf-gcc.exe (Arm GNU Toolchain 12.2.Rel1 (Build arm-12.24)) 12.2.1 20221205

$ qemu-system-aarch64 --version
QEMU emulator version 7.2.0 (v7.2.0-11948-ge6523b71fc-dirty)

Answer 1

The ldr q1, [x0] instruction takes an exception because it accesses a floating-point/SIMD register, but your startup code does not enable the FPU. The compiler assumes that it can generate code that uses the FPU, so to meet that assumption your startup code must enable the FPU, via at least CPACR_EL1, and possibly other registers if EL2 or EL3 are enabled.

Alternatively, you could tell the compiler not to emit code that uses the FPU. The Linux kernel takes this approach, using the -mgeneral-regs-only option.
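If only a few functions matter, GCC for AArch64 also documents a per-function form of the same restriction, the general-regs-only target attribute. The sketch below is a hedged illustration, not what the Linux kernel does: the NO_FP_SIMD macro name is made up for this example, and it expands to nothing on other compilers or targets so the snippet still builds there. Note that library routines the compiler decides to call instead may still use SIMD.

```c
#include <string.h>

/* NO_FP_SIMD is a hypothetical macro name for this sketch.
 * On GCC targeting AArch64 it asks the compiler not to use
 * FP/Advanced SIMD registers in the annotated function;
 * elsewhere it expands to nothing. */
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
#define NO_FP_SIMD __attribute__((target("general-regs-only")))
#else
#define NO_FP_SIMD
#endif

static char x[32];

/* Same body as foo() in the question, but restricted to general registers. */
NO_FP_SIMD __attribute__((noinline))
void foo(void)
{
    strcpy(x, "xxxxxxxxxxxxxxxxxxxxxxxx");
}
```

For a whole bare-metal image, passing -mgeneral-regs-only on the command line (as described above) is the simpler choice; the attribute is mainly useful for code that must run before the FPU is enabled.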

Real hardware probably has stricter requirements for what you need to do to configure the CPU to be able to run C code; QEMU is quite lenient. For instance, the architecture defines the reset value of many system registers as UNKNOWN, though QEMU usually resets them to zero. A robust startup sequence will explicitly set bits in registers like SCTLR_EL1.

You may also need to check that your compiler and your startup code agree on whether compiler-generated code may perform unaligned accesses. If the MMU is not enabled, all memory accesses are treated as being of type Device, which means they must be aligned (regardless of SCTLR_EL1.A). So you either need to make sure your compiler doesn't emit unaligned loads and stores, or else turn on the MMU and set SCTLR_EL1.A to 0.
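To make the alignment point concrete, here is a small helper for checking whether an access is naturally aligned. It is only a sketch: the function name is invented for this example, and the addresses in the usage note are illustrative, not taken from the actual link map.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if an access of `size` bytes at `addr`
 * is naturally aligned. `size` must be a power of two. */
static inline bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}
```

In the question's listing, ldr q0, [x0, 9] reads 16 bytes starting 9 bytes past .LC0. If .LC0 itself is 8- or 16-byte aligned (an assumption about the layout), is_naturally_aligned(lc0 + 9, 16) is false, so that load is one to look at once the FPU trap is out of the way.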

You could improve your ability to debug this sort of "exception in early bootup" by installing some exception vectors which do something helpful when an unexpected exception occurs. The ideal is to be able to print registers, especially ELR_EL1 and ESR_EL1, which tell you where and why the exception occurred; printing in early bootup can be tricky, though. An easy compromise is to at least catch the exception and loop; you can then use gdb to see what the CPU state is.
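As a rough illustration of what "why the exception occurred" looks like in practice, the helpers below split an ESR_EL1 value into its exception class (EC, bits [31:26]) and syndrome (ISS, bits [24:0]) and name a few common classes. The function names are hypothetical and the table is deliberately incomplete; the Arm Architecture Reference Manual has the full list.

```c
#include <stdint.h>

/* ESR_ELx layout: EC = bits [31:26], IL = bit 25, ISS = bits [24:0]. */
static inline unsigned esr_ec(uint64_t esr)
{
    return (unsigned)((esr >> 26) & 0x3f);
}

static inline uint32_t esr_iss(uint64_t esr)
{
    return (uint32_t)(esr & 0x1ffffff);
}

/* Names for a handful of exception classes (not exhaustive). */
static const char *esr_ec_name(unsigned ec)
{
    switch (ec) {
    case 0x00: return "unknown reason";
    case 0x07: return "FP/SIMD access trapped";
    case 0x15: return "SVC from AArch64";
    case 0x21: return "instruction abort, same EL";
    case 0x22: return "PC alignment fault";
    case 0x25: return "data abort, same EL";
    case 0x26: return "SP alignment fault";
    default:   return "other";
    }
}
```

For example, an ESR_EL1 value of 0x1fe00000 decodes to EC 0x07, the class used for trapped FP/SIMD accesses. In a handler you would read ESR_EL1 and ELR_EL1 with mrs and pass the former to these helpers, or simply inspect both registers from gdb.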

Answer 2

An addition to the answer by Peter Maydell.

Here is the code that enables the FPU (found here):

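mrs    x1, cpacr_el1        // read the current CPACR_EL1 value
mov    x0, #(3 << 20)       // FPEN = 0b11: do not trap FP/SIMD accesses at EL0/EL1
orr    x0, x1, x0
msr    cpacr_el1, x0        // write the updated value back
isb                         // synchronize before the next FP/SIMD instruction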
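For those who prefer to keep this in C, the same read-modify-write can be sketched with inline assembly. This is an illustration under assumptions, not a drop-in startup routine: the names are invented, the register access is only compiled when targeting AArch64, and it must run at EL1 or higher before any code that touches FP/SIMD registers.

```c
#include <stdint.h>

/* CPACR_EL1.FPEN occupies bits [21:20]; 0b11 disables FP/SIMD trapping. */
#define CPACR_EL1_FPEN_SHIFT 20
#define CPACR_EL1_FPEN_MASK  (UINT64_C(3) << CPACR_EL1_FPEN_SHIFT)

/* Pure helper (hypothetical name): the CPACR_EL1 value with FPEN = 0b11. */
static inline uint64_t cpacr_with_fp_enabled(uint64_t cpacr)
{
    return cpacr | CPACR_EL1_FPEN_MASK;
}

#if defined(__aarch64__)
/* Hypothetical wrapper: read CPACR_EL1, set FPEN, write it back, then ISB. */
static inline void enable_fp_simd(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, cpacr_el1" : "=r"(v));
    v = cpacr_with_fp_enabled(v);
    __asm__ volatile("msr cpacr_el1, %0\n\tisb" : : "r"(v) : "memory");
}
#endif
```

Doing this in the assembly startup file, as above, is usually safer, because the compiler is free to use SIMD registers in any C code that runs before enable_fp_simd() unless that code is built with -mgeneral-regs-only.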

2 answers

solution1
1 ACCEPTED 2023-02-01 15:43:58

solution2
0 2023-02-02 11:53:26
