vfork+execve strange when using syscall

Question

If you run the code below, you'll see that execve appears to return a process ID and the parent branch never executes. I looked for documentation but either couldn't find it or couldn't understand it. The clone man page describes CLONE_VFORK as quoted below, yet the parent never seems to run. If you uncomment the non-syscall vfork, or use the fork syscall instead, it works as expected.

the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).

#include <unistd.h>
#include <syscall.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    //int a = vfork();
    //int a = syscall(__NR_fork);
    int a = syscall(__NR_vfork);
    if (a) {
        write(2, "parent\n", 7);
    } else {
        char*args[] = {"/usr/bin/true", (char*)0};
        int res = execve(args[0], args, &argv[2]);
        char buf[256];
        sprintf(buf, "child got %d\n", res);
        write(2, buf, strlen(buf));
    }
    write(2, "Done\nChild\n", a?5:11);
}

Answer 1

I was curious what exactly happened. I used strace -f ./a.out and got output like this, showing that it's the parent making the write(2, "Done\nChild\n", 11) system call (the lower-numbered PID, not the new PID that strace reports attaching to after vfork).

...
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7f7e48c59000, 193483)          = 0
vfork(strace: Process 515667 attached
 <unfinished ...>
[pid 515667] execve("/usr/bin/true", ["/usr/bin/true"], 0x7ffc4447ce18 /* 60 vars */ <unfinished ...>
[pid 515666] <... vfork resumed>)       = 515667
[pid 515666] write(2, "child got 515667\n", 17child got 515667
) = 17
[pid 515667] <... execve resumed>)      = 0
[pid 515666] write(2, "Done\nChild\n", 11Done
Child
) = 11
[pid 515667] brk(NULL <unfinished ...>
[pid 515666] exit_group(0 <unfinished ...>
[pid 515667] <... brk resumed>)         = 0x5603b644c000
[pid 515666] <... exit_group resumed>)  = ?
[pid 515667] arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc878f2720) = -1 EINVAL (Invalid argument)
[pid 515666] +++ exited with 0 +++
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
... the parent has exited by now, leaving just the child running the dynamic linker for /usr/bin/true

This is terminal output mixed with strace output; I could have used strace -f -o vfork.trace ./a.out to capture the log separately, or ./a.out &>/dev/null.

The child overwrites the parent's return address, to the `execve` call site

The actual behaviour of this C code with undefined behaviour happened to be the same with gcc (-O0 by default), gcc -O3, and clang -O3. So to get asm that was easier to single-step with GDB, I built it with gcc -O3 -fno-plt on my Arch GNU/Linux system (GCC 12.2, in case it matters). -fno-plt means that dynamic linking isn't "lazy", so we can step into library functions.

It was also handy to look at the compiler's asm source with symbolic names ( https://godbolt.org/z/j6ME6rWaa ).

After vfork , GDB detaches the child and lets it run, so you're still single-stepping the parent.

The parent's return from the glibc syscall() wrapper function is not to the test eax,eax instruction after call syscall; it's to the instruction after a different call. It seems that after the child returns from vfork, it ends up overwriting the return address on the stack before the parent has a chance to run. That makes sense; the compiler-generated asm for main doesn't adjust RSP after function entry, so any other call would push a return address to the same place, overwriting the return address the parent still needs.

The glibc wrapper for vfork avoids this by popping the return address before the syscall and pushing it back right after, to make it work under the conditions where POSIX and the Linux man page say it should. (Those conditions don't include the way you're using it, but even in a safe usage, a call execve in the child before the parent can ret from a wrapper function would be a problem.) The glibc wrapper's correctness also relies on the kernel semantics of not running the parent until after the child has exited or execve'd (see a later section below); looking at just the user-space asm, you'd think there was a possible race condition and that it might only usually work.

The actual place it returned to was a RIP-relative LEA following a call , not a test eax,eax . That was the lightbulb moment, the clue that a return address would have been overwritten. That LEA is setting up args for sprintf ; the preceding call was call execve .

That makes sense; execve is the last thing the child did since it only returns on error; on success it replaces the process with a fresh address space that's no longer shared with the parent.

After the child returned from syscall(__NR_vfork), it branched and called execve, pushing that return address and overwriting the parent's return address from call syscall, because they share an address space, including the stack.

That leaves just the parent, executing from the return path of execve() , which in a non-buggy (or non-hacky) program would only be reachable on error.

So it does the sprintf. It prints child got 515667 because that PID was the value in EAX as the parent was returning from vfork (to this block of code which takes res from the EAX return value of this other call site.)

As for how it manages to pick 11 instead of 5 as the length for the write system call, the details probably differ in debug vs. optimized builds. In an optimized build, different branches of the if(a) leave a different number in a register which the call to write() uses.

In a debug build, only the child returned to the vfork call site, so the only value ever stored to a on the (shared) stack was the child's 0; the parent later reloads that 0 and picks 11.

Shenanigans like this are why nobody uses vfork anymore; a couple copy-on-write page-faults are cheap enough that it's not worth playing with fire.

It's also why the rules on how you're allowed to use vfork are very restrictive; you'd better have your args for execve already constructed before you call vfork , so the very next thing can be a call execve .

`syscall(__NR_vfork)` isn't safe; it needs special handling

Single-stepping into the glibc wrapper ( stepi aka si in GDB, in layout asm TUI mode), we can see its asm.

│    0x7ffff7e7d830 <vfork>          endbr64
│    0x7ffff7e7d834 <vfork+4>        pop    rdi
│    0x7ffff7e7d835 <vfork+5>        mov    eax,0x3a
│    0x7ffff7e7d83a <vfork+10>       syscall
│    0x7ffff7e7d83c <vfork+12>       push   rdi
│  > 0x7ffff7e7d83d <vfork+13>       cmp    eax,0xfffff001     # EAX >= -ERRNO_MAX
│    0x7ffff7e7d842 <vfork+18>       jae    0x7ffff7e7d858 <vfork+40>
     # else no-error return path.
│    0x7ffff7e7d844 <vfork+20>       xor    esi,esi
│    0x7ffff7e7d846 <vfork+22>       rdsspq rsi
│    0x7ffff7e7d84b <vfork+27>       test   rsi,rsi   # if shadow stack not in use
│    0x7ffff7e7d84e <vfork+30>       je     0x7ffff7e7d857 <vfork+39>
│    0x7ffff7e7d850 <vfork+32>       test   eax,eax   # in parent, normal return
│    0x7ffff7e7d852 <vfork+34>       jne    0x7ffff7e7d857 <vfork+39>
│    0x7ffff7e7d854 <vfork+36>       pop    rdi         # pop real return address
│    0x7ffff7e7d855 <vfork+37>       jmp    rdi         # and manually return to the correct address from the shadow stack?

     # no shadow-stack path of execution, return normally.
│    0x7ffff7e7d857 <vfork+39>       ret

  # error handling, set errno and return -1
│    0x7ffff7e7d858 <vfork+40>       mov    rcx,QWORD PTR [rip+0x105509]        # 0x7ffff7f82d68
│    0x7ffff7e7d85f <vfork+47>       neg    eax
│    0x7ffff7e7d861 <vfork+49>       mov    DWORD PTR fs:[rcx],eax
│    0x7ffff7e7d864 <vfork+52>       or     rax,0xffffffffffffffff   # code-size optimization for mov rax,-1   (really rarely executed for most system calls)
│    0x7ffff7e7d868 <vfork+56>       ret

rdsspq reads the "shadow stack" pointer, in case the caller was using CET, Control-flow Enforcement Technology. I'm not familiar with CET, so my comments on that part are guesswork based on what this function probably needs to do, and how it's using these instructions.

I should have just looked at the hand-written glibc source which has comments, glibc/sysdeps/unix/sysv/linux/x86_64/vfork.S ; updated with some from there.

It seems like there could still be a race with the child, like if our push rdi runs before the child returns and calls execve . Under normal scheduling conditions, though, the child does run first.

But no, there's special logic to handle that:

https://man7.org/linux/man-pages/man2/vfork.2.html

vfork() differs from fork(2) in that the calling thread is suspended until the child terminates (either normally, by calling _exit(2), or abnormally, after delivery of a fatal signal), or it makes a call to execve(2) . Until that point, the child shares all memory with its parent, including the stack. The child must not return from the current function or call exit(3) (which would have the effect of calling exit handlers established by the parent process and flushing the parent's stdio(3) buffers), but may call _exit(2).

As you mentioned in comments, if you wanted to use this for concurrency / threading, use pthread_create(3) to start threads, not vfork(), or the same raw system call it uses, clone(CLONE_THREAD). (Note that the glibc wrapper for clone uses the new thread's stack memory to store a code pointer to be called; the kernel API/ABI doesn't have a code-pointer arg; see the C library / kernel differences part of the man page, and maybe the glibc source code for clone().)

These days, vfork is implemented inside the kernel as clone( flags=CLONE_VM | CLONE_VFORK | SIGCHLD ) .

Answer 2

There are multiple instances of undefined behavior in the code.

You are invoking undefined behavior by making calls such as sprintf() and write() after execve() fails. Per POSIX :

... the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork() , or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit() or one of the exec family of functions.

Even simply returning from main() after vfork() invokes undefined behavior.

@Barmar summed it up best: "you should just not use vfork() at all"

This code also invokes undefined behavior:

    char*args[] = {"/usr/bin/true", (char*)0};
    int res = execve(args[0], args, &argv[2]);

argv[2] doesn't exist, so passing its address to execve() invokes undefined behavior. Note that taking the address of argv[2] does not in itself invoke undefined behavior: an address one past the actual end of an array does exist. But it can't be safely dereferenced, which execve() will do.

execve() expects a pointer to an array of environment pointers as its third argument :

Using execve()

The following example passes arguments to the ls command in the cmd array, and specifies the environment for the new process image using the env argument.
    #include <unistd.h>

    int ret;
    char *cmd[] = { "ls", "-l", (char *)0 };
    char *env[] = { "HOME=/usr/home", "LOGNAME=home", (char *)0 };
    ...
    ret = execve ("/bin/ls", cmd, env);
