
Can't trace the syscalls of a subprocess that calls execve, using ptrace and seccomp

I am writing a syscall tracer using seccomp. I don't modify the system calls; I just log each one in a structure, and when the process finishes I dump that structure to disk.

When I run my program (it's called tracer) like this:

tracer env

Everything works well, and I see the logs in the file afterwards. However, if I try to trace a program that itself calls execve, it fails:

tracer watch -n1 env

or

tracer strace -o /tmp/log env

fails with this output:

env: error while loading shared libraries: cannot create cache for search path: Cannot allocate memory

and the log:

$ cat /tmp/log
execve("/usr/bin/env", ["env"], [/* 19 vars */]) = 0
brk(NULL)                               = 0x415000
mmap(0xffffffffffffffda, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2
writev(103, [{iov_base="env", iov_len=3}, {iov_base=": ", iov_len=2}, {iov_base="error while loading shared libraries", iov_len=36}, {iov_base=": ", iov_len=2}, {iov_base="", iov_len=0}, {iov_base="", iov_len=0}, {iov_base="cannot create cache for search path", iov_len=35}, {iov_base=": ", iov_len=2}, {iov_base="Cannot allocate memory", iov_len=22}, {iov_base="\n", iov_len=1}], 10) = 127
+++ exited with 127 +++

Notice the weird mmap address and its return value. I don't understand what is wrong or why this happens. Every other program works fine, so I guess the problem is with how the seccomp filters are copied to the forked process that calls execve.

Here are my seccomp rules:

struct sock_filter filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_openat, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_mmap, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_mprotect, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_close, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
};

I'm not listing the whole code, since it is straightforward and can only really be written one way; it also appears in the article I referred to above. The problem seems to be known on the Internet, but I was not able to find a solution. If you still want the whole code or an MCVE, I can provide it.

Also, when I add execve to the traced syscalls, the behavior changes:

struct sock_filter filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_openat, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_mmap, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_mprotect, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_close, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_execve, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
};

The log becomes:

$ cat /tmp/log
execve(0xffffffffffffffda, ["env"], [/* 19 vars */]) = -1 ENOSYS (Function not implemented)
getpid()                                = 15535
exit_group(1)                           = ?
+++ exited with 1 +++

Environment: Linux 4.4 (aarch64) and Linux 4.15 (x86-64).

The more time I spent on this problem, the more I realized that the cause lies in how the kernel handles seccomp filters. The kernel copies the filters from a process to its child, but it does not copy the tracer relationship. So all of the SECCOMP_RET_TRACE rules are inherited by the child, yet the child has no tracer attached, and every system call matched by one of those rules fails with -ENOSYS. That is also where the odd 0xffffffffffffffda in the logs comes from: it is -38, i.e. -ENOSYS, printed as an unsigned 64-bit value.
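The behavior described above can be reproduced without any tracer involved. The following is a minimal sketch, not the original tracer: the helper name `ret_trace_without_tracer_gives_enosys` is made up for illustration, the filter is a cut-down version of the one in the question (only close is matched), and it leaves out the architecture check that a production filter would normally have. It installs the filter in a forked child that nobody traces and checks how a matched syscall fails.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the original post).
 * Returns 1 if a syscall matched by SECCOMP_RET_TRACE fails with ENOSYS
 * in an untraced child, 0 if it fails some other way, and -1 if the
 * experiment could not be set up (fork or prctl failed). */
static int ret_trace_without_tracer_gives_enosys(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Cut-down filter: trace close(), allow everything else. */
        struct sock_filter filter[] = {
            BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_close, 0, 1),
            BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
            BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
            .filter = filter,
        };

        /* Needed so an unprivileged process may install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(2);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);

        /* Without the filter this would fail with EBADF. With the filter
         * and no tracer attached, the kernel skips the syscall. */
        errno = 0;
        long rc = syscall(__NR_close, -1);
        _exit(rc == -1 && errno == ENOSYS ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

On a kernel that permits unprivileged seccomp filters, this is expected to return 1, which matches the ENOSYS failures seen in the logs above.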

I have found a way to solve this problem. To make the tracer follow child processes as well, or at least to avoid the ENOSYS problem in sub-children, we can set the PTRACE_O_TRACEFORK and PTRACE_O_TRACECLONE flags when setting the ptrace options, like this:

ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESECCOMP | PTRACE_O_TRACEFORK | PTRACE_O_TRACECLONE);

Why we need both is not easy to explain briefly. For one thing, which syscalls exist on a system, and which of them programs actually use (usually through the libc implementation), depends on the architecture and the libc. This list may not even be complete: we may also have to follow vfork (PTRACE_O_TRACEVFORK) and other ways of cloning or spawning a thread or a process (remember, threads are light-weight processes in Linux). What these options do is described in the ptrace(2) man page:

PTRACE_O_TRACECLONE (since Linux 2.5.46) Stop the tracee at the next clone(2) and automatically start tracing the newly cloned process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that

 status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8)) 

The PID of the new process can be retrieved with PTRACE_GETEVENTMSG. This option may not catch clone(2) calls in all cases. If the tracee calls clone(2) with the CLONE_VFORK flag, PTRACE_EVENT_VFORK will be delivered instead if PTRACE_O_TRACEVFORK is set; otherwise if the tracee calls clone(2) with the exit signal set to SIGCHLD, PTRACE_EVENT_FORK will be delivered if PTRACE_O_TRACEFORK is set.
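The status encoding quoted from the man page can be decoded with a couple of small helpers. This is only a sketch: the names `ptrace_event_from_status` and `is_new_child_event` are hypothetical, and it assumes the tracer used PTRACE_TRACEME or PTRACE_ATTACH rather than PTRACE_SEIZE, so every event stop is reported with SIGTRAP.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: extracts the PTRACE_EVENT_* code from a waitpid()
 * status, or returns 0 if the status is not a ptrace event stop.
 * Per the man page, status>>8 == (SIGTRAP | (EVENT<<8)), so the event
 * code sits in bits 16 and up. */
static int ptrace_event_from_status(int status)
{
    if (!WIFSTOPPED(status))
        return 0;
    if (WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xffff;
}

/* Hypothetical helper: returns 1 if the event announces a new process or
 * thread that the tracer now also has to wait for, 0 otherwise. */
static int is_new_child_event(int event)
{
    return event == PTRACE_EVENT_FORK ||
           event == PTRACE_EVENT_VFORK ||
           event == PTRACE_EVENT_CLONE;
}
```

When `is_new_child_event` reports a new child, its PID can then be fetched with PTRACE_GETEVENTMSG, as the man page excerpt says.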

This works in my case because, after a plain clone, the seccomp rules were copied to the cloned process but the tracer was not. With these flags set, the parent process automatically becomes the tracer of every child process, so the rules are inherited and a tracer is present, and everything works like a charm.

NOTE: Since the parent process becomes the tracer this way, you also need to wait for all children and sub-children, not only for the process you actually spawned. To do this, pass -1 as the pid argument to waitpid or similar calls:

// Wait for any child or traced descendant:
const pid_t childWaited = waitpid(-1, &status, 0);
// rather than only for the process that was spawned directly:
// const pid_t result = waitpid(myChildPid, &status, 0);
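Put together, the wait loop might look roughly like the sketch below. The function name `wait_for_all_tracees` is hypothetical, and the signal handling is deliberately simplified: every SIGSTOP and SIGTRAP is swallowed, which a real tracer would have to treat more carefully. The `__WALL` flag is an addition not present in the original snippet; it makes waitpid also report clone children that do not signal their exit with SIGCHLD.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: waits on every child and traced descendant until
 * none are left. Returns the number of processes that were seen exiting,
 * or -1 on an unexpected error. */
static int wait_for_all_tracees(void)
{
    int exited = 0;

    for (;;) {
        int status = 0;
        /* -1 means "any child or tracee", not just the one we spawned. */
        pid_t pid = waitpid(-1, &status, __WALL);
        if (pid == -1) {
            if (errno == EINTR)
                continue;
            /* ECHILD: nothing left to wait for, so we are done. */
            return errno == ECHILD ? exited : -1;
        }
        assert(pid > 0);

        if (WIFEXITED(status) || WIFSIGNALED(status)) {
            exited++;
            continue;
        }

        if (WIFSTOPPED(status)) {
            int sig = WSTOPSIG(status);
            int event = status >> 16;
            /* A real tracer would log the syscall here when
             * event == PTRACE_EVENT_SECCOMP. */

            /* Simplification: do not forward event stops, the SIGSTOP of a
             * newly auto-attached child, or the SIGTRAP after execve.
             * Any other signal is passed on to the tracee. */
            int inject = (event != 0 || sig == SIGSTOP || sig == SIGTRAP) ? 0 : sig;
            if (ptrace(PTRACE_CONT, pid, NULL, (void *)(long)inject) == -1 &&
                errno != ESRCH)
                return -1;
        }
    }
}
```

The loop ends only when waitpid reports ECHILD, so it keeps running until the spawned process and every descendant it picked up through the fork and clone options have exited.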
