seccomp - How to EXIT_SUCCESS?

Question

How do I EXIT_SUCCESS after strict mode seccomp is set? Is it correct practice to call syscall(SYS_exit, EXIT_SUCCESS); at the end of main?

#include <stdlib.h>
#include <unistd.h> 
#include <sys/prctl.h>     
#include <linux/seccomp.h> 
#include <sys/syscall.h>

int main(int argc, char **argv) {
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

  //return EXIT_SUCCESS; // does not work
  //_exit(EXIT_SUCCESS); // does not work
  // syscall(__NR_exit, EXIT_SUCCESS); // (EDIT) This works! Is this the ultimate answer and the right way to exit success from seccomp-ed programs?
  syscall(SYS_exit, EXIT_SUCCESS); // (EDIT) works; SYS_exit equals __NR_exit
}

// gcc seccomp.c -o seccomp && ./seccomp; echo "${?}" # I want 0

Answer 1

As explained on eigenstate.org and in seccomp(2):

The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Other system calls result in the delivery of a SIGKILL signal.

As a result, one would expect _exit() to work, but it is a wrapper function that invokes exit_group(2), which is not allowed in strict mode ([1], [2]), so the process gets killed.

This is even documented in the exit(2) Linux man page:

In glibc up to version 2.3, the _exit() wrapper function invoked the kernel system call of the same name. Since glibc 2.3, the wrapper function invokes exit_group(2), in order to terminate all of the threads in a process.

The same happens with the return statement: returning from main also ends in exit_group(2), so your process is killed in much the same way as with _exit().

Stracing the process provides further confirmation. For the final syscall to show up in the trace, do not set PR_SET_SECCOMP (just comment out the prctl() call). I got similar output for both non-working cases:

linux12:/home/users/grad1459>gcc seccomp.c -o seccomp
linux12:/home/users/grad1459>strace ./seccomp
execve("./seccomp", ["./seccomp"], [/* 24 vars */]) = 0
brk(0)                                  = 0x8784000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb775f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=97472, ...}) = 0
mmap2(NULL, 97472, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7747000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\226\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1730024, ...}) = 0
mmap2(NULL, 1739484, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xdd0000
mmap2(0xf73000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a3) = 0xf73000
mmap2(0xf76000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf76000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7746000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7746900, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xf73000, 8192, PROT_READ)     = 0
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0x16e000, 4096, PROT_READ)     = 0
munmap(0xb7747000, 97472)               = 0
exit_group(0)                           = ?
linux12:/home/users/grad1459>

As you can see, exit_group() is called, which explains everything!

Now, as you correctly stated, "SYS_exit equals __NR_exit"; for example, it is defined in mit.syscall.h:

#define SYS_exit __NR_exit

so the last two calls are equivalent, i.e. you can use whichever you like, and the output should be this:

linux12:/home/users/grad1459>gcc seccomp.c -o seccomp && ./seccomp ; echo "${?}" 
0

PS

You could of course define a filter yourself and use:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, filter);

as explained in the eigenstate link, to allow _exit() (or, strictly speaking, exit_group(2)), but do that only if you really need to and know what you are doing.

Answer 2

The problem occurs because, on Linux, the GNU C library implements the _exit() function with the exit_group syscall, if it is available, instead of exit (see sysdeps/unix/sysv/linux/_exit.c for verification), and, as documented in man 2 prctl, the exit_group syscall is not allowed by the strict seccomp filter.

Because the _exit() function call occurs inside the C library, we cannot interpose it with our own version (one that would just do the exit syscall). (The normal process cleanup is done elsewhere; in Linux, the _exit() function only does the final syscall that terminates the process.)

We could ask the GNU C library developers to use the exit_group syscall in Linux only when there is more than one thread in the current process. Unfortunately, that would not be easy, and even if it were added right now, it would take quite some time for the feature to be available on most Linux distributions.

Fortunately, we can ditch the default strict filter, and instead define our own. There is a small difference in behaviour: the apparent signal that kills the process will change from SIGKILL to SIGSYS. (The signal is not actually delivered, as the kernel does kill the process; only the apparent signal number that caused the process to die changes.)

Furthermore, this is not even that difficult. I did waste a bit of time looking into some GCC macro trickery that would make it trivial to manage the list of allowed syscalls, but I decided it would not be a good approach: the list of allowed syscalls should be carefully considered (here, we only add exit_group() compared to the strict filter), so making it a bit difficult is okay.

The following code, say example.c, has been verified to work on a 4.4 kernel (it should work on kernels 3.5 or later) on x86-64, for both x86 and x86-64 (i.e. 32-bit and 64-bit) binaries. It should work on all Linux architectures, however, and it does not require or use the libseccomp library.

#define  _GNU_SOURCE
#include <stdlib.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <stdio.h>

static const struct sock_filter  strict_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),

    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),

    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

static const struct sock_fprog  strict = {
    .len = (unsigned short)( sizeof strict_filter / sizeof strict_filter[0] ),
    .filter = (struct sock_filter *)strict_filter
};

int main(void)
{
    /* To be able to set a custom filter, we need to set the "no new privs" flag.
       The Documentation/prctl/no_new_privs.txt file in the Linux kernel
       recommends this exact form: */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        fprintf(stderr, "Cannot set no_new_privs: %m.\n");
        return EXIT_FAILURE;
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &strict)) {
        fprintf(stderr, "Cannot install seccomp filter: %m.\n");
        return EXIT_FAILURE;
    }

    /* The seccomp filter is now active.
       It differs from SECCOMP_SET_MODE_STRICT in two ways:
         1. exit_group syscall is allowed; it just terminates the
            process
         2. Parent/reaper sees SIGSYS as the killing signal instead of
            SIGKILL, if the process tries to do a syscall not in the
            explicitly allowed list
    */

    return EXIT_SUCCESS;
}

Compile using e.g.

gcc -Wall -O2 example.c -o example

and run using

./example

or under strace to see the syscalls made:

strace ./example

The strict_filter BPF program is really trivial. The first opcode loads the syscall number into the accumulator. The next five opcodes compare it to an acceptable syscall number, and if found, jump to the final opcode that allows the syscall. Otherwise the second-to-last opcode kills the process.

Note that although the documentation refers to sigreturn as the allowed syscall, the actual name of the syscall in Linux is rt_sigreturn. (sigreturn was deprecated in favour of rt_sigreturn ages ago.)

Furthermore, when the filter is installed, the opcodes are copied to kernel memory (see kernel/seccomp.c in the Linux kernel sources), so modifying the data later does not affect the filter in any way. In other words, making the structures static const has zero security impact.

I used static since there is no need for the symbols to be visible outside this compilation unit (or in a stripped binary), and const to put the data into the read-only data section of the ELF binary.

The form BPF_JUMP(BPF_JMP | BPF_JEQ, nr, equals, differs) is simple: the accumulator (the syscall number) is compared to nr. If they are equal, then the next equals opcodes are skipped. Otherwise, the next differs opcodes are skipped.

Since the equals cases jump to the very final opcode, you can add new compare opcodes at the top (that is, just after the initial opcode), giving each new one an equals skip count one larger than that of the opcode that follows it.

Note that printf() will not work after the seccomp filter is installed, because internally, the C library wants to do an fstat syscall (on standard output), and a brk syscall to allocate some memory for a buffer.

Answer 1
12 2016-11-06 23:52:23

Answer 2
7 2016-11-08 22:38:48
