How does this C program without libc work?

Question

I came across a minimal HTTP server that is written without libc: https://github.com/Francesco149/nolibc-httpd

I can see that basic string handling functions are defined, leading to the write syscall:

#define fprint(fd, s) write(fd, s, strlen(s))
#define fprintn(fd, s, n) write(fd, s, n)
#define fprintl(fd, s) fprintn(fd, s, sizeof(s) - 1)
#define fprintln(fd, s) fprintl(fd, s "\n")
#define print(s) fprint(1, s)
#define printn(s, n) fprintn(1, s, n)
#define printl(s) fprintl(1, s)
#define println(s) fprintln(1, s)

And the basic syscalls are declared in the C file:

size_t read(int fd, void *buf, size_t nbyte);
ssize_t write(int fd, const void *buf, size_t nbyte);
int open(const char *path, int flags);
int close(int fd);
int socket(int domain, int type, int protocol);
int accept(int socket, sockaddr_in_t *restrict address,
           socklen_t *restrict address_len);
int shutdown(int socket, int how);
int bind(int socket, const sockaddr_in_t *address, socklen_t address_len);
int listen(int socket, int backlog);
int setsockopt(int socket, int level, int option_name, const void *option_value,
               socklen_t option_len);
int fork();
void exit(int status);

So I guess the magic happens in start.S, which contains _start and a special way of encoding syscalls by creating global labels which fall through and accumulating values in r9 to save bytes:

.intel_syntax noprefix

/* functions: rdi, rsi, rdx, rcx, r8, r9 */
/*  syscalls: rdi, rsi, rdx, r10, r8, r9 */
/*                           ^^^         */
/* stack grows from a high address to a low address */

#define c(x, n) \
.global x; \
x:; \
  add r9,n

c(exit, 3)       /* 60 */
c(fork, 3)       /* 57 */
c(setsockopt, 4) /* 54 */
c(listen, 1)     /* 50 */
c(bind, 1)       /* 49 */
c(shutdown, 5)   /* 48 */
c(accept, 2)     /* 43 */
c(socket, 38)    /* 41 */
c(close, 1)      /* 03 */
c(open, 1)       /* 02 */
c(write, 1)      /* 01 */
.global read     /* 00 */
read:
  mov r10,rcx
  mov rax,r9
  xor r9,r9
  syscall
  ret

.global _start
_start:
  xor rbp,rbp
  xor r9,r9
  pop rdi     /* argc */
  mov rsi,rsp /* argv */
  call main
  call exit

Is this understanding correct? Does GCC use the symbols defined in start.S for the syscalls, and then the program starts in _start and calls main from the C file?

Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

Answer 1

(I cloned the repo and tweaked the .c and .S to compile better with clang -Oz: 992 bytes, down from the original 1208 with gcc. See the WIP-clang-tuning branch in my fork, until I get around to cleaning that up and sending a pull request. With clang, inline asm for the syscalls does save size overall, especially once main has no calls and no rets. IDK if I want to hand-golf the whole .asm after regenerating from compiler output; there are certainly chunks of it where significant savings are possible, e.g. using lodsb in loops.)

It looks like they need r9 to be 0 before a call to any of these labels, either with a register global var or maybe gcc -ffixed-r9 to tell GCC to keep its hands off that register permanently. Otherwise GCC would have left whatever garbage in r9, just like other registers.

Their functions are declared with normal prototypes, not 6 args with dummy 0 args to get every call site to actually zero r9, so that's not how they're doing it.

special way of encoding syscalls

I wouldn't describe that as "encoding syscalls". Maybe "defining syscall wrapper functions". They're defining their own wrapper function for each syscall, in an optimized way that falls through into one common handler at the bottom. In the C compiler's asm output, you'll still see call write.

(It might have been more compact for the final binary to use inline asm to let the compiler inline a syscall instruction with the args in the right registers, instead of making it look like a normal function that clobbers all the call-clobbered registers. Especially if compiled with clang -Oz which would use 3-byte push 2 / pop rax instead of 5-byte mov eax, 2 to set up the call number. push imm8 / pop / syscall is the same size as call rel32.)
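As a rough illustration of the inline-asm alternative mentioned above, here is a minimal sketch of a 3-argument syscall wrapper using GNU C extended asm on x86-64 Linux. The names raw_syscall3 and my_write are hypothetical and do not come from the repo; on other targets the sketch falls back to libc's syscall() purely so that it still compiles.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write, SYS_getpid */
#include <unistd.h>        /* syscall() for the fallback path */

/* Hypothetical helper (not from nolibc-httpd): issue a 3-argument syscall. */
static inline long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* rax = call number; args go in rdi, rsi, rdx.
       The syscall instruction itself clobbers rcx and r11.
       On failure the raw return value is -errno. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback: libc wrapper (returns -1 and sets errno on failure). */
    return syscall(nr, a1, a2, a3);
#endif
}

/* Hypothetical write() replacement built on the helper above. */
static inline long my_write(int fd, const void *buf, size_t n)
{
    return raw_syscall3(SYS_write, fd, (long)buf, (long)n);
}
```

Because the wrapper is inlined, the compiler only has to assume rcx, r11 and memory are clobbered, rather than every call-clobbered register as with a real call to an external write.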

Yes, you can define functions in hand-written asm with .global foo / foo:. You could look at this as one large function with multiple entry points for different syscalls. In asm, execution always passes to the next instruction, regardless of labels, unless you use a jump/call/ret instruction. The CPU doesn't know about labels.

So it's just like a C switch(){} statement without break; between case: labels, or like C labels you can jump to with goto. Except of course in asm you can do this at global scope, while in C you can only goto within a function. And in asm you can call instead of just goto (jmp).

    static long callnum = 0;     // r9 = 0  before a call to any of these

    ...
    socket:
       callnum += 38;
    close:
       callnum++;         // can use inc instead of add 1
    open:                 // missed optimization in their asm
       callnum++;
    write:
       callnum++;
    read:
       tmp=callnum;
       callnum=0;
       retval = syscall(tmp, args);

Or if you recast this as a chain of tailcalls, where we can omit even the jmp foo and instead just fall through: C like this truly could compile to the hand-written asm, if you had a smart enough compiler. (And you could solve the arg-type

register long callnum asm("r9");     // GCC extension

long open(args...) {
   callnum++;
   return write(args...);
}
long write(args...) {
   callnum++;
   return read(args...); // tailcall
}
long read(args...){
       tmp=callnum;
       callnum=0;            // reset callnum for next call
       return syscall(tmp, args...);
}

args... are the arg-passing registers (RDI, RSI, RDX, RCX, R8) which they simply leave unmodified. R9 is the last arg-passing register for x86-64 System V, but they didn't use any syscalls that take 6 args. setsockopt takes 5 args so they couldn't skip the mov r10, rcx. But they were able to use r9 for something else, instead of needing it to pass the 6th arg.
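The fall-through chain described above can be modelled in standard C with a switch that deliberately omits break. This is only a sketch of the arithmetic, not code from the repo: the enum names are hypothetical, and the local variable r9 stands in for the register, assumed to be 0 on entry as the asm requires.

```c
/* Entry points in the order they appear in start.S (hypothetical names). */
enum entry {
    E_EXIT, E_FORK, E_SETSOCKOPT, E_LISTEN, E_BIND, E_SHUTDOWN,
    E_ACCEPT, E_SOCKET, E_CLOSE, E_OPEN, E_WRITE, E_READ
};

/* Entering at a label executes every `add r9,n` from that label down to read. */
static long syscall_number(enum entry e)
{
    long r9 = 0;                      /* assumption: r9 == 0 before the call */
    switch (e) {
    case E_EXIT:       r9 += 3;       /* fall through */
    case E_FORK:       r9 += 3;       /* fall through */
    case E_SETSOCKOPT: r9 += 4;       /* fall through */
    case E_LISTEN:     r9 += 1;       /* fall through */
    case E_BIND:       r9 += 1;       /* fall through */
    case E_SHUTDOWN:   r9 += 5;       /* fall through */
    case E_ACCEPT:     r9 += 2;       /* fall through */
    case E_SOCKET:     r9 += 38;      /* fall through */
    case E_CLOSE:      r9 += 1;       /* fall through */
    case E_OPEN:       r9 += 1;       /* fall through */
    case E_WRITE:      r9 += 1;       /* fall through */
    case E_READ:       break;         /* read adds nothing: its number is 0 */
    }
    return r9;                        /* the value `mov rax,r9` would load */
}
```

For example, syscall_number(E_SOCKET) sums 38 + 1 + 1 + 1 = 41, matching the comment next to c(socket, 38) in start.S.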

It's amusing that they're trying so hard to save bytes at the expense of performance, but still use xor rbp,rbp instead of xor ebp,ebp. Unless they build with gcc -Wa,-Os start.S, GAS won't optimize away the REX prefix for you. (See: Does GCC optimize assembly source file?)

They could save another byte with xchg rax, r9 (2 bytes including REX) instead of mov rax, r9 (REX + opcode + modrm). (See: Codegolf.SE tips for x86 machine code)

I'd also have used xchg eax, r9d because I know Linux system call numbers fit in 32 bits, although it wouldn't save code size because a REX prefix is still needed to encode the r9d register number. Also, in the cases where they only need to add 1, inc r9d is only 3 bytes, vs. add r9d, 1 being 4 bytes (REX + opcode + modrm + imm8). (The no-modrm short-form encoding of inc is only available in 32-bit mode; in 64-bit mode it's repurposed as a REX prefix.)

mov rsi,rsp could also save a byte as push rsp / pop rsi (1 byte each) instead of 3-byte REX + mov. That would make room for returning main's return value with xchg edi, eax before call exit.
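To make the byte counting in the last few paragraphs concrete, here is a small sketch that writes out the x86-64 machine-code bytes for each pair of instructions and compares their sizes. The array names and the total_saved helper are hypothetical; the encodings are the standard ones an assembler would typically emit, written out by hand rather than taken from the repo's binary.

```c
#include <stddef.h>

/* Original instruction vs. the shorter alternative suggested above. */
static const unsigned char xor_rbp_rbp[]  = { 0x48, 0x31, 0xED };       /* xor rbp,rbp  */
static const unsigned char xor_ebp_ebp[]  = { 0x31, 0xED };             /* xor ebp,ebp  */

static const unsigned char mov_rax_r9[]   = { 0x4C, 0x89, 0xC8 };       /* mov rax,r9   */
static const unsigned char xchg_rax_r9[]  = { 0x49, 0x91 };             /* xchg rax,r9  */

static const unsigned char add_r9_1[]     = { 0x49, 0x83, 0xC1, 0x01 }; /* add r9,1     */
static const unsigned char inc_r9d[]      = { 0x41, 0xFF, 0xC1 };       /* inc r9d      */

static const unsigned char mov_rsi_rsp[]  = { 0x48, 0x89, 0xE6 };       /* mov rsi,rsp  */
static const unsigned char push_pop_rsi[] = { 0x54, 0x5E };             /* push rsp ; pop rsi */

/* Bytes saved if every suggestion were applied once, except inc, which
   applies at the five labels that add exactly 1 (listen, bind, close,
   open, write). */
static size_t total_saved(void)
{
    size_t saved = 0;
    saved += sizeof xor_rbp_rbp - sizeof xor_ebp_ebp;
    saved += sizeof mov_rax_r9  - sizeof xchg_rax_r9;
    saved += 5 * (sizeof add_r9_1 - sizeof inc_r9d);
    saved += sizeof mov_rsi_rsp - sizeof push_pop_rsi;
    return saved;
}
```

Each substitution saves a single byte, which is why these tricks only matter when the whole binary is around a kilobyte.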

But since they're not using libc, they could inline that exit, or put the syscalls below _start so they can just fall into it, because exit happens to be the highest-numbered syscall! Or at least jmp exit since they don't need stack alignment, and jmp rel8 is more compact than call rel32.

Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

No, that's fully stand-alone, incorporating the start.S code (at the ?_017: label), and maybe hand-tweaked compiler output. Perhaps from hand-tweaking disassembly of a linked executable, hence not having nice label names even for the part from the hand-written asm. (Specifically, from Agner Fog's objconv, which uses that format for labels in its NASM-syntax disassembly.)

(Ruslan also pointed out stuff like jnz after cmp, instead of jne which has the more appropriate semantic meaning for humans, which is another sign of it being compiler output, not hand-written.)

I don't know how they arranged to get the compiler not to touch r9. It seems to be just luck. The readme indicates that just compiling the .c and .S works for them, with their GCC version.

As for the ELF headers, see the comment at the top of the file, which links to A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. You'd assemble this with nasm -fbin, and the output is a complete ELF binary, ready to run. It's not a .o that you need to link + strip, so you get to account for every single byte in the file.

Answer 2

You're pretty much correct about what's going on. Very interesting, I've never seen something like this before. Basically, as you said, when one of the labels is called, execution falls through every label below it and r9 keeps adding up until it reaches read, whose syscall number is 0. This is why the ordering is pretty clever. Assuming r9 is 0 before read is called (the read label itself zeroes r9 again before executing the syscall instruction), no adding is needed because r9 already holds the correct syscall number. write's syscall number is 1, so it only needs to add 1 to 0, which is what the macro call shows. open's syscall number is 2, so 1 is added at the open label, then 1 again at the write label, and then the correct syscall number is put into rax at the read label. And so on. Parameter registers like rdi, rsi, rdx, etc. are also not touched, so it basically acts like a normal function call.
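The ordering can also be checked from the other direction. Given the syscall numbers listed in the comments of start.S, sorted from highest to lowest, the operand of each add r9,n is simply the gap to the next lower number. The sketch below computes those gaps; the names sysno, NCALLS and compute_deltas are made up for this illustration.

```c
#include <stddef.h>

/* Linux x86-64 syscall numbers from the start.S comments, highest first:
   exit, fork, setsockopt, listen, bind, shutdown, accept, socket,
   close, open, write, read. */
static const int sysno[] = { 60, 57, 54, 50, 49, 48, 43, 41, 3, 2, 1, 0 };
#define NCALLS (sizeof sysno / sizeof sysno[0])

/* delta[i] is the amount label i must add so that falling through the
   remaining labels ends up at sysno[i]. The last entry (read) adds nothing. */
static void compute_deltas(int delta[NCALLS])
{
    for (size_t i = 0; i + 1 < NCALLS; i++)
        delta[i] = sysno[i] - sysno[i + 1];
    delta[NCALLS - 1] = 0;
}
```

The resulting deltas (3, 3, 4, 1, 1, 5, 2, 38, 1, 1, 1, 0) match the second argument of each c(...) macro invocation, with read as the label that needs no add at all.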

Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

I'm assuming you're talking about this file. Not sure exactly what's going on here, but it looks like an ELF file is manually being created, probably to reduce size further.


Answer 1: score 11, accepted, 2021-03-29 17:54:05

Answer 2: score 6, 2021-03-29 09:22:32
