
How does the Linux kernel "listen" to the C library?

Original post: 2015-04-29 12:05:18 | Tags: c, linux-kernel, system-calls, userspace

I'm trying to build up a "big picture" of how things work in the Linux kernel and userspace, and I'm quite confused. I know that userspace makes use of system calls to "talk" to the kernel, but I don't know how. I tried to read the C library and kernel source code, but both are complex and not easy to understand. I've also read several books covering operating system concepts, like managing processes, memory and devices, but they don't make the "transition" (userspace -> kernel) clear. So, where exactly does the transition between userspace and kernel space happen? How does the C library run code that lives inside the Linux kernel running on the machine?

To make an analogy: imagine that there is a house. The house is locked. The key to open the house is inside the house itself. There's only one person inside the house, the kernel. The userspace is someone trying to enter the house. My question would be: how does the kernel know there's someone outside the house wanting the key, and which mechanism allows the house to be opened with that key?

2 answers

That's quite easy: the person can use the doorbell to let the kernel know they are waiting outside. In our case this doorbell is usually a special CPU exception, a software interrupt, or a dedicated instruction that a user-space application is allowed to use and the kernel can handle.

So the procedure is like this:

1. First you need to know the system call number. Each syscall has its own unique number, and there is a table inside the kernel that maps those numbers to specific functions. The numbering is per architecture: on two different architectures the same number may map to different syscalls.
2. Then you set up your arguments. This is also architecture specific, but it is not much different from passing arguments in an ordinary function call. Usually you put the arguments in specific CPU registers, as described in the ABI for that architecture.
3. Then you enter the syscall. Depending on the architecture, this may mean raising an exception or executing a dedicated CPU instruction.
4. The kernel has a special handler function that runs in kernel mode when a syscall is made. It pauses execution of the user-space code and saves the process's register state (often loosely called a context switch, although strictly it is a switch from user mode to kernel mode within the same process), reads the syscall number and arguments, and calls the proper syscall routine. It also makes sure to put the return value in the proper place for user-space to read, and to resume the process with its saved state restored once the syscall routine is done.

As an example, to make a syscall on x86_64 you use the syscall instruction with the syscall number in the %rax register. Arguments are passed in registers %rdi, %rsi, %rdx, %r10, %r8 and %r9. Note that the fourth argument goes in %r10 rather than %rcx as in ordinary function calls, because the syscall instruction itself overwrites %rcx. The return value comes back in %rax.

You could also use an older way that was used on 32-bit x86 CPUs: software interrupt number 0x80 (the int 0x80 instruction). Here the syscall number is specified in the %eax register and the arguments go to (again, if I'm not mistaken) %ebx, %ecx, %edx, %esi, %edi, %ebp.

ARM is very similar: you use the "supervisor call" instruction (SVC #0). Your syscall number goes in the r7 register, the arguments go in registers r0-r6, and the return value of the syscall is stored in r0.

Other architectures and operating systems use similar techniques. The details may vary: software interrupt numbers may be different, and arguments may be passed in different registers or even on the stack, but the core idea is the same.

Many processors have an instruction that raises a specific "trap" or "interrupt", and the Linux kernel sets up such a trap or interrupt specifically for system calls.

The library sets up the processor registers in a certain way and then executes the special trap or interrupt instruction. This causes the processor to enter privileged mode and call the kernel's trap/interrupt handler function, which decodes the values in the registers and calls the appropriate function to handle the system call.

That is the most common way, and it is basically how it's done on just about every system that needs isolation between kernel and user-space.