
Where and why do read(2) and write(2) system calls copy to and from userspace?

I was reading about sendfile(2) recently, and the man page states:

 sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.

It got me thinking, why exactly is the combination of read() / write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.

Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?

What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?

why exactly is the combination of read() / write() slower?

The manual page is quite clear about this. Doing a read() followed by a write() requires copying the data twice.

Why does the copy exist in the first place?

It should be quite obvious: since you invoke read, you want the data to be copied into the memory of your process, at the specified destination buffer. The same goes for write: you want the data to be copied from the memory of your process. The kernel doesn't really know that you just want to do a read + write, and that copying back and forth twice could be avoided.

When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write, the data is copied by the kernel from the process memory to the file descriptor.
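As a concrete illustration, here is a minimal userspace sketch of the read + write combination (the copy_fd name and buffer size are arbitrary choices, not anything from the question): both copies pass through the same userspace buffer.

/* Minimal sketch: copying one file descriptor to another with read()/write().
 * Each read() copies from kernel memory into `buf` (first copy), and each
 * write() copies from `buf` back into kernel memory (second copy). */
#include <unistd.h>

ssize_t copy_fd(int src_fd, int dst_fd)
{
    char buf[64 * 1024];          /* userspace buffer both copies go through */
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(src_fd, buf, sizeof(buf));   /* kernel -> user copy */
        if (n <= 0)
            return n < 0 ? n : total;                 /* error or EOF */

        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(dst_fd, buf + off, n - off);  /* user -> kernel copy */
            if (w < 0)
                return w;
            off += w;
        }
        total += n;
    }
}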

Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?

The crucial point here is that when you read or write a file, the kernel has to map the file's contents from disk into memory before they can be read or written. This is called memory-mapped file I/O, and it is a huge factor in the performance of modern operating systems.

The file content is already present in kernel memory, mapped as a memory page (or more). In the case of a read, the data needs to be copied from that kernel memory page to the process memory, while in the case of a write, the data needs to be copied from the process memory to the kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).

This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.

The sendfile syscall is much faster because you do not need to perform the copy twice, but only once. Assuming that you want to do a sendfile of file A to file B, all the kernel needs to do is copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't actually need to read or manipulate the data, it's a complete waste of time.
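A hedged sketch of that single-copy path (the copy_with_sendfile helper and the file mode are illustrative choices; note that using a regular file as the output descriptor requires Linux 2.6.33 or later, before which the output had to be a socket):

/* Sketch: copy file A to file B entirely inside the kernel with sendfile(2). */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int copy_with_sendfile(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    struct stat st;
    if (fstat(in, &st) < 0) { close(in); return -1; }

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* Data goes straight from A's kernel pages to B: no userspace buffer. */
        ssize_t n = sendfile(out, in, &offset, st.st_size - offset);
        if (n < 0) { close(in); close(out); return -1; }
    }

    close(in);
    close(out);
    return 0;
}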

FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can be seen from the source code), which is a more generic syscall for zero-copy data transfer between file descriptors.
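For completeness, a minimal sketch of splice itself (the splice_copy helper is an assumed name; one end of each splice() call must be a pipe, which here serves as the in-kernel buffer):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to `len` bytes from src_fd to dst_fd without a userspace buffer. */
ssize_t splice_copy(int src_fd, int dst_fd, size_t len)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t moved = 0;
    while (len > 0) {
        /* First splice: src_fd -> pipe (data stays in kernel memory). */
        ssize_t in = splice(src_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (in <= 0) {
            if (in < 0)
                moved = -1;
            break;                          /* error or EOF */
        }
        /* Second splice: drain the pipe into dst_fd. */
        while (in > 0) {
            ssize_t out = splice(p[0], NULL, dst_fd, NULL, in, SPLICE_F_MOVE);
            if (out <= 0) {
                moved = -1;
                goto done;
            }
            in    -= out;
            len   -= out;
            moved += out;
        }
    }
done:
    close(p[0]);
    close(p[1]);
    return moved;
}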

I took a short look at the kernel code for read and write but didn't see the copy.

In terms of kernel code, the whole process of reading a file is quite involved, but what the kernel ends up doing is a "special" version of memcpy() called copy_to_user(), which copies the contents of the file from kernel memory to userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar: both end up calling raw_copy_to_user() (which is architecture-dependent).
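For illustration only, here is where that call sits in a trivial character-device read handler. This is not the actual VFS read path, just a hedged sketch of a kernel-side read implementation using copy_to_user():

/* Illustrative fragment of a hypothetical char-device driver. */
#include <linux/fs.h>
#include <linux/uaccess.h>

static const char msg[] = "hello from kernel space\n";

static ssize_t demo_read(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
    size_t avail = sizeof(msg) - 1;

    if (*ppos >= avail)
        return 0;                           /* EOF */
    if (count > avail - *ppos)
        count = avail - *ppos;

    /* Checks the userspace pointer, then copies `count` bytes out of kernel
     * memory; returns the number of bytes that could NOT be copied. */
    if (copy_to_user(buf, msg + *ppos, count))
        return -EFAULT;

    *ppos += count;
    return count;
}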

What about asynchronous IO interfaces like AIO and io_uring ? Do they also copy?

The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e., they still use read and write under the hood), so they still copy data to/from userspace.
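A minimal sketch of that POSIX AIO interface (the file path is just an example, and on glibc you link with -lrt; glibc services the request from helper threads that call the regular read path, so the data is still copied into buf):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096];

    int fd = open("/etc/hostname", O_RDONLY);    /* any readable file works */
    if (fd < 0)
        return 1;

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0)                       /* queue the request */
        return 1;

    while (aio_error(&cb) == EINPROGRESS)        /* poll for completion */
        usleep(1000);

    ssize_t n = aio_return(&cb);                 /* bytes read, like read(2) */
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}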

io_uring can provide zero-copy operations when using the O_DIRECT flag of open (see the manual page):

 O_DIRECT (since Linux 2.4.10) Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

This should be done carefully though, as it can very well degrade performance if the userspace application does not do the appropriate caching on its own (where needed).
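A hedged sketch using liburing (the file name and sizes are assumptions; O_DIRECT requires the buffer, offset, and length to be suitably aligned, typically to the device block size):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))             /* aligned for O_DIRECT */
        return 1;

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    /* Queue a single read: with O_DIRECT the data goes straight into `buf`,
     * bypassing the kernel page cache. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                    /* cqe->res = bytes read or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}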

See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.
