
How do you read a large file with a small buffer in assembly with system calls? Does read append a \0?

Original post: 2022-11-21 20:44:38 | Tags: linux, assembly, file-io, x86-64, system-calls

Calling the read syscall on a file larger than my buffer means the buffer only captures the first part of the file. Calling it again seems to have no effect; it still only gives the first part of the file. Say the file is 1 GB and the buffer is 1024 bytes: then we only ever access the first 1024 bytes of the large file. Is there any way to access the rest of the file without increasing the buffer size?

I couldn't find any flag for open that deals with this on this website: https://linuxhint.com/list_of_linux_syscalls/#open-flags (unless I misunderstood the descriptions).

I initially thought that the computer would fill the buffer with the second 1024 bytes when I made the syscall a second time (like it does in C, IIRC). In my actual case the text file was about 1300 B and the buffer was 512 B, so resizing the buffer would not have been a problem, but I wanted to know how this is dealt with in general.
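That initial expectation matches how read behaves on Linux: the kernel keeps a current offset for each open file, and every successful read advances it by the number of bytes returned. The C sketch below illustrates this with the 1300-byte file and 512-byte reads mentioned above. The helper name `offset_after_reads` is hypothetical, and the x86-64 assembly version would make the same read and lseek syscalls.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open `path`, issue `reads` consecutive read() calls of
   at most `chunk` bytes each into the same scratch buffer, and return the
   file offset afterwards (or -1 on error). */
off_t offset_after_reads(const char *path, size_t chunk, int reads)
{
    char scratch[1024];
    assert(path != NULL);
    if (chunk > sizeof scratch)
        chunk = sizeof scratch;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (int i = 0; i < reads; i++) {
        /* Each call continues where the previous one stopped. */
        if (read(fd, scratch, chunk) < 0) {
            close(fd);
            return -1;
        }
    }

    off_t pos = lseek(fd, 0, SEEK_CUR);  /* query the offset without moving it */
    close(fd);
    return pos;
}
```

For a 1300-byte regular file and `chunk = 512`, the offset after one, two and three reads would be 512, 1024 and 1300; a fourth read returns 0 and the offset stays at 1300.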

Is there some other syscall to break the file into pieces or to turn it into some kind of stream-like object? I know there's a bash split command. How do C and my OS deal with files like this? C has an option to consume a file one bite at a time; is it really using a very large buffer underneath? It feels wasteful to be forced to copy the full file into a separate buffer, and I would be surprised if there were no alternative.

EDIT: Sorry, it turns out there was no problem with any syscall. What happened was that I expected a null byte or some other special character to signify the end of the file, and I used that to decide when to stop refilling and printing my buffer. There was no such byte. The syscall only overwrites the buffer up to the end of the file's data and leaves the rest of the buffer unchanged, so when I printed it, the output looked like it was looping, and at the end it seemed that part of the file was unfinished. In reality it had finished, but it was followed by repeated text left over from the previous buffer refill. ~~The book I was reading (Programming from the Ground Up) said the syscall would also add a \0 at the end so I can check for that.~~ ~~It was about 32-bit assembly so the syscall might have changed.~~ [Edit 2: Sorry, it turns out I misread the book; see the answer.] Now I'm using the return value of the syscall, which is the number of bytes the system wrote into the buffer, to check when to stop and to print without repeating parts of the previous buffer.
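A minimal C sketch of that fix, assuming a 512-byte buffer: loop on read, stop when it returns 0, and write exactly the number of bytes it returned. The helper name `copy_fd` is hypothetical; an assembly version would do the same thing by testing the value left in rax after each syscall.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd through a small
   fixed-size buffer. Returns the total number of bytes copied, or -1 on error. */
long copy_fd(int in_fd, int out_fd)
{
    char buf[512];                 /* deliberately smaller than the file */
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)                /* 0 means end of file; no '\0' is stored */
            break;
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }

        /* Only buf[0..n-1] holds fresh data, so write exactly n bytes. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
    return total;
}
```

Calling `copy_fd(fd, STDOUT_FILENO)` on an open file would print it once, with no repeated tail, regardless of how the file size relates to the buffer size.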

tl;dr - misunderstood a syscall

1 answer

What happened was that I first misread the following passage about reading lines from Programming from the Ground Up and accidentally replaced "line" with "file" in my head:

For an example, let's say that you want to read in a single line of text from a file but you do not know how long that line is. You would then simply read a large number of bytes/characters from the file into a buffer, look for the end-of-line character, and copy all of the characters to that end-of-line character to another location. If you didn't find an end-of-line character, you would allocate another buffer and continue reading. You would probably wind up with some characters left over in your buffer in this case, which you would use as the starting point when you next need data from the file.

In reality, a few paragraphs earlier, the book stated:

The write system call will give back the number of bytes written in %eax or an error code.

It says nothing about null bytes. If I had read the book's example program, I would also have realised my mistake. The same goes, I think, if I had made my buffer larger than the file.

As for what happened in my code: I expected a null byte or some other special character to signify the end of the file, and I used that to decide when to stop refilling and printing my buffer. The syscall only overwrites the buffer up to the end of the file's data and leaves the rest of the buffer unchanged, so my printing loop never stopped, and at the end of each buffer write it looked as if part of the file was unfinished. In reality it had finished, but it was followed by repeated text left over from the previous buffer refill.

Well, technically, I realise now that the buffer only gets partially refilled once, at the end of the file. After that the reads return 0 and don't change the buffer at all, so I was just rewriting that last buffer until I stopped the program.
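The stale-tail effect can be reproduced with a small C sketch. It assumes, as observed on Linux in practice, that read leaves untouched any bytes of the buffer beyond the count it returns. The helper name `drain_keep_last` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read fd to the end through one reused buffer and
   return the length of the last non-empty chunk. Afterwards buf holds that
   chunk followed by leftover bytes from the chunk before it. */
size_t drain_keep_last(int fd, char *buf, size_t cap)
{
    size_t last = 0;

    assert(buf != NULL && cap > 0);

    for (;;) {
        ssize_t n = read(fd, buf, cap);
        if (n <= 0)            /* 0 = end of file, negative = error */
            break;             /* either way, buf is not rewritten here */
        last = (size_t)n;      /* only buf[0..n-1] was refreshed */
    }
    return last;
}
```

With a 10-byte input `ABCDEFGHIJ` and an 8-byte buffer, the function returns 2 and the buffer ends up holding `IJCDEFGH`: two fresh bytes followed by six stale ones. Printing all 8 bytes instead of the returned 2 gives exactly the repeated text described above.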