
What is more efficient in C: reading a file word by word, or reading a line at a time and splitting the string?

I want to develop an application in C where I need to check a file on disk word by word. I've been told that reading a line from the file and then splitting it into words is more efficient, as fewer file accesses are required. Is it true?

If you know you're going to need the entire file, you may as well read it in chunks as large as you can (at the extreme end, you would memory-map the entire file in one go). You are right that the benefit comes from needing fewer file accesses.
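As a rough sketch of that memory-mapping extreme (assuming a POSIX system, since mmap is not part of standard C; the file name is a placeholder):

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.txt", O_RDONLY);      /* placeholder file name */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return EXIT_FAILURE; }

    /* Map the whole file into memory in one go, read-only. */
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* ... scan data[0 .. st.st_size-1] for words here ... */

    munmap(data, (size_t)st.st_size);
    close(fd);
    return EXIT_SUCCESS;
}
```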

But if your program is not slow, then write it in whatever way is fastest and least bug-prone for you to develop. Premature optimization is a grievous sin.

Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.

The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit, but at a cost in complexity.

But for reading words, it's likely that scanf() with a protective string format specifier, such as %99s if your array is char word[100];, would work fine and is probably simpler to code.
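A minimal sketch of that approach (the file name is an assumption; %99s reads one whitespace-delimited word and leaves room for the terminator in char word[100]):

```c
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "r");     /* placeholder file name */
    if (fp == NULL) { perror("fopen"); return 1; }

    char word[100];
    /* %99s skips leading whitespace and stops at the next whitespace,
     * writing at most 99 characters plus the terminating '\0'. */
    while (fscanf(fp, "%99s", word) == 1) {
        puts(word);                          /* process each word here */
    }

    fclose(fp);
    return 0;
}
```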

If your definition of a word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.

As far as splitting is concerned, there is no difference with respect to performance. You are splitting on whitespace in one case and on newlines in the other.

However, it would have an impact in the word case in that you would need to allocate buffers M times, while in the line case it will be N times, where M > N. So if you adopt the word-split approach, try to calculate the total memory needed first, allocate one chunk of that size (so you don't end up with M fragmented chunks), and later carve the M buffers out of that chunk. Note that the same approach can be applied when splitting into lines, but the difference will be less visible.
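A rough sketch of that idea, assuming the total size of all words is known or estimated up front (the names here are illustrative only, and the capacity check is omitted for brevity):

```c
#include <stdlib.h>
#include <string.h>

/* Carve word buffers out of one big allocation instead of calling
 * malloc() once per word. */
struct word_pool {
    char  *mem;    /* one chunk holding every word, '\0'-separated */
    size_t used;   /* bytes handed out so far */
};

int pool_init(struct word_pool *p, size_t total_bytes)
{
    p->mem  = malloc(total_bytes);
    p->used = 0;
    return p->mem != NULL;
}

char *pool_copy_word(struct word_pool *p, const char *src, size_t len)
{
    char *dst = p->mem + p->used;   /* no capacity check, sketch only */
    memcpy(dst, src, len);
    dst[len] = '\0';
    p->used += len + 1;             /* advance past the terminator */
    return dst;
}
```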

This is correct; you should read them into a buffer and then split it into whatever you define as 'words'. The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).

The major performance bottlenecks will likely be:

  • Any call to a stdio file I/O function. The fewer calls, the less overhead.
  • Dynamic memory allocation. It should be done as sparingly as possible; ultimately, a lot of calls to malloc will cause heap fragmentation.

So what it boils down to is a classic programming trade-off: you can get either fast execution or low memory usage. You can't get both, but you can find a suitable middle ground that is effective in terms of both execution time and memory consumption.

At one extreme, the fastest possible execution is obtained by reading the whole file as one big chunk and loading it into dynamic memory. At the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.

You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API calls, and so on will all matter.

Why not try a few different ways and benchmark them?

This is not actually an answer to your exact question (words vs. lines), but if you need all the words in memory at the same time anyway, then the most efficient approach is this:

  1. determine the file size
  2. allocate a buffer for the entire file plus one byte
  3. read the entire file into the buffer, and put '\0' in the extra byte
  4. make a pass over it and count how many words it has
  5. allocate a char* (pointers to words) or int (indexes into the buffer) index array, with size matching the word count
  6. make a second pass over the buffer, store the addresses or indexes of the first letters of the words in the index array, and overwrite the other bytes in the buffer with '\0' (end-of-string markers)

If you have plenty of memory, then it's probably slightly faster to just assume the worst case for the number of words: (filesize+1)/2 (one-letter words with one space in between, with an odd number of bytes in the file). Also, taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double its size when the word count exceeds the current capacity, will be quite efficient (because doubling gives exponential growth, reallocation will not happen many times).
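A condensed sketch of the numbered steps above, using the two-pass counting variant (error handling trimmed; the file name is a placeholder and a "word" is assumed to be any run of non-whitespace characters):

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "rb");            /* placeholder name */
    if (!fp) { perror("fopen"); return 1; }

    /* Steps 1-3: determine size, allocate size+1 bytes, read it all,
     * and terminate the buffer with '\0'. */
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    char *buf = malloc((size_t)size + 1);
    fread(buf, 1, (size_t)size, fp);
    buf[size] = '\0';
    fclose(fp);

    /* Step 4: first pass, count the words. */
    size_t nwords = 0;
    for (char *p = buf; *p; ) {
        while (*p && isspace((unsigned char)*p)) p++;
        if (*p) { nwords++; while (*p && !isspace((unsigned char)*p)) p++; }
    }

    /* Step 5: allocate the index array to match the word count. */
    char **words = malloc(nwords * sizeof *words);

    /* Step 6: second pass, record word starts and overwrite the
     * separators with '\0' so each word is a C string. */
    size_t i = 0;
    for (char *p = buf; *p; ) {
        while (*p && isspace((unsigned char)*p)) *p++ = '\0';
        if (*p) { words[i++] = p; while (*p && !isspace((unsigned char)*p)) p++; }
    }

    printf("%zu words\n", nwords);

    free(words);
    free(buf);
    return 0;
}
```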
