
Parallel Disk I/O

I have several logfiles that I would like to read. Without loss of generality, let's say the logfile processing is done as follows:

def process(infilepath):
    answer = 0
    with open(infilepath) as infile:
        for line in infile:
            # someStr is a prefix string assumed to be defined elsewhere
            if line.startswith(someStr):
                answer += 1
    return answer

Since I have a lot of logfiles, I wanted to throw multiprocessing at this problem (my first mistake: I should probably have used multithreading; someone please tell me why).
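A minimal sketch of what that multiprocessing attempt might look like, assuming a hypothetical list logfile_paths and using multiprocessing.Pool (the original post does not show this code):

    from multiprocessing import Pool

    logfile_paths = ["a.log", "b.log", "c.log"]  # hypothetical paths

    if __name__ == "__main__":
        with Pool() as pool:
            # each worker calls process() on one logfile, so reads are issued concurrently
            answers = pool.map(process, logfile_paths)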

While doing so, it occurred to me that any form of parallel processing should be effectively useless here, since I'm constrained by the fact that there is only one read head on my HDD, and therefore, only one file may be read at a time. In fact, under this reasoning, due to the fact that lines from different files may be requested simultaneously, the read head may need to move significantly from time to time, causing the multiproc approach to be slower than a serial approach. So I decided to go back to a single process to read my logfiles.

Interestingly though, I noticed that I did get a speedup with small files (<= 40KB), and that it was only with large files (>= 445MB) that the expected slow-down occurred.

This leads me to believe that python may read files in chunks whose size exceeds the single line I request at a time.

Q1: So what is the file-reading mechanism under the hood?

Q2: What is the best way to optimize the reading of files from a conventional HDD?

Technical specs:

  • python3.3
  • 5400rpm conventional HDD
  • Mac OSX 10.9.2 (Mavericks)

The observed behavior is a result of:

  1. BufferedIO
  2. a scheduling algorithm that decides the order in which the requisite sectors of the HDD are read

BufferedIO

Depending on the OS and the read block size, it is possible for the entire file to fit into one block, which is then read in a single read command. This is why the smaller files are read more easily.
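This buffering is observable from Python itself; a small sketch, using a hypothetical file name (io.DEFAULT_BUFFER_SIZE and the buffering argument to open() are standard library):

    import io

    # the chunk size Python's buffered reader requests from the OS by default
    print(io.DEFAULT_BUFFER_SIZE)  # commonly 8192 bytes, but platform-dependent

    # a file smaller than the buffer is fetched with a single underlying read;
    # the buffer size can also be set explicitly
    with open("small.log", buffering=io.DEFAULT_BUFFER_SIZE) as infile:  # hypothetical file
        first = infile.readline()   # triggers one block-sized read from the OS
        second = infile.readline()  # served from the in-memory buffer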

Scheduling Algorithm

Larger files (filesize > read block size) have to be read in block-size chunks. Thus, when a read is requested on each of several files (due to the multiprocessing), the needle has to move to different sectors of the HDD (corresponding to where the files live). This repetitive movement does two things:

  1. increases the time between successive reads on the same file
  2. throws off the read-sector predictor, as a file may span multiple sectors

The time between successive reads of the same file matters: if the computation performed on a chunk of lines completes before the read head can provide the next chunk of lines from the same file, the process simply waits until another chunk of lines becomes available. This is one source of slowdowns.

Throwing off the read-predictor is bad for pretty much the same reasons that throwing off the branch predictor is bad.

With the combined effect of these two issues, processing many large files in parallel would be slower than processing them serially. Of course, this is all the more true when processing blockSize many lines finishes before numProcesses * blockSize many lines can be read off the HDD.
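This trade-off can be checked empirically by timing both approaches; a minimal sketch, assuming process() from the question and a hypothetical list of large logfiles:

    import time
    from multiprocessing import Pool

    paths = ["big1.log", "big2.log", "big3.log"]  # hypothetical logfiles

    if __name__ == "__main__":
        t0 = time.perf_counter()
        serial = [process(p) for p in paths]     # one file at a time, sequential head movement
        t1 = time.perf_counter()
        with Pool() as pool:
            parallel = pool.map(process, paths)  # concurrent reads force the head to seek
        t2 = time.perf_counter()
        print("serial: %.2fs, parallel: %.2fs" % (t1 - t0, t2 - t1))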

Another idea would be to profile your code:

try:
    import cProfile as profile  # C implementation, lower overhead
except ImportError:
    import profile              # pure-Python fallback

profile.run("process(infilepath)")  # infilepath must be defined in scope
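The stats can also be written to a file and sorted to see where the time is actually spent; a sketch using the standard pstats module (the stats filename is hypothetical):

    import pstats

    profile.run("process(infilepath)", "process.stats")  # hypothetical output file
    stats = pstats.Stats("process.stats")
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries by cumulative time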

Here is an example of using a memory-mapped file:

import mmap

with open("hello.txt", "r+b") as f:
    mapf = mmap.mmap(f.fileno(), 0)  # length 0 maps the entire file
    print(mapf.readline())           # readline() returns bytes
    mapf.close()
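For comparison, a sketch of process() reworked over a read-only memory map; note that mmap yields bytes, so the prefix (a hypothetical stand-in for the original someStr) must be a bytes object:

    import mmap

    def process_mmap(infilepath, prefix=b"ERROR"):  # prefix stands in for someStr
        answer = 0
        with open(infilepath, "rb") as infile:
            with mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                # readline() returns b"" at end of file
                for line in iter(mapped.readline, b""):
                    if line.startswith(prefix):
                        answer += 1
        return answer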
