

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches, but the overhead of distributing the data line by line is immense. I've settled on a lightweight pipe-filters design using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like to have a solution contained entirely in Python.

Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).

Thanks, Vince

Additional information: Processing time per line varies. Some problems are fast and barely not I/O bound, some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall-clock time.

A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O-bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or queue in multiprocessing, it is always over 100% slower.

One of the best architectures is already part of the Linux OS. No special libraries are required.

You want a "fan-out" design.

  1. A "main" program creates a number of subprocesses connected by pipes. “主”程序创建了许多通过管道连接的子进程。

  2. The main program reads the file, writing lines to the pipes and doing the minimum filtering required to deal the lines to the appropriate subprocesses.

Each subprocess should probably be a pipeline of distinct processes that read and write from stdin.

You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.
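A minimal sketch of that fan-out, assuming each worker is a separate filter script that reads lines from stdin (the worker.py script and its --out-file flag here are placeholders, not existing tools):

```python
import subprocess
import sys

NUM_WORKERS = 4  # how many subprocesses to fan out to

def fan_out(path):
    # Start the worker subprocesses; each reads lines from its own stdin pipe.
    # "worker.py" stands in for whatever per-line filter you actually run.
    workers = [
        subprocess.Popen([sys.executable, "worker.py", "--out-file", f"out{i}"],
                         stdin=subprocess.PIPE)
        for i in range(NUM_WORKERS)
    ]
    # The main program only does the minimal work of dealing lines out.
    with open(path, "rb") as f:
        for i, line in enumerate(f):
            workers[i % NUM_WORKERS].stdin.write(line)
    # Close the pipes so the workers see EOF, then wait for them to finish.
    for w in workers:
        w.stdin.close()
        w.wait()

if __name__ == "__main__":
    fan_out(sys.argv[1])
```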

One strategy is to assign each worker an offset: if you have eight worker processes, you assign them the numbers 0 to 7. Worker number 0 reads the first record, processes it, then skips 7 and goes on to process the 8th record, and so on; worker number 1 reads the second record, then skips 7 and processes the 9th record, and so on.

There are a number of advantages to this scheme. It doesn't matter how big the file is, the work is always divided evenly; processes on the same machine will process at roughly the same rate and use the same buffer areas, so you don't incur any excessive I/O overhead. As long as the file hasn't been updated, you can rerun individual threads to recover from failures.
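A rough sketch of that offset scheme using multiprocessing; process_line and the output file names are placeholders for the real per-record work:

```python
from multiprocessing import Process

def process_line(line):
    # Placeholder for the real calculation on one record.
    return line

def worker(path, worker_id, num_workers):
    # Every worker scans the same file but only handles its share of records:
    # worker 0 takes records 0, N, 2N, ...; worker 1 takes 1, N+1, 2N+1, ...
    with open(path, "rb") as f, open(f"out.{worker_id}", "wb") as out:
        for i, line in enumerate(f):
            if i % num_workers != worker_id:
                continue
            out.write(process_line(line))

def run(path, num_workers=8):
    procs = [Process(target=worker, args=(path, wid, num_workers))
             for wid in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    run("big_input.txt")
```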

You don't mention how you are processing the lines; that's possibly the most important piece of info.

Is each line independent? Is the calculation dependent on one line coming before the next? Must they be processed in blocks? How long does the processing for each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing the file size by the count of threads? Or does it grow as you process it?

If the lines are independent and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independently open and seek into the file, and then you must simply coordinate their results, perhaps by waiting for N results to come back into a queue.
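A sketch of that coordination, assuming a naive split of the file into byte ranges (aligning the ranges to line boundaries, as a later answer discusses, is left out here); the per-chunk line count stands in for whatever result each worker would really compute:

```python
import os
from multiprocessing import Process, Queue

def worker(path, start, length, results):
    # Each worker opens the file itself, seeks to its starting address,
    # and reads only its assigned byte range.
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(length)
    total = len(chunk.splitlines())  # stand-in for the real per-line work
    results.put((start, total))

def run(path, num_workers=4):
    size = os.path.getsize(path)
    step = size // num_workers
    results = Queue()
    procs = []
    for i in range(num_workers):
        start = i * step
        length = step if i < num_workers - 1 else size - start
        p = Process(target=worker, args=(path, start, length, results))
        p.start()
        procs.append(p)
    # Coordinate the results simply by waiting for N answers on the queue.
    answers = [results.get() for _ in range(num_workers)]
    for p in procs:
        p.join()
    return answers
```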

If the lines are not independent, the answer will depend highly on the structure of the file.

It depends a lot on the format of your file.

Does it make sense to split it anywhere? Or do you need to split it at a newline? Or do you need to make sure that you split it at the end of an object definition?

Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.

Update: The poster added that he wants to split on newlines. Then I propose the following:

Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first newline. That's your starting point for each process. You don't need to split the file to do this; just seek to the right location in the large file in each process and start reading from there.
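A small sketch of finding those newline-aligned starting points; each process would then seek to its own offset and read up to the next process's offset (or end of file):

```python
import os

def newline_aligned_offsets(path, num_procs=4):
    """Seek to 0%, 25%, 50% and 75% of the file and advance each offset
    (except the first) past the next newline, so every process starts at
    the beginning of a line. The big file is never physically split."""
    size = os.path.getsize(path)
    offsets = []
    with open(path, "rb") as f:
        for i in range(num_procs):
            pos = size * i // num_procs
            if pos == 0:
                offsets.append(0)
                continue
            f.seek(pos)
            f.readline()           # discard the partial line we landed in
            offsets.append(f.tell())
    return offsets
```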

I know you specifically asked about Python, but I will encourage you to look at Hadoop ( http://hadoop.apache.org/ ): it implements the Map and Reduce algorithm, which was specifically designed to address this kind of problem.

Good luck

Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing; some are linked from the article, but you might want to try googling for "python wide finder" or something to find more. (There was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore.)

If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).
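A sketch of that batching idea, assuming a single reader process feeding a bounded multiprocessing.Queue; the batch size and the per-line work are placeholders to tune for the real workload:

```python
from multiprocessing import Process, Queue

BATCH_SIZE = 5000   # lines per queue item; one queue operation amortizes over all of them
SENTINEL = None     # tells a worker there is no more input

def reader(path, queue, num_workers):
    batch = []
    with open(path, "rb") as f:
        for line in f:
            batch.append(line)
            if len(batch) >= BATCH_SIZE:
                queue.put(batch)
                batch = []
    if batch:
        queue.put(batch)
    for _ in range(num_workers):
        queue.put(SENTINEL)

def worker(queue):
    while True:
        batch = queue.get()
        if batch is SENTINEL:
            break
        for line in batch:
            pass  # replace with the real per-line calculation

def run(path, num_workers=4):
    queue = Queue(maxsize=8)  # bounded, so the reader can't race too far ahead
    workers = [Process(target=worker, args=(queue,)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    reader(path, queue, num_workers)
    for w in workers:
        w.join()

if __name__ == "__main__":
    run("big_input.txt")
```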
