
Fastest way to process large files in Python

We have about 500GB of images in various directories that we need to process. Each image is about 4MB in size, and we have a Python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process depending on size.

We have at our disposal a 2.2GHz quad-core processor and 16GB of RAM on a GNU/Linux OS. The current script is utilizing only one processor. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but am not sure how I can utilize it.

Will starting multiple Python processes to run the script take advantage of the other cores?

Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.

That might work. Also, have a look at the Python bindings for ZeroMQ; they make distributed processing pretty easy.

I've taken a look at the multiprocessing library but am not sure how I can utilize it.

Define a function, say process, that reads the images in a single directory, connects to the database, and stores the metadata. Have it return a boolean indicating success or failure. Let directories be the list of directories to process. Then

import multiprocessing

# one worker process per core; process() and directories are defined above
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))

will process all the directories in parallel. You can also do the parallelism at the file level if you want; that needs just a bit more tinkering.

Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
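For concreteness, here is a hypothetical sketch of what process might look like, assuming Pillow for reading image metadata and sqlite3 for storage; the question specifies neither, so the library choices, the table schema and the metadata.db path are all placeholders.

import os
import sqlite3
from PIL import Image  # assumption: Pillow is installed

def process(directory):
    # Read basic metadata from every image in the directory and store it,
    # returning True on success and False on any failure.
    try:
        conn = sqlite3.connect('metadata.db')  # placeholder database
        conn.execute('CREATE TABLE IF NOT EXISTS images'
                     ' (path TEXT, width INTEGER, height INTEGER, format TEXT)')
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            with Image.open(path) as img:
                width, height = img.size
                conn.execute('INSERT INTO images VALUES (?, ?, ?, ?)',
                             (path, width, height, img.format))
        conn.commit()
        conn.close()
        return True
    except Exception:
        return False

Note that several worker processes writing to a single SQLite file will contend for its write lock, so a client/server database is a better fit for the parallel version.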

Starting independent Python processes is ideal. There will be no lock contention between the processes, and the OS will schedule them to run concurrently.

You may want to experiment to see what the ideal number of instances is; it may be more or fewer than the number of cores. There will be contention for the disk and cache memory, but on the other hand you may get one process to run while another is waiting for I/O.
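One way to run that experiment is to time the same batch of work with different pool sizes. A rough sketch, where worker_counts, work, and items are placeholders for your own candidate counts, per-image function, and file list:

import time
import multiprocessing

def benchmark(worker_counts, work, items):
    # Run the same workload with each pool size and report the wall-clock
    # time, to see where adding workers stops helping (or starts hurting).
    for n in worker_counts:
        pool = multiprocessing.Pool(n)
        start = time.time()
        pool.map(work, items)
        pool.close()
        pool.join()
        print('%d workers: %.1f seconds' % (n, time.time() - start))

For example, benchmark([2, 4, 8, 16], process, directories) would compare four configurations on the process function and directory list from the earlier answer.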

You can use multiprocessing's Pool to create worker processes and improve performance. Let's say you have a function handle_file for processing an image. If you use a simple loop, it can use at most 100% of one core. To utilize multiple cores, the Pool creates subprocesses for you and distributes your tasks to them. Here is an example:

import os
import multiprocessing

def handle_file(path):
    print('Do something to handle file ...', path)
    return True  # report success so the all() check below is meaningful

def run_multiprocess():
    tasks = []

    for filename in os.listdir('.'):
        tasks.append(filename)
        print('Create task', filename)

    # 8 worker processes; tune this number to your machine (see below)
    pool = multiprocessing.Pool(8)
    result = all(pool.imap_unordered(handle_file, tasks))
    print('Finished, result=', result)

def run_one_process():
    for filename in os.listdir('.'):
        handle_file(filename)

if __name__ == '__main__':
    run_one_process()
    run_multiprocess()

run_one_process is the single-core way to process the data: simple, but slow. On the other hand, run_multiprocess creates 8 worker processes and distributes the tasks to them. It would be about 8 times faster if you have 8 cores. I suggest you set the worker count to the number of your cores, or to twice that; you can try both and see which configuration is faster.
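For example, instead of hard-coding 8, you could derive the pool size from the machine. A small sketch of the two configurations suggested above:

import multiprocessing

workers = multiprocessing.cpu_count()        # one worker per core
# workers = 2 * multiprocessing.cpu_count()  # or two workers per core
pool = multiprocessing.Pool(workers)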

For advanced distributed computing, you can use ZeroMQ as larsmans mentioned. It's hard to understand at first, but once you understand it, you can design a very efficient distributed system to process your data. In your case, I think one REQ with multiple REP would be good enough.
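Purely as an illustration, a minimal pyzmq sketch of that REQ/REP shape (assuming pyzmq is installed; the port, host name, and path list are placeholders, and a REQ socket connected to several workers will round-robin requests among them):

# worker.py -- run one per core/machine; binds a REP socket and handles paths
import zmq

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind('tcp://*:5555')  # placeholder port

while True:
    path = socket.recv_string()  # receive a file path from the sender
    # ... read the image metadata and store it in the database here ...
    socket.send_string('done ' + path)

# sender.py -- REQ side: sends paths out and waits for each acknowledgement
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect('tcp://worker-host:5555')  # placeholder address; connect() to
                                          # more workers to fan requests out

for path in ['/data/dir1/a.jpg']:  # placeholder list of paths
    socket.send_string(path)
    print(socket.recv_string())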


Hope this would be helpful.

See the answer to this question.

If the app can process ranges of input data, then you can launch 4 instances of the app with different ranges of input data to process, and then combine the results after they are all done.
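A rough sketch of that idea, assuming the processing script (process_images.py here, a placeholder name) accepts a list of directories on its command line:

import subprocess

directories = ['dir1', 'dir2', 'dir3', 'dir4', 'dir5', 'dir6', 'dir7', 'dir8']  # placeholder list
n = 4

# Split the directory list into 4 roughly equal ranges, launch one
# instance of the script per range, and wait for all of them to finish.
chunks = [directories[i::n] for i in range(n)]
procs = [subprocess.Popen(['python', 'process_images.py'] + chunk) for chunk in chunks]
for p in procs:
    p.wait()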

Even though that question looks to be Windows-specific, it applies to single-threaded programs on all operating systems.

WARNING: Beware that this process will be I/O bound, and too much concurrent access to your hard drive will actually cause the processes as a group to execute more slowly than sequential processing because of contention for the I/O resource.

If you are reading a large number of files and saving metadata to a database, your program does not need more cores.

Your process is likely I/O bound, not CPU bound. Using Twisted with proper deferreds and callbacks would likely outperform any solution that sought to enlist 4 cores.
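The answer doesn't include code; purely as an illustration, a rough sketch of one way that shape could look, using Twisted's thread pool via deferToThread to keep the blocking file and database work off the reactor (handle_file is a placeholder for the per-file work):

import os
from twisted.internet import reactor, defer
from twisted.internet.threads import deferToThread

def handle_file(path):
    # placeholder: read the image's metadata and write it to the database
    return True

def main():
    # Run the blocking per-file work in Twisted's thread pool and stop
    # the reactor once every file has been handled.
    deferreds = [deferToThread(handle_file, name) for name in os.listdir('.')]
    d = defer.gatherResults(deferreds)
    d.addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(main)
reactor.run()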

I think in this scenario it would make perfect sense to use Celery.
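For reference, a minimal sketch of what a Celery task for this might look like; the broker URL, the task body, and the module name tasks.py are all placeholders:

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # placeholder broker URL

@app.task
def process_image(path):
    # placeholder: read the image's metadata and store it in the database
    return True

You would then start one or more workers with celery -A tasks worker and queue files by calling process_image.delay(path) from a small driver script.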
