
Fastest way to process large files in Python

We have about 500GB of images in various directories that we need to process. Each image is about 4MB, and we have a Python script that processes each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process, depending on size.

We have at our disposal a 2.2GHz quad-core processor and 16GB of RAM on a GNU/Linux OS. The current script uses only one core. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library, but I'm not sure how to use it.

Will starting multiple Python processes to run the script take advantage of the other cores?

Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.
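
If you would rather stay inside Python instead of wrapping the script with parallel(1), here is a rough equivalent as a sketch (the extract_metadata function is a hypothetical stand-in for the existing per-image work) that caps the pool at about two worker processes per core:

import os
from concurrent.futures import ProcessPoolExecutor

def extract_metadata(path):
    # Hypothetical: read one image's metadata and store it in the database.
    return True

def process_all(paths):
    # Roughly two worker processes per core, rather than one process per file.
    workers = 2 * (os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(extract_metadata, paths))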

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.

That might work. Also, have a look at the Python binding for ZeroMQ; it makes distributed processing pretty easy.
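
For illustration, here is a minimal pyzmq sketch using the PUSH/PULL pattern, one possible way to farm file paths out to workers on other machines; the port number and the process_image function are assumptions, not something from the question:

import zmq

def distributor(paths):
    # Bind a PUSH socket and stream file paths to whichever workers connect.
    context = zmq.Context()
    sender = context.socket(zmq.PUSH)
    sender.bind("tcp://*:5557")
    for path in paths:
        sender.send_string(path)

def worker(distributor_address="tcp://localhost:5557"):
    # Each worker (possibly on another machine) pulls paths and processes them.
    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect(distributor_address)
    while True:
        path = receiver.recv_string()
        process_image(path)  # hypothetical function that stores the metadata in the DB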

I've taken a look at the multiprocessing library, but I'm not sure how to use it.

Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then

import multiprocessing
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))

will process all the directories in parallel. You can also do the parallelism at the file level if you want; that needs just a bit more tinkering (see the sketch below).

Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
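
For illustration, a minimal sketch of both ideas: parallelism over individual files rather than directories, with a try/except wrapper so that one bad file does not abort the whole run. process_file and the directory walk are assumptions, not part of the original answer:

import os
import multiprocessing

def process_file(path):
    # Hypothetical: read the image's metadata and store it in the database.
    ...

def safe_process(path):
    # Catch per-file errors so the pool keeps going; failures are reported at the end.
    try:
        process_file(path)
        return path, True
    except Exception:
        return path, False

def process_tree(root):
    paths = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(root)
             for name in names]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        failures = [path for path, ok in pool.imap_unordered(safe_process, paths) if not ok]
    return failures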

Starting independent Python processes is ideal. There will be no lock contention between the processes, and the OS will schedule them to run concurrently.

You may want to experiment to find the ideal number of instances; it may be more or less than the number of cores. There will be contention for the disk and cache memory, but on the other hand one process can run while another is waiting for I/O.
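
One simple way to run that experiment, as a sketch, is to time a few pool sizes over a representative sample; process_directory here is a hypothetical stand-in for the existing per-directory work:

import time
import multiprocessing

def benchmark(sample_dirs):
    # Time a few pool sizes to find the sweet spot between CPU use and disk contention.
    cores = multiprocessing.cpu_count()
    for workers in (max(1, cores // 2), cores, cores * 2):
        start = time.perf_counter()
        with multiprocessing.Pool(workers) as pool:
            pool.map(process_directory, sample_dirs)  # hypothetical per-directory function
        print(workers, 'workers:', round(time.perf_counter() - start, 1), 'seconds')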

You can use multiprocessing's Pool to create worker processes and increase performance. Say you have a function handle_file that processes a single image. A plain loop will use at most 100% of one core; to utilize multiple cores, multiprocessing.Pool creates subprocesses for you and distributes your tasks to them. Here is an example:

import os
import multiprocessing

def handle_file(path):
    # Process a single file here (read metadata, store it in the database, ...).
    print('Do something to handle file ...', path)
    return True  # report success so the results can be aggregated

def run_multiprocess():
    tasks = []

    for filename in os.listdir('.'):
        tasks.append(filename)
        print('Create task', filename)

    # 8 worker processes; imap_unordered distributes the tasks among them.
    pool = multiprocessing.Pool(8)
    result = all(pool.imap_unordered(handle_file, tasks))
    print('Finished, result=', result)

def run_one_process():
    # Single-process version of the same loop, for comparison.
    for filename in os.listdir('.'):
        handle_file(filename)

if __name__ == '__main__':
    run_one_process()
    run_multiprocess()

run_one_process is the single-core way to process the data: simple, but slow. run_multiprocess, on the other hand, creates 8 worker processes and distributes the tasks among them; it would be roughly 8 times faster if you have 8 cores. I suggest setting the worker count to either the number of cores or twice that number, and trying both to see which configuration is faster.

For more advanced distributed computing, you can use ZeroMQ, as larsmans mentioned. It's hard to grasp at first, but once you understand it, you can design a very efficient distributed system to process your data. In your case, I think one REQ socket with multiple REP sockets would be good enough.


Hope this helps.

See the answer to this question.

If the app can process ranges of input data, then you can launch 4 instances of the app with different ranges of input data to process, and then combine the results after they are all done.

Even though that question looks to be Windows specific, it applies to single-threaded programs on all operating systems.

WARNING: Beware that this workload will be I/O-bound, and too much concurrent access to your hard drive will actually cause the processes as a group to execute slower than sequential processing because of contention for the I/O resource.
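
As a sketch of that approach, assuming the existing script (here called process_images.py, a made-up name) accepts directories as command-line arguments:

import subprocess
import sys

def split_and_launch(directories, instances=4):
    # Split the directory list into roughly equal chunks and launch one
    # independent instance of the existing script per chunk.
    chunks = [directories[i::instances] for i in range(instances)]
    procs = [subprocess.Popen([sys.executable, 'process_images.py', *chunk])
             for chunk in chunks if chunk]
    for proc in procs:
        proc.wait()  # results can be combined once every instance has finished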

If you are reading a large number of files and saving metadata to a database, your program does not need more cores.

Your process is likely I/O-bound, not CPU-bound. Using Twisted with proper deferreds and callbacks would likely outperform any solution that sought to enlist 4 cores.

I think in this scenario it would make perfect sense to use Celery.
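
For illustration only, a minimal Celery sketch; the broker URL, module layout and process_image task are assumptions, and a broker such as Redis or RabbitMQ would have to be running:

from celery import Celery

# Assumed broker URL; point this at your own Redis or RabbitMQ instance.
app = Celery('images', broker='redis://localhost:6379/0')

@app.task
def process_image(path):
    # Read the image's metadata and store it in the database.
    ...

# Producer side: queue one task per file. Workers started on any machine
# with the `celery worker` command will pick the tasks up from the broker.
# for path in paths:
#     process_image.delay(path)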
