
Python multiprocessing BETWEEN Amazon cloud instances

I'm looking to run a long-running Python analysis process on a few Amazon EC2 instances. The code already runs using the Python multiprocessing module and can take advantage of all the cores on a single machine.

The analysis is completely parallel and the instances don't need to communicate with one another. All of the work is "file-based" and each process works on a single file individually ... so I was planning on just mounting the same S3 volume across all of the nodes.

I was wondering if anyone knew of any tutorials (or had any suggestions) for setting up the multiprocessing environment so I can run it on an arbitrary number of compute instances at the same time.

The Python docs give you a good setup for running multiprocessing on multiple machines via remote managers. Using S3 is a good way to share files across EC2 instances, but with multiprocessing you can also share queues and pass data between nodes, as in the sketch below.
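A minimal sketch of that remote-manager pattern, assuming a shared queue of file names; the hostname, port, file names, and authkey are placeholders you would replace:

```python
# server.py -- run on one instance to expose a shared task queue.
from multiprocessing.managers import BaseManager
import queue

task_queue = queue.Queue()

class QueueManager(BaseManager):
    pass

# Register a callable that hands the queue out to connecting workers.
QueueManager.register('get_tasks', callable=lambda: task_queue)

if __name__ == '__main__':
    for name in ['data1.csv', 'data2.csv']:  # placeholder file list
        task_queue.put(name)
    # Bind on port 50000; open this port in the instances' security group.
    manager = QueueManager(address=('', 50000), authkey=b'change-me')
    manager.get_server().serve_forever()
```

```python
# worker.py -- run on each instance; pulls file names until the queue is empty.
from multiprocessing.managers import BaseManager
import queue

class QueueManager(BaseManager):
    pass

QueueManager.register('get_tasks')

if __name__ == '__main__':
    # 'server-host' stands in for the serving instance's address.
    manager = QueueManager(address=('server-host', 50000), authkey=b'change-me')
    manager.connect()
    tasks = manager.get_tasks()
    while True:
        try:
            filename = tasks.get_nowait()
        except queue.Empty:
            break
        print('processing', filename)  # stand-in for the real analysis
```

Each worker can still fan out locally with a multiprocessing.Pool over the files it pulls, so you keep full use of every core.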

If your tasks fit Hadoop's model, it is a very good way to extract parallelism across machines, but if you need a lot of IPC then building your own solution with multiprocessing isn't that bad.

Just make sure you put your machines in the same security group :-)

I would use dumbo. It is a Python wrapper for Hadoop that is compatible with Amazon Elastic MapReduce. Write a little wrapper around your code to integrate with dumbo; see the sketch below. Note that you probably want a map-only job with no reduce step.
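A rough sketch of what such a wrapper might look like, assuming dumbo.run accepts a mapper alone for a map-only job; analyze_file is a hypothetical stand-in for your own analysis routine:

```python
# analyze.py -- map-only dumbo job, launched with something like:
#   dumbo start analyze.py -input s3://bucket/filelist -output s3://bucket/out
def analyze_file(path):
    return len(path)  # hypothetical stand-in for the real per-file analysis

def mapper(key, value):
    # Hadoop streaming feeds records one at a time; each value is
    # assumed here to be a line naming one input file.
    yield value, analyze_file(value)

if __name__ == '__main__':
    import dumbo
    dumbo.run(mapper)  # no reducer, so the job stays map-only
```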

I've been digging into IPython recently, and it looks like it supports parallel processing across multiple hosts right out of the box:

http://ipython.org/ipython-doc/stable/html/parallel/index.html
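A minimal sketch using that interface, assuming you have started a controller and engines on your instances (e.g. with `ipcluster start`) and that analyze_file again stands in for your own per-file routine:

```python
from IPython.parallel import Client

def analyze_file(path):
    return len(path)  # placeholder for the real analysis

if __name__ == '__main__':
    rc = Client()                       # connect to the running controller
    view = rc.load_balanced_view()      # spread tasks across all engines
    files = ['data1.csv', 'data2.csv']  # e.g. keys listed from your S3 bucket
    results = view.map_sync(analyze_file, files)
    print(results)
```

The engines can live on different EC2 instances, so this scales past one machine without changing the analysis code itself.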
