using multiple threads in Python

Question

I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?

I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach ( Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:

data_list = get_data(...)
output = []
for datum in data:
    output.append(get_URL_data(datum))
return output

There's no other shared state.

I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.

Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.

Answer 1

For your specific task I would recommend a multiprocessing worker pool . You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you ready every unit of work (in your case this would be a list of URLs) in a list and give it to the worker pool.

Your output will be a list of the return values of your worker function for every item of work in your original array. All the cool multi-processing goodness will happen in the background. There is of course other ways of working with the worker pool as well, but this is my favourite one.

Happy multi-processing!

Answer 2

The fastest and most efficient method of doing IO bound tasks like this is an asynchronous event loop. The libcurl can do this, and there is a Python wrapper for that called pycurl. Using it's "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetchs as fast as one.

However, the API is quite low-level and difficult to use. There is a simplifying wrapper here , which you can use as an example.

Answer 3

The best approach I can think of in your use case will be to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work and then go get some more work. This way you can finely control the number of threads working on your URLs.

So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.

Create a thread pool, which create the number of threads you specify, fetches work from the WorkQueue and assigns it to a thread. Each time a thread finishes and returns you check if the work queues has more work and accordingly assign work to that thread again. You may also want to put a hook so that every time work is added to the work queue, your threads assigns it to a free thread if available.

using multiple threads in Python

Question

3 answers

solution1
5 ACCPTED 2012-07-24 20:03:54

solution2
2 2012-07-24 20:12:48

solution3
1 2012-07-24 20:09:14

using multiple threads in Python

Question

3 answers

solution1 5 ACCPTED 2012-07-24 20:03:54

solution2 2 2012-07-24 20:12:48

solution3 1 2012-07-24 20:09:14

solution1
5 ACCPTED 2012-07-24 20:03:54

solution2
2 2012-07-24 20:12:48

solution3
1 2012-07-24 20:09:14