
Multiprocessing HTTP get requests in Python

I have to make numerous (thousands of) HTTP GET requests to a great many websites. This is pretty slow, because some websites may not respond (or take a long time to do so), while others time out. Since I need as many responses as I can get, setting a small timeout (3-5 seconds) is not in my favour.

I have yet to do any kind of multiprocessing or multi-threading in Python, and I've been reading the documentation for a good while. Here's what I have so far:

import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Pool

errors = 0

def get_site_content(site):
    try:
        # start = time.time()
        response = requests.get(site, allow_redirects=True, timeout=5)
        response.raise_for_status()
        content = response.text
    except Exception:
        global errors
        errors += 1
        return ''
    soup = BeautifulSoup(content, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()

    return text

sites = ["http://www.example.net", ...]

pool = Pool(processes=5)
results = pool.map(get_site_content, sites)
print(results)

Now, I want the results that are returned to be joined somehow. This allows for two variations:

  1. Each process has a local list/queue containing the content it has accumulated, which is then joined with the other queues/lists to form a single result containing all the content for all sites.

  2. Each process writes to a single global queue as it goes along. This would entail some locking mechanism for concurrency checks.

Would multiprocessing or multithreading be the better choice here? How would I accomplish the above with either of the approaches in Python?


Edit:

I did attempt something like the following:

# global
queue = []
with Pool(processes=5) as pool:
    queue.append(pool.map(get_site_content, sites))

print(queue)

However, this gives me the following error:

with Pool(processes = 4) as pool:
AttributeError: __exit__

I don't quite understand this error. I'm also having a little trouble understanding what exactly pool.map does, beyond applying the function to every object in the second (iterable) parameter. Does it return anything? If not, do I append to the global queue from within the function?

I had a similar assignment at university (implementing a multiprocess web crawler) and used the process-safe Queue class from Python's multiprocessing library, which does all the magic with locks and concurrency checks for you. The example from the docs:

import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    print(q.get())
    p.join()

However, I had to write a separate Process subclass for this to work the way I wanted. I didn't use a Pool of processes; instead, I checked memory usage and kept spawning processes until a preset memory threshold was reached.
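For the question's use case, a very rough sketch of that idea might look like the following. It reuses the get_site_content function and sites list from the question; the SiteProcess class name and the 80% threshold are made up for illustration, and the memory check leans on the third-party psutil package:

import multiprocessing as mp
import time
import psutil  # third-party; used here only for the memory check

class SiteProcess(mp.Process):
    """Hypothetical worker: fetches one site and puts (site, text) on a shared queue."""
    def __init__(self, site, result_queue):
        super(SiteProcess, self).__init__()
        self.site = site
        self.result_queue = result_queue

    def run(self):
        self.result_queue.put((self.site, get_site_content(self.site)))

if __name__ == '__main__':
    result_queue = mp.Queue()
    workers = []
    for site in sites:
        # made-up threshold: hold off spawning while system memory use is above 80%
        while psutil.virtual_memory().percent > 80:
            time.sleep(0.5)
        p = SiteProcess(site, result_queue)
        p.start()
        workers.append(p)

    results = [result_queue.get() for _ in sites]  # collect one (site, text) pair per site
    for p in workers:
        p.join()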

pool.map starts 'n' processes that take a function and run it on items from the iterable. When such a process finishes and returns, the returned value is stored in the result list at the same position as the corresponding item in the input iterable.

e.g. if a function is written to calculate the square of a number, pool.map can be used to run this function on a list of numbers:

from multiprocessing import Pool

def square_this(x):
    square = x**2
    return square

input_iterable = [2, 3, 4]
pool = Pool(processes=2)  # initialize a pool of 2 processes
result = pool.map(square_this, input_iterable)  # use the pool to run the function on the items in the iterable
pool.close()  # this means that no more tasks will be added to the pool
pool.join()   # this blocks the program till the function has run on all the items
print(result)

...>> [4, 9, 16]
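Applied to the code in the question, pool.map therefore already does the joining being asked about: it returns a single list with one entry per site, in the same order as the sites list, so the texts can be paired back with their URLs afterwards. A minimal sketch, reusing the get_site_content function and sites list from the question:

from multiprocessing import Pool

pool = Pool(processes=5)
results = pool.map(get_site_content, sites)  # one entry per site, in input order
pool.close()
pool.join()
site_text = dict(zip(sites, results))        # pair each URL with the text fetched for it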

The pool.map technique may not be ideal in your case, though, since it blocks until all the processes finish. That is, if a website does not respond or takes too long to respond, your program will be stuck waiting for it. Instead, try subclassing multiprocessing.Process in your own class that polls these websites, and use Queues to access the results. When you have a satisfactory number of responses, you can stop all the processes without having to wait for the remaining requests to finish.
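As a rough sketch of that suggestion (not a drop-in implementation): a handful of Process subclasses pull URLs from a shared task queue, push fetched text to a result queue, and the main process stops them once it has enough responses. It reuses get_site_content and sites from the question; the SiteWorker name, the worker count and the 'wanted' threshold are made up for illustration:

import multiprocessing as mp

class SiteWorker(mp.Process):
    """Hypothetical worker: pulls URLs from a task queue and pushes fetched text to a result queue."""
    def __init__(self, task_queue, result_queue):
        super(SiteWorker, self).__init__()
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            site = self.task_queue.get()
            if site is None:               # sentinel value: no more work
                break
            self.result_queue.put(get_site_content(site))

if __name__ == '__main__':
    tasks = mp.Queue()
    results_q = mp.Queue()
    for site in sites:
        tasks.put(site)
    workers = [SiteWorker(tasks, results_q) for _ in range(5)]
    for w in workers:
        w.start()

    wanted = 1000                          # made-up threshold: stop once this many responses arrive
    collected = []
    while len(collected) < wanted and len(collected) < len(sites):
        collected.append(results_q.get())

    for w in workers:                      # enough responses collected: stop the remaining requests
        w.terminate()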
