
Multiprocessing HTTP get requests in Python

I have to make numerous (thousands of) HTTP GET requests to a great many websites. This is pretty slow, because some websites may not respond (or take a long time to do so), while others time out. Since I need as many responses as I can get, setting a small timeout (3-5 seconds) is not in my favour.

I have yet to do any kind of multiprocessing or multi-threading in Python, and I've been reading the documentation for a good while. Here's what I have so far:

import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Pool

errors = 0

def get_site_content(site):
    try:
        # start = time.time()
        response = requests.get(site, allow_redirects=True, timeout=5)
        response.raise_for_status()
        content = response.text
    except Exception:
        global errors
        errors += 1
        return ''
    soup = BeautifulSoup(content, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()

    return text

sites = ["http://www.example.net", ...]

pool = Pool(processes=5)
results = pool.map(get_site_content, sites)
print(results)

Now, I want the results that are returned to be joined somehow. This allows for two variations:

  1. Each process has a local list/queue containing the content it has accumulated, which is then joined with the other queues/lists to form a single result containing all the content for all sites.

  2. Each process writes to a single global queue as it goes along. This would entail some locking mechanism for concurrency checks.

Would multiprocessing or multithreading be the better choice here? How would I accomplish the above with either of the approaches in Python?


Edit:

I did attempt something like the following:

# global
queue = []
with Pool(processes=5) as pool:
    queue.append(pool.map(get_site_content, sites))

print(queue)

However, this gives me the following error:

with Pool(processes = 4) as pool:
AttributeError: __exit__

I don't quite understand this error. I'm also having a little trouble understanding what exactly pool.map does, beyond applying the function to every object in the second (iterable) parameter. Does it return anything? If not, do I append to the global queue from within the function?

I had a similar assignment at university (implementing a multiprocess web crawler) and used the process-safe Queue class from Python's multiprocessing library, which does all the magic with locks and concurrency checks for you. The example from the docs:

import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    print(q.get())
    p.join()

However, I had to write a separate Process subclass for this to work the way I wanted. I didn't use a Pool of processes; instead, I checked memory usage and kept spawning processes until a preset memory threshold was reached.
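For the question's use case, a very rough sketch of that idea might look like the following. It reuses the get_site_content function and sites list from the question; the SiteProcess class name and the 80% threshold are made up for illustration, and the memory check leans on the third-party psutil package:

import multiprocessing as mp
import time
import psutil  # third-party; used here only for the memory check

class SiteProcess(mp.Process):
    """Hypothetical worker: fetches one site and puts (site, text) on a shared queue."""
    def __init__(self, site, result_queue):
        super(SiteProcess, self).__init__()
        self.site = site
        self.result_queue = result_queue

    def run(self):
        self.result_queue.put((self.site, get_site_content(self.site)))

if __name__ == '__main__':
    result_queue = mp.Queue()
    workers = []
    for site in sites:
        # made-up threshold: hold off spawning while system memory use is above 80%
        while psutil.virtual_memory().percent > 80:
            time.sleep(0.5)
        p = SiteProcess(site, result_queue)
        p.start()
        workers.append(p)

    results = [result_queue.get() for _ in sites]  # collect one (site, text) pair per site
    for p in workers:
        p.join()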

pool.map starts 'n' processes that take a function and run it on items from the iterable. When such a process finishes and returns, the returned value is stored in the result list at the same position as the corresponding item in the input iterable.

e.g. if a function is written to calculate the square of a number, pool.map can be used to run this function on a list of numbers:

from multiprocessing import Pool

def square_this(x):
    square = x**2
    return square

input_iterable = [2, 3, 4]
pool = Pool(processes=2)  # initialize a pool of 2 processes
result = pool.map(square_this, input_iterable)  # use the pool to run the function on the items in the iterable
pool.close()  # this means that no more tasks will be added to the pool
pool.join()   # this blocks the program till the function has run on all the items
print(result)

...>> [4, 9, 16]
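Applied to the code in the question, pool.map therefore already does the joining being asked about: it returns a single list with one entry per site, in the same order as the sites list, so the texts can be paired back with their URLs afterwards. A minimal sketch, reusing the get_site_content function and sites list from the question:

from multiprocessing import Pool

pool = Pool(processes=5)
results = pool.map(get_site_content, sites)  # one entry per site, in input order
pool.close()
pool.join()
site_text = dict(zip(sites, results))        # pair each URL with the text fetched for it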

The pool.map technique may not be ideal in your case, though, since it blocks until all the processes finish. That is, if a website does not respond or takes too long to respond, your program will be stuck waiting for it. Instead, try subclassing multiprocessing.Process in your own class that polls these websites, and use Queues to access the results. When you have a satisfactory number of responses, you can stop all the processes without having to wait for the remaining requests to finish.
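As a rough sketch of that suggestion (not a drop-in implementation): a handful of Process subclasses pull URLs from a shared task queue, push fetched text to a result queue, and the main process stops them once it has enough responses. It reuses get_site_content and sites from the question; the SiteWorker name, the worker count and the 'wanted' threshold are made up for illustration:

import multiprocessing as mp

class SiteWorker(mp.Process):
    """Hypothetical worker: pulls URLs from a task queue and pushes fetched text to a result queue."""
    def __init__(self, task_queue, result_queue):
        super(SiteWorker, self).__init__()
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            site = self.task_queue.get()
            if site is None:               # sentinel value: no more work
                break
            self.result_queue.put(get_site_content(site))

if __name__ == '__main__':
    tasks = mp.Queue()
    results_q = mp.Queue()
    for site in sites:
        tasks.put(site)
    workers = [SiteWorker(tasks, results_q) for _ in range(5)]
    for w in workers:
        w.start()

    wanted = 1000                          # made-up threshold: stop once this many responses arrive
    collected = []
    while len(collected) < wanted and len(collected) < len(sites):
        collected.append(results_q.get())

    for w in workers:                      # enough responses collected: stop the remaining requests
        w.terminate()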
