
Distributed computing in Python - web crawler

My objective is to build a distributed crawler that processes more than one website and more than one query at a time. To that end, I have built a web crawler in Python using standard packages like 'requests' and 'BeautifulSoup', and it works fine. To make it distributed, I used RabbitMQ, which lets me speed the system up by having more than one process help with the crawl.
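For reference, the per-page fetch-and-parse step with 'requests' and 'BeautifulSoup' looks roughly like this (a simplified sketch; the real crawler does more filtering and error handling):

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    """Download one page and return the links found on it (illustrative only)."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect raw hrefs; a real crawler would also normalize and filter them.
    return [a['href'] for a in soup.find_all('a', href=True)]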

My system works in a workpool model:

  • I have a main server that receives queries and starts a new crawl for each of them.
  • When starting a crawl, some URLs are gathered by feeding the query into a search engine.
  • From then on, the main server sends URLs to the available workers/processes through RabbitMQ and waits to receive more URLs from them (see the sketch after this list).
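The server-to-worker hand-off, sketched here with the pika client for concreteness (not the actual code; the queue name and connection details are assumptions), is roughly:

import pika

def publish_urls(queue_name, urls):
    # One private queue per query; the main server pushes URLs onto it.
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name, durable=True)
    for url in urls:
        channel.basic_publish(exchange='', routing_key=queue_name, body=url)
    connection.close()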

However, I have a huge bottleneck in this architecture, and it is not the main server: RabbitMQ does not let me consume more than one message at a time (the channel.basic_qos() function does not do what I expected). What I want is to keep a private queue for each query (as I have now) and to process both queries at the same time, as fast as possible; in other words, to parallelize the worker code so that each worker handles as many URLs as it can, instead of one URL at a time.
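To illustrate, the worker side looks roughly like the following (a simplified sketch, again written with the pika client for concreteness; the real code is more involved). Each callback handles a single URL, which is exactly the behaviour I want to get away from:

import pika

def start_worker(queue_name):
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name, durable=True)
    channel.basic_qos(prefetch_count=1)  # at most one unacknowledged message per worker

    def on_message(ch, method, properties, body):
        url = body.decode()
        # ... crawl the URL and send any newly found URLs back to the main server ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    channel.start_consuming()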

What should I use to replace RabbitMQ here? I reached out to RabbitMQ's developers and what I want cannot be done with it, so I am looking for a different 'distribution package'. Maybe Kafka?

First, you will have to determine where the limits of your program are. Is it I/O-bound or CPU-bound?

For example, if a simple non-parallel version of your program can already saturate your network connection, there is no point in making a parallel version.

For a web crawler, it is often the network that is the ultimate bottleneck.

But let's assume that you have sufficient network capacity. In that case, I would suggest using a multiprocessing.Pool. The function that acts as the worker takes a URL as input and returns the processed data for that URL. You create a pool and use the imap_unordered method to apply the worker function to a list of URLs:

import multiprocessing
import requests

def worker(url):
    rv = requests.get(url)
    # Do whatever processing is needed.
    result = ...
    return (url, result)

if __name__ == '__main__':
    # Guard so that child processes do not re-run this block on spawn-based platforms.
    urls = ['https://www.foo.com', ...]
    p = multiprocessing.Pool()
    for url, result in p.imap_unordered(worker, urls):
        print('The URL', url, 'returned', result)

By default, the Pool will use as many workers as the CPU has cores. Using more workers is generally not useful.

Note that the return value from the worker has to be sent from the worker process back to the parent process, so it has to be picklable. If you have to return a huge amount of data, that might become an IPC bottleneck. In that case it is probably better to write the data to, e.g., a RAM-based filesystem and just return the filename of the data file.
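A minimal sketch of that variant, assuming a Linux tmpfs mount such as /dev/shm (the path and the use of tempfile are illustrative; any RAM-backed directory works):

import os
import tempfile
import requests

SHM_DIR = '/dev/shm'  # assumed tmpfs mount; adjust for your system

def worker(url):
    rv = requests.get(url)
    data = rv.content  # potentially large payload
    # Write the payload to a RAM-backed file instead of returning it directly.
    fd, path = tempfile.mkstemp(dir=SHM_DIR)
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    # Only the short filename string crosses the process boundary.
    return (url, path)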
