
How to balance load for web crawling

Assume the following scenario: I have 1000 distinct IP addresses and 50 URLs (webpages). I need to crawl these webpages with certain constraints in mind:

  1. Every URL must be visited by 500 different IP addresses (i.e., 500 visits per URL).
  2. An IP address may visit a given URL only once. E.g., 1.1.1.1 cannot be used to hit http://example.com more than once.
  3. The load among the IPs should stay as balanced as possible throughout the crawl. 1.1.1.1 shouldn't have done 100 crawls while some other IP has only done 4-5, as that isn't balanced.

I'm currently logging every crawl in a MySQL table. So if 1.1.1.1 has visited http://example.com and http://test.com, there would be two entries in the table:

(1.1.1.1, http://example.com) and (1.1.1.1, http://test.com)
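
For concreteness, here is a minimal sketch of what that table could look like. The names crawl_log, ip, and url are hypothetical, and the driver and credentials are placeholders; any MySQL client library works the same way. A composite primary key makes MySQL itself reject a duplicate (ip, url) pair, which enforces constraint 2 at insert time.

    import mysql.connector  # assumed driver; substitute your own

    conn = mysql.connector.connect(user="crawler", database="crawl")  # hypothetical credentials
    cur = conn.cursor()

    # Composite primary key: MySQL rejects a second (ip, url) row,
    # enforcing "one visit per (ip, url)" at insert time.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS crawl_log (
            ip  VARCHAR(45)  NOT NULL,  -- 45 chars also covers IPv6
            url VARCHAR(255) NOT NULL,
            PRIMARY KEY (ip, url)
        )
    """)
    conn.commit()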

My load-balancing strategy is this: before every crawl, find the IP with the fewest crawls done so far and use it.

However, I feel this isn't very efficient, as I'd have to run a grouping query to get the counts and then sort them every time before a crawl.
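
The per-crawl query that strategy implies would look roughly like this (table and column names from the sketch above are assumptions); it re-aggregates and sorts the whole log on every call, which is the repeated cost in question:

    import mysql.connector  # assumed driver, as above

    conn = mysql.connector.connect(user="crawler", database="crawl")  # hypothetical credentials
    cur = conn.cursor()

    # Re-count every IP's crawls, sort, and take the least-used one.
    cur.execute("""
        SELECT ip, COUNT(*) AS cnt
        FROM crawl_log
        GROUP BY ip
        ORDER BY cnt ASC
        LIMIT 1
    """)
    least_used_ip = cur.fetchone()[0]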

What would be some better ways of handling this?

PS: To speed up the crawling, I'm also using multiple threads.

I'd consider putting the IP addresses in a list and handing it to itertools.cycle(). Then you simply give each URL to the next 500 IP addresses you get from itertools.cycle().
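
A minimal single-threaded sketch of that idea, with placeholder ips/urls lists and a crawl() stub standing in for your real fetch logic. Because 500 consecutive draws from a cycle over 1000 addresses can never repeat, each URL automatically gets 500 distinct IPs, and every IP ends up with exactly 25 crawls (50 URLs x 500 visits / 1000 IPs), so all three constraints hold.

    import itertools

    ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # stand-ins for your 1000 addresses
    urls = [f"http://example.com/page{i}" for i in range(50)]  # stand-ins for your 50 URLs

    def crawl(ip, url):
        # placeholder for your real "fetch this url via this ip" routine
        print(f"{ip} -> {url}")

    ip_pool = itertools.cycle(ips)

    for url in urls:
        # 500 consecutive draws from a cycle of 1000 are always distinct,
        # so no per-URL bookkeeping is needed.
        for _ in range(500):
            crawl(next(ip_pool), url)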

One way to multithread this would be to have one thread take the output of cycle() and push it onto a blocking queue. Then you can have other threads that each take a URL and distribute it to the next 500 IPs they get from the queue.
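
A sketch of that layout with the same placeholder names: one feeder thread drives the cycle into a bounded queue.Queue (Python's blocking queue), and each worker thread owns one URL and pulls its 500 IPs from the queue.

    import itertools
    import queue
    import threading

    ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
    urls = [f"http://example.com/page{i}" for i in range(50)]

    def crawl(ip, url):
        print(f"{ip} -> {url}")  # placeholder for the real fetch

    ip_queue = queue.Queue(maxsize=100)  # bounded, so the feeder blocks instead of running ahead

    def feeder():
        # Push IPs in round-robin order forever; workers consume
        # exactly len(urls) * 500 of them in total.
        for ip in itertools.cycle(ips):
            ip_queue.put(ip)  # blocks while the queue is full

    def worker(url):
        for _ in range(500):
            crawl(ip_queue.get(), url)  # blocks until the feeder supplies the next IP

    threading.Thread(target=feeder, daemon=True).start()
    workers = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

One caveat: once many workers interleave on a single queue, a worker's 500 draws are no longer guaranteed to be distinct (another thread could consume exactly a full lap of the cycle in between), so it's worth keeping the unique (ip, url) key on the log table as a backstop for constraint 2.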
