Assume the following scenario: I have 1000 distinct IP addresses and 50 URLs (webpages). I need to crawl these webpages with certain constraints in mind:

1. 1.1.1.1 cannot be used to hit the URL http://example.com more than once.
2. 1.1.1.1 shouldn't have crawled 100 times while some other IP has only done 4-5 crawls, as this isn't balanced.

I'm currently logging every crawl entry in a MySQL table. So if 1.1.1.1 has visited http://example.com and http://test.com, there would be 2 entries in the table: (1.1.1.1, http://example.com) and (1.1.1.1, http://test.com).
My load-balancing strategy is this: before every crawl, find the IP with the fewest crawls done so far and use that.
However, I feel this isn't very efficient, as I'd have to run a grouping query to get the counts and then sort them every time before I do a crawl.
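For reference, the per-crawl lookup described above might look like the following minimal sketch. The table and column names are assumptions, and sqlite3 stands in for MySQL only to keep the example self-contained:

```python
import sqlite3

# Hypothetical crawl log: every crawl is stored as an (ip, url) row.
# sqlite3 is used here purely for illustration; the question uses MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_log (ip TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO crawl_log VALUES (?, ?)",
    [("1.1.1.1", "http://example.com"),
     ("1.1.1.1", "http://test.com"),
     ("2.2.2.2", "http://example.com")],
)

# The per-crawl query the question describes: group by IP, count,
# sort ascending, and take the least-used IP.
row = conn.execute(
    "SELECT ip, COUNT(*) AS n FROM crawl_log "
    "GROUP BY ip ORDER BY n ASC LIMIT 1"
).fetchone()
```

Running this query before every single crawl is what makes the approach expensive: the grouping and sorting work grows with the size of the log table.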
What would be some better ways of handling this?
PS: To speed up the crawling, I'm using multiple threads too.
I'd consider putting the IP addresses in a list and passing that to itertools.cycle(). Then you simply give each URL to the next 500 IP addresses you get from itertools.cycle().
One way to multithread this would be to have a single thread take the output from cycle() and push it onto a blocking queue. Other threads can then each take a URL and distribute it to the next 500 IPs they get from the queue.
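A minimal sketch of that producer/consumer arrangement, with made-up IPs and URLs and a small per-URL count (2 instead of 500) so the demo stays short:

```python
import itertools
import queue
import threading

# Stand-ins for the real 1000 IPs and 50 URLs.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
urls = ["http://example.com", "http://test.com"]

# Bounded queue: Queue.put() blocks when full, so the producer
# never runs far ahead of the crawler threads.
ip_queue = queue.Queue(maxsize=100)

def producer():
    # One thread cycles through the IP list forever and feeds the queue;
    # itertools.cycle() guarantees round-robin (hence balanced) usage.
    for ip in itertools.cycle(ips):
        ip_queue.put(ip)

threading.Thread(target=producer, daemon=True).start()

def crawl(url, ips_per_url=2):
    # Take the next N IPs off the queue and crawl the URL with each one
    # (500 in the real setup; 2 here for the demo).
    used = [ip_queue.get() for _ in range(ips_per_url)]
    # ... perform the actual HTTP requests using these source IPs ...
    return used

assignments = {url: crawl(url) for url in urls}
```

Because the cycle hands out IPs strictly in round-robin order, no IP can fall far behind the others, which removes the need for the per-crawl MySQL grouping query; the database can remain purely an audit log.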