
How to balance load for web crawling

Assume the following scenario: I have 1000 distinct IP addresses and 50 URLs (webpages). I need to crawl these webpages with certain constraints in mind:

  1. Every URL must be visited by 500 different IP addresses (i.e., 500 visits per URL).
  2. An IP address may visit a given URL only once. E.g., 1.1.1.1 cannot be used to hit http://example.com more than once.
  3. The load among the IPs should stay as balanced as possible throughout the crawl. 1.1.1.1 shouldn't have done 100 crawls while some other IP has only done 4-5, as that isn't balanced.

I'm currently logging every crawl in a MySQL table. So if 1.1.1.1 has visited http://example.com and http://test.com, there would be two entries in the table:

(1.1.1.1, http://example.com) and (1.1.1.1, http://test.com)
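
For concreteness, here is a minimal sketch of what that table could look like. The names crawl_log, ip, and url are hypothetical, and the driver and credentials are placeholders; any MySQL client library works the same way. A composite primary key makes MySQL itself reject a duplicate (ip, url) pair, which enforces constraint 2 at insert time.

    import mysql.connector  # assumed driver; substitute your own

    conn = mysql.connector.connect(user="crawler", database="crawl")  # hypothetical credentials
    cur = conn.cursor()

    # Composite primary key: MySQL rejects a second (ip, url) row,
    # enforcing "one visit per (ip, url)" at insert time.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS crawl_log (
            ip  VARCHAR(45)  NOT NULL,  -- 45 chars also covers IPv6
            url VARCHAR(255) NOT NULL,
            PRIMARY KEY (ip, url)
        )
    """)
    conn.commit()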

My load-balancing strategy is this: before every crawl, find the IP with the fewest crawls done so far and use it.

However, I feel this isn't very efficient, as I'd have to run a grouping query to get the counts and then sort them every time before a crawl.
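
The per-crawl query that strategy implies would look roughly like this (table and column names from the sketch above are assumptions); it re-aggregates and sorts the whole log on every call, which is the repeated cost in question:

    import mysql.connector  # assumed driver, as above

    conn = mysql.connector.connect(user="crawler", database="crawl")  # hypothetical credentials
    cur = conn.cursor()

    # Re-count every IP's crawls, sort, and take the least-used one.
    cur.execute("""
        SELECT ip, COUNT(*) AS cnt
        FROM crawl_log
        GROUP BY ip
        ORDER BY cnt ASC
        LIMIT 1
    """)
    least_used_ip = cur.fetchone()[0]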

What would be some better ways of handling this?

PS: To speed up the crawling, I'm also using multiple threads.

I'd consider putting the IP addresses in a list and handing it to itertools.cycle(). Then you simply give each URL to the next 500 IP addresses you get from itertools.cycle().
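
A minimal single-threaded sketch of that idea, with placeholder ips/urls lists and a crawl() stub standing in for your real fetch logic. Because 500 consecutive draws from a cycle over 1000 addresses can never repeat, each URL automatically gets 500 distinct IPs, and every IP ends up with exactly 25 crawls (50 URLs x 500 visits / 1000 IPs), so all three constraints hold.

    import itertools

    ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # stand-ins for your 1000 addresses
    urls = [f"http://example.com/page{i}" for i in range(50)]  # stand-ins for your 50 URLs

    def crawl(ip, url):
        # placeholder for your real "fetch this url via this ip" routine
        print(f"{ip} -> {url}")

    ip_pool = itertools.cycle(ips)

    for url in urls:
        # 500 consecutive draws from a cycle of 1000 are always distinct,
        # so no per-URL bookkeeping is needed.
        for _ in range(500):
            crawl(next(ip_pool), url)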

One way to multithread this would be to have one thread take the output of cycle() and push it onto a blocking queue. Then you can have other threads that each take a URL and distribute it to the next 500 IPs they get from the queue.
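
A sketch of that layout with the same placeholder names: one feeder thread drives the cycle into a bounded queue.Queue (Python's blocking queue), and each worker thread owns one URL and pulls its 500 IPs from the queue.

    import itertools
    import queue
    import threading

    ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
    urls = [f"http://example.com/page{i}" for i in range(50)]

    def crawl(ip, url):
        print(f"{ip} -> {url}")  # placeholder for the real fetch

    ip_queue = queue.Queue(maxsize=100)  # bounded, so the feeder blocks instead of running ahead

    def feeder():
        # Push IPs in round-robin order forever; workers consume
        # exactly len(urls) * 500 of them in total.
        for ip in itertools.cycle(ips):
            ip_queue.put(ip)  # blocks while the queue is full

    def worker(url):
        for _ in range(500):
            crawl(ip_queue.get(), url)  # blocks until the feeder supplies the next IP

    threading.Thread(target=feeder, daemon=True).start()
    workers = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

One caveat: once many workers interleave on a single queue, a worker's 500 draws are no longer guaranteed to be distinct (another thread could consume exactly a full lap of the cycle in between), so it's worth keeping the unique (ip, url) key on the log table as a backstop for constraint 2.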
