How to web crawl on Google

Question

I need to produce a report on a given keyword by searching for that keyword online.

My plan is for my web crawler to:

Search for the keyword on Google, Bing or Yahoo
Open the pages/links returned by Google, Bing or Yahoo
Build the report from those pages

I want my web crawler to obey the rules. When I looked at the robots.txt files of these search engines, I found that they disallow crawlers from their search pages, for example:

google.com/robots.txt

User-agent: *
Disallow: /search
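
As an illustration of how a crawler could honor a rule like the one above, here is a minimal sketch in plain Java. The class name RobotsRules and its methods are hypothetical, not part of Jsoup. It assumes a simplified reading of robots.txt: only "User-agent: *" groups and plain Disallow prefixes are handled, while Allow lines, wildcards and multi-agent groups are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified robots.txt check (hypothetical helper, not a full parser). */
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow prefixes from groups that start with "User-agent: *". */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean appliesToAll = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.split("#", 2)[0].trim();   // drop trailing comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                 // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                appliesToAll = value.equals("*");
            } else if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With the two lines quoted above as input, isAllowed("/search?q=keyword") would return false, so a rule-obeying crawler would skip that URL.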

I know that if I try to search for keywords on the search engines, my IP might be blocked.

My new plan is for my web crawler to:

Search for the keyword on Google, Bing or Yahoo (at most 2-3 times a day, spread over different times)
Open the pages/links returned by Google, Bing or Yahoo (with a delay of 2-3 minutes before opening each page/link returned by the search engine)
Build the report from those pages
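
The delay between page fetches in this plan could be enforced with a small helper like the sketch below. PoliteThrottle is a hypothetical name, and the 2-minute minimum is simply the lower bound of the delay mentioned above. The caller passes the current time in, which keeps the class easy to test.

```java
import java.time.Duration;
import java.time.Instant;

/** Tracks the last request time and reports how long to wait before the next one. */
class PoliteThrottle {
    private final Duration minDelay;
    private Instant lastRequest;   // null until the first request is recorded

    PoliteThrottle(Duration minDelay) {
        this.minDelay = minDelay;
    }

    /** Remaining wait at 'now' before another request is allowed (zero if none). */
    Duration waitNeeded(Instant now) {
        if (lastRequest == null) {
            return Duration.ZERO;
        }
        Duration remaining = minDelay.minus(Duration.between(lastRequest, now));
        return remaining.isNegative() ? Duration.ZERO : remaining;
    }

    /** Call this right after a page has been requested. */
    void recordRequest(Instant when) {
        lastRequest = when;
    }
}
```

In the crawl loop I would call Thread.sleep(throttle.waitNeeded(Instant.now()).toMillis()) before each page fetch and recordRequest(Instant.now()) right after it.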

Questions

Even with this much care, will Google still block my IP? Is it safe to crawl like this?
Also, what are good techniques for using proxies to hide or change my actual IP address?

PS: I am using Java and Jsoup for web crawling.

Answer 1

Try Selenium for this job. It is built for browser automation, so I don't think your IP will be blocked by any service provider.
