
How to web crawl on Google

My requirement is to make a report on a given keyword by searching that keyword online.

My plan is that my webcrawler will

  1. Search for the keyword on Google, Bing, or Yahoo
  2. Open the pages/links returned by the search engine
  3. Build the report from those pages.

I want my webcrawler to obey the rules. When I looked at the robots.txt of these websites, I found that the search engines disallow crawlers from their search pages, for example:

google.com/robots.txt

User-agent: *
Disallow: /search
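To make the rule above concrete, here is a minimal sketch of how a crawler could check a path against robots.txt before fetching it. The class name `RobotsRules` and its methods are hypothetical, not part of Jsoup or any library. It only handles `Disallow` lines in `User-agent: *` groups with plain prefix matching, and ignores `Allow` lines and wildcards, so treat it as an illustration rather than a full parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: prefix-matches Disallow rules that apply to "User-agent: *". */
final class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Parses robots.txt text, keeping only Disallow values from groups that name "*". */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean appliesToUs = false;   // true while inside a group that names "*"
        boolean lastWasAgent = false;  // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    appliesToUs = false; // a new group starts here
                }
                if (value.equals("*")) {
                    appliesToUs = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With the two lines quoted above as input, `RobotsRules.parse(text).isAllowed("/search?q=java")` returns false, which is exactly why a rule-obeying crawler cannot fetch the search results page directly.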

I know that if I search for keywords on the search engines with a crawler, my IP might get blocked.

So my new plan is that my webcrawler will

  1. Search for the keyword on Google, Bing, or Yahoo (at most 2-3 times a day, spread over different times)
  2. Open the pages/links returned by the search engine (with a 2-3 minute delay before opening each page/link)
  3. Build the report from those pages.
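The delay in step 2 could be enforced with a small throttle like the sketch below. `PoliteThrottle` is a hypothetical class name, and the gap is whatever value you pass in (2 minutes here only because that is the figure in the plan). It uses only the standard library, so it is independent of Jsoup.

```java
import java.time.Duration;

/** Hypothetical helper: enforces a minimum gap between consecutive requests. */
final class PoliteThrottle {
    private final long minGapMillis;
    private long lastRequestAt = -1; // -1 means no request has been made yet

    PoliteThrottle(Duration minGap) {
        this.minGapMillis = minGap.toMillis();
    }

    /** Milliseconds still to wait at time nowMillis before the next request may start. */
    synchronized long millisToWait(long nowMillis) {
        if (lastRequestAt < 0) {
            return 0; // first request goes out immediately
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minGapMillis - elapsed);
    }

    /** Records that a request started at nowMillis. */
    synchronized void markRequest(long nowMillis) {
        lastRequestAt = nowMillis;
    }

    /** Blocks until the minimum gap has passed, then records the new request time. */
    void awaitTurn() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

In the crawler loop, one would create `new PoliteThrottle(Duration.ofMinutes(2))` once and call `awaitTurn()` right before each `Jsoup.connect(url).get()`.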

Questions

  1. Even with this much care, will Google still block my IP? Is it safe to crawl like this?
  2. What are good techniques for using proxies to hide/change my actual IP address?

PS: I am using Java and Jsoup for web crawling.

Try Selenium for this. It is meant for automation, so I don't think any service provider will block your IP.
