
Python - Easy way to scrape Google, download top N hits (entire .html documents) for given search?

Original post: 2011-03-16 05:32:20. Tags: python, web-scraping, urllib2, google-search

Is there an easy way to scrape Google and write the text (just the text) of the top N (say, 1000) .html (or whatever) documents for a given search?

As an example, imagine searching for the phrase "big bad wolf" and downloading just the text from the top 1000 hits, i.e., actually downloading the text from those 1000 web pages (but just those pages, not the entire site).

I'm assuming this would use the urllib2 library? I use Python 3.1 if that helps.

3 answers

Check out BeautifulSoup for scraping the content out of web pages. It is supposed to be very tolerant of broken web pages, which will help because not all results are well formed. So you should be able to:

1. Request http://www.google.ca/search?q=QUERY_HERE
2. Extract and follow result links using BeautifulSoup (result links appear to be marked with class="r")
3. Extract text from result pages using BeautifulSoup
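
The three steps above can be sketched roughly as follows in Python 3, where urllib2 no longer exists and its functionality lives in urllib.request. To keep the sketch runnable without third-party packages, it swaps in the standard library's html.parser for BeautifulSoup, so it is not the parser this answer recommends. The names `search_url`, `fetch_html`, `extract_result_links` and `extract_text` are made up for illustration, and the class="r" marker is only what the answer observed at the time; Google's markup may well differ now.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus
from urllib.request import Request, urlopen


def search_url(query):
    """Step 1: build the search URL from the answer (query is URL-encoded)."""
    return "http://www.google.ca/search?q=" + quote_plus(query)


def fetch_html(url, timeout=10):
    """Download one page and decode it to text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry class="r" or sit inside such an element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._result_tag = None  # name of the enclosing class="r" element, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_result = "r" in (attributes.get("class") or "").split()
        if tag == "a" and (is_result or self._result_tag):
            if attributes.get("href"):
                self.links.append(attributes["href"])
        elif is_result:
            self._result_tag = tag

    def handle_endtag(self, tag):
        if tag == self._result_tag:
            self._result_tag = None


class TextParser(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_result_links(html):
    """Step 2: pull the result links out of a search results page."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links


def extract_text(html):
    """Step 3: reduce a result page to its text."""
    parser = TextParser()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Chaining them would look like `[extract_text(fetch_html(u)) for u in extract_result_links(fetch_html(search_url("big bad wolf")))]`. With BeautifulSoup installed, the two parser classes could presumably be replaced by its own tag lookup and text extraction, which should cope better with malformed pages.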

As mentioned, scraping Google violates their TOS. That said, that's probably not the answer you're looking for.

There's a PHP script available that does a perfect job of scraping Google: http://google-scraper.squabbel.com/ Just give it a keyword and the number of results you want, and it'll return all the results for you. Then parse out the returned URLs, use urllib or curl to fetch the HTML source, and you're done.

You also really shouldn't attempt to scrape Google unless you have more than 100 proxy servers, though. They'll readily ban your IP temporarily after a few attempts.
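
The download step mentioned above (take the URLs you got back, then fetch each page's HTML) might look something like this with urllib.request in Python 3. This is only a sketch: `fetch_bytes` and `download_all` are hypothetical names, the numbered file naming is an arbitrary choice, and the pause between requests is a courtesy to the result sites rather than a way around Google's bans.

```python
import time
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_bytes(url, timeout=10):
    """Download one page and return its raw bytes."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def download_all(urls, out_dir, fetch=fetch_bytes, delay_seconds=1.0):
    """Save each URL's raw HTML as out_dir/0000.html, 0001.html, ...; return the saved paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        try:
            body = fetch(url)
        except OSError as error:
            # urllib.error.URLError and socket timeouts are OSError subclasses
            print("skipping", url, error)
            continue
        target = out_path / "{:04d}.html".format(index)
        target.write_bytes(body)
        saved.append(target)
        time.sleep(delay_seconds)  # pause between successful downloads
    return saved
```

The `fetch` parameter is there so the loop can be tried out with a stand-in function before pointing it at real sites.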

The official way to get results from Google programmatically is to use Google's Custom Search API. As icktoofay comments, other approaches (such as directly scraping the results or using the xgoogle module) break Google's terms of service. Because of that, you might want to consider using the API from another search engine, such as the Bing API or Yahoo!'s service.
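
For the Custom Search route, a minimal sketch in Python 3 could look like the following. It assumes the JSON flavour of the API at `https://www.googleapis.com/customsearch/v1` with the `key`, `cx`, `q`, `num` and `start` query parameters, and it assumes you have already created an API key and a search engine ID (the `api_key` and `engine_id` arguments are placeholders). The helper names are made up for illustration. As far as I know the API returns at most 10 results per request and caps a query at around 100 results overall, so the 1000 hits from the question are probably out of reach this way.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_urls(query, api_key, engine_id, total=30, page_size=10):
    """Return one request URL per page of results; `start` is 1-based."""
    urls = []
    for start in range(1, total + 1, page_size):
        params = {
            "key": api_key,
            "cx": engine_id,
            "q": query,
            "num": min(page_size, total - start + 1),
            "start": start,
        }
        urls.append(API_ENDPOINT + "?" + urlencode(params))
    return urls


def links_from_response(payload):
    """Pull result URLs out of one decoded JSON response; no "items" key means no hits."""
    return [item["link"] for item in payload.get("items", [])]


def search(query, api_key, engine_id, total=30):
    """Fetch each page in turn and return the combined list of result URLs."""
    links = []
    for url in build_search_urls(query, api_key, engine_id, total):
        with urlopen(url, timeout=10) as response:
            links.extend(links_from_response(json.load(response)))
    return links
```

The URLs returned by `search(...)` can then be downloaded and reduced to text like any other list of pages.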