Scraping Google News headlines

Question

Google News is searchable by keyword, and the search can then be narrowed down to a certain time period.

I tried running the search on the website and then using the URL of the results page to reproduce the search in Python:

import urllib2


url = 'https://www.google.com/search?hl=en&gl=uk&tbm=nws&authuser=0&q=apple&oq=apple&gs_l=news-cc.3..43j0l9j43i53.5710.6848.0.7058.5.4.0.1.1.0.66.230.4.4.0...0.0...1ac.1.SRcIeXL5d48'

handler = urllib2.urlopen(url)
html = handler.read()

However, I get a 403 error. This method works with other websites, such as bbc.co.uk, so evidently Google does not want me to scrape the site with Python.

So I have two questions: 1) Is it possible to bypass this restriction Google has put in place? If so, how? 2) Are there any other scrapeable news sites where I can search for news on a keyword within a given period?

For either option, I don't mind using a paid service, so such suggestions are welcome too.

Thanks in advance, K.

Answer 1

Try setting the User-Agent header on the request:

import urllib2

req = urllib2.Request(url)  # url: the search URL from the question
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
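
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is only an illustration under some assumptions: the helper names build_request and fetch are made up for this example, the User-Agent string is based on the one in the answer, and there is no guarantee that Google will accept the request; it may still answer with 403.

```python
import urllib.error
import urllib.request

# Assumption: a browser-style string based on the one in the answer;
# any common browser User-Agent could be substituted.
USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) '
              'Gecko/2008092417 Firefox/3.0.3')


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Return (status, body); on an HTTP error return (error code, None)."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # For example 403 if the server still refuses the request.
        return err.code, None
```

Calling status, body = fetch(url) with the search URL from the question then shows whether the header made a difference: a status of 200 comes with the page body, while a refused request returns its error code and None.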


Answer 1 | 2 | accepted | 2014-11-29 00:12:55
