Scraping Google News headlines

Google News is searchable by keyword, and the search can then be narrowed down to a certain time period.

I tried doing the search on the website and then using the URL of the results page to reverse-engineer the search in Python, like this:

import urllib2


url = 'https://www.google.com/search?hl=en&gl=uk&tbm=nws&authuser=0&q=apple&oq=apple&gs_l=news-cc.3..43j0l9j43i53.5710.6848.0.7058.5.4.0.1.1.0.66.230.4.4.0...0.0...1ac.1.SRcIeXL5d48'

handler = urllib2.urlopen(url)
html = handler.read()

However, I get a 403 error. This method works with other websites, such as bbc.co.uk, so apparently Google does not want me to scrape the website with Python.

So I have two questions: 1) Is it possible to bypass this restriction Google has placed? If so, how? 2) Are there any other scrapeable news sites where I can search for news on a keyword for a given period?

For either option, I don't mind using a paid service, so such suggestions are welcome too.

Thanks in advance, K.

Try setting the User-Agent header:

import urllib2

# `url` is the search URL from the question
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
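
For anyone on Python 3, where urllib2 has been folded into urllib.request, the same idea might look roughly like the sketch below. It is only an illustration under some assumptions: the function names (build_news_search_url, build_request, fetch) are made up for this example, the URL keeps only the readable parameters from the question (hl, gl, tbm, q) on the guess that the long gs_l token is not required, and nothing here guarantees that Google will actually accept the request.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

# Browser-like User-Agent string, the same one suggested in the answer.
USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) "
    "Gecko/2008092417 Firefox/3.0.3"
)


def build_news_search_url(keyword, language="en", country="uk"):
    """Assemble a search URL restricted to news results (hypothetical helper).

    Only the readable parameters from the question's URL are kept; dropping
    the others is an assumption, not documented behaviour.
    """
    params = [
        ("hl", language),
        ("gl", country),
        ("tbm", "nws"),
        ("q", keyword),
    ]
    return "https://www.google.com/search?" + urlencode(params)


def build_request(url, user_agent=USER_AGENT):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Return the response body as bytes, or None if the server refuses."""
    req = build_request(url)
    try:
        with urllib.request.urlopen(req, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # A 403 here would mean the server still rejects the request.
        print("request refused with HTTP status", err.code)
        return None
```

Calling fetch(build_news_search_url("apple")) would then attempt the same search as in the question. Building the Request separately from sending it makes it easy to check the headers before anything goes over the network.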
