
Web scraping using Python

I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user agent issue, but changing that did not help. Then I thought it may have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What may be blocking requests through urllib?
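
For context, a minimal reconstruction of the kind of request that produces the 403 might look like the following (hypothetical; the User-Agent value and the exact calls are assumptions, not the asker's actual code):

import urllib2

# Setting only a User-Agent is apparently not enough for this site;
# the request still comes back 403 Forbidden.
req = urllib2.Request('http://www.nseindia.com/')
req.add_header('User-Agent', 'Mozilla/5.0')
try:
    html = urllib2.urlopen(req).read()
except urllib2.HTTPError as e:
    print e.code  # prints 403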

http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:

import urllib2

# Build the request and add the headers the site expects.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')

# Open the request and read the response body.
opener = urllib2.build_opener()
content = opener.open(r).read()

Refusing requests without Accept headers is incorrect; RFC 2616 clearly states

If no Accept header field is present, then it is assumed that the client accepts all media types.
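
For readers on Python 3, where urllib2 was split into urllib.request, a minimal equivalent sketch (assuming the same headers as in the example above) would be:

from urllib.request import Request, urlopen

# Same request as the urllib2 example, expressed with Python 3's urllib.request.
req = Request('http://www.nseindia.com/',
              headers={'Accept': '*/*',
                       'User-Agent': 'My scraping program <author@example.com>'})
content = urlopen(req).read()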
