
Web scraping using Python

I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user agent issue, but changing that did not help. Then I thought it may have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What may be blocking requests through urllib?
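
For context, a minimal reconstruction of the kind of request that produces the 403 might look like the following (hypothetical; the User-Agent value and the exact calls are assumptions, not the asker's actual code):

import urllib2

# Setting only a User-Agent is apparently not enough for this site;
# the request still comes back 403 Forbidden.
req = urllib2.Request('http://www.nseindia.com/')
req.add_header('User-Agent', 'Mozilla/5.0')
try:
    html = urllib2.urlopen(req).read()
except urllib2.HTTPError as e:
    print e.code  # prints 403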

http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:

import urllib2

# Build the request and add the headers the site expects.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')

# Open the request and read the response body.
opener = urllib2.build_opener()
content = opener.open(r).read()

Refusing requests without Accept headers is incorrect; RFC 2616 clearly states

If no Accept header field is present, then it is assumed that the client accepts all media types.
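
For readers on Python 3, where urllib2 was split into urllib.request, a minimal equivalent sketch (assuming the same headers as in the example above) would be:

from urllib.request import Request, urlopen

# Same request as the urllib2 example, expressed with Python 3's urllib.request.
req = Request('http://www.nseindia.com/',
              headers={'Accept': '*/*',
                       'User-Agent': 'My scraping program <author@example.com>'})
content = urlopen(req).read()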
