I have a Python project where I need to go through numerous sites and parse them.
I have noticed that in quite a few instances, requests fails to correctly get the site content even though the site opens just fine in Chrome and FF. For instance, in my code:
def get_site_content(site):
    try:
        content = requests.get(site, allow_redirects=True)
        content = content.text
    except Exception as e:
        if DEBUG:
            print type(e)
            print e.args
            print e
        global errors
        errors += 1
        return ''
    soup = BeautifulSoup(content)
    # parse, tokenize and filter the content of the site
    [...]
    return tokenized_content
Afterwards, I check whether the returned site content is ''. If it is, I know that an error has occurred and I print that loading that site has failed.
In my log:
Progress: [=========-] 1.8% Failed to load site : http://www.mocospace.com
[...]
Progress: [=========-] 87.8% Failed to load site : http://www.hotchalk.com
Progress: [=========-] 93.2% Failed to load site : http://Hollywire.com
Progress: [=========-] 93.8% Failed to load site : http://www.Allplaybook.com
If I run the exact same code in the Python shell, however:
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> content = requests.get("http://www.mocospace.com", allow_redirects=True)
>>> content
<Response [200]>
>>> content.text
u'<?xml version="1.0" encoding="utf-8"?>\r\n<!DOCTYPE html PUBLIC [...]
Even in instances where I get a 403, no exception is raised, as it should be:
>>> content = requests.get("http://www.hotchalk.com", allow_redirects=True)
>>> content
<Response [403]>
>>> content.text
u'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
The only way for the log to say that loading has failed is if an exception was raised in get_site_content() and it returned '':
# data is a list of all urls together with their category
for row in data:
    content = get_site_content(row['URL'])
    if content:
        classifier_data.append((content, row['Category']))
    else:
        print "Failed to load site : %s" % row['URL']
What could this behaviour possibly be caused by? If this were C, I'd be looking for something involving pointers and undefined behaviour, but I cannot seem to find anything that could cause anything similar here.
Edit:
Using the robotparser module, I checked the robots.txt file of one of the sites above and noted that User-agent: * is set at the very top. I do not see any entries that would disallow me from accessing its index page, so could this have been caused by something else?
In Python shell:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.mocospace.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.mocospace.com")
True
By default, requests does not raise an exception when a server sends a response, even if that response carries a 4xx or 5xx status. If you want requests to raise an exception for such response codes, then you need to explicitly tell it to do so:
response = requests.get(site, allow_redirects=True)
response.raise_for_status()
content = response.text
or inspect the response.status_code
attribute and alter your behaviour based on its value. Also see Response Status Codes in the quickstart.
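For illustration, here is a minimal sketch of how the get_site_content() from the question could use raise_for_status() so that 4xx/5xx responses are logged the same way as connection errors. It assumes the DEBUG flag and errors counter from the question exist as globals; the timeout argument is an addition, not part of the original code:

def get_site_content(site):
    global errors
    try:
        # optional timeout so a single unresponsive site cannot hang the crawl
        response = requests.get(site, allow_redirects=True, timeout=10)
        # turn 4xx/5xx responses into requests.exceptions.HTTPError
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # catches connection errors, timeouts and HTTP error statuses alike
        if DEBUG:
            print type(e), e
        errors += 1
        return ''
    return response.text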
As for sites behaving differently when called with requests: remember that HTTP servers are essentially black boxes. Within the bounds of the HTTP RFC, they are free to respond as they please. That includes filtering on headers and altering behaviour based on everything in the request, up to and including entirely random responses.
Your browser sends a different set of headers than requests does; the usual culprit is the User-Agent header, but other headers such as Referer and Accept are also quite often involved. This is not a bug in requests.
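If you want to see for yourself what requests sent, you can inspect response.request.headers and compare it with the request headers your browser shows in its network inspector. By default it is little more than a python-requests/<version> User-Agent plus Accept and Accept-Encoding; the exact values depend on your requests version, so the output is not shown here:

>>> response = requests.get("http://www.hotchalk.com", allow_redirects=True)
>>> response.request.headers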
How they behave depends on each specific site's configuration. You can try setting additional headers such as User-Agent to spoof desktop browsers, but do take into account that not all sites welcome such behaviour. If you are spidering a site, try to honour its /robots.txt policy and do not spider sites that ask you not to. If you want to automate this process, you can use the robotparser module that comes with Python.
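As a rough sketch, such a check could be combined with the loop from the question like this. It assumes the same data list and get_site_content() as above; the user agent string is only an example, and on Python 3 the module lives at urllib.robotparser instead:

import robotparser
import urlparse

def allowed_by_robots(url, user_agent='FooBar-Spider 1.0'):
    # fetch the site's robots.txt and ask whether this URL may be fetched
    parts = urlparse.urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url('%s://%s/robots.txt' % (parts.scheme, parts.netloc))
    rp.read()
    return rp.can_fetch(user_agent, url)

for row in data:
    if not allowed_by_robots(row['URL']):
        print "Skipping %s : disallowed by robots.txt" % row['URL']
        continue
    content = get_site_content(row['URL'])

In practice you would cache one RobotFileParser per host rather than re-fetch robots.txt for every URL.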
You can set additional headers with the headers
argument to requests.get()
:
headers = {'User-Agent': 'FooBar-Spider 1.0'}
response = requests.get(site, headers=headers)
but again, don't spoof browser user agent strings if a site is clearly asking you not to spider them.