
Python Requests library sometimes fails to open site that a browser can open

I have a Python project where I need to go through numerous sites and parse them.

I have noticed that in quite a few instances, requests fails to correctly get the site content even though the site opens just fine in Chrome and Firefox. For instance, in my code:

import requests

def get_site_content(site):
    try:
        content = requests.get(site, allow_redirects=True)
        content = content.text
    except Exception as e:
        if DEBUG:
            print type(e)
            print e.args
            print e
        global errors
        errors += 1
        return ''

    soup = BeautifulSoup(content)
    # parse, tokenize and filter the content of the site
    [...]
    return tokenized_content

Afterwards, I check whether the site content is ''. If so, I know that an error has occurred and I print out that loading that site has failed.

In my log:

Progress: [=========-] 1.8% Failed to load site : http://www.mocospace.com
[...]
Progress: [=========-] 87.8% Failed to load site : http://www.hotchalk.com
Progress: [=========-] 93.2% Failed to load site : http://Hollywire.com
Progress: [=========-] 93.8% Failed to load site : http://www.Allplaybook.com

If I run the exact same code in the Python shell, however:

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> content = requests.get("http://www.mocospace.com", allow_redirects=True)
>>> content
<Response [200]>
>>> content.text
u'<?xml version="1.0" encoding="utf-8"?>\r\n<!DOCTYPE html PUBLIC [...]

In instances where I get a 403, it's still not an exception - as it should be.

>>> content = requests.get("http://www.hotchalk.com", allow_redirects=True)
>>> content
<Response [403]>
>>> content.text
u'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

The only way for the log to say that loading has failed is if an exception was raised and get_site_content() returned '':

# data is a list of all urls together with their category
for row in data:
    content = get_site_content(row['URL'])

    if content:
        classifier_data.append((content, row['Category']))
    else:
        print "Failed to load site : %s" % row['URL']

What could this behaviour possibly be caused by? If this were C, I'd be looking for something involving pointers and undefined behaviour, but I cannot seem to find anything here that could cause similar problems.


Edit:

Using the robotparser module, I checked one of the above sites' robots.txt files and noted that User-agent: * is set at the very top. I do not see any entries that would otherwise disallow me from accessing its index page, so could this have been caused by something else?

In Python shell:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.mocospace.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.mocospace.com")
True

Answer:

By default, requests does not raise an exception just because the server responded with an error status code; as far as requests is concerned, the HTTP exchange succeeded. If you want requests to raise an exception for 4xx or 5xx response codes, then you need to explicitly tell it to do so:

response = requests.get(site, allow_redirects=True)
response.raise_for_status()
content = response.text

or inspect the response.status_code attribute and alter your behaviour based on its value. Also see Response Status Codes in the quickstart.
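For instance, a minimal sketch of the status-code approach, reusing the site and DEBUG names from the question and keeping only 200 responses:

response = requests.get(site, allow_redirects=True)
if response.status_code == 200:
    content = response.text
else:
    # e.g. a 403 or 404; log it and treat the site as failed
    if DEBUG:
        print "Got HTTP %d for %s" % (response.status_code, site)
    content = ''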

As for sites behaving differently when called with requests: remember that HTTP servers are essentially black boxes. Within the bounds of the HTTP RFC, they are free to respond as they please. That includes filtering on headers and altering behaviour based on anything in the request, up to and including entirely random responses.

Your browser sends a different set of headers than requests does; the usual culprit is the User-Agent header, but other headers such as Referer and Accept are also quite often involved. This is not a bug in requests.
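If you want to see exactly what was sent on the wire, the prepared request is available on the response object. A quick sketch (the default User-Agent will be something like python-requests/x.y.z rather than a browser string):

import requests

response = requests.get("http://www.hotchalk.com")
# Headers that requests actually sent, including its default User-Agent
print response.request.headers
# Headers the server sent back, for comparison
print response.headers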

How a given site behaves depends entirely on its specific configuration. You can try setting additional headers such as User-Agent to spoof a desktop browser, but do take into account that not all sites welcome such behaviour. If you are spidering a site, try to honour its /robots.txt policy and do not spider sites that ask you not to. If you want to automate this process, you could use the robotparser module that comes with Python, as sketched below.
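For example, a rough sketch of such an automated check before fetching; allowed_by_robots() is a hypothetical helper and the user agent string is just a placeholder:

import robotparser
import urlparse

def allowed_by_robots(url, user_agent='FooBar-Spider 1.0'):
    # Build the robots.txt URL for the site and test the page URL against it
    parts = urlparse.urlparse(url)
    robots_url = '%s://%s/robots.txt' % (parts.scheme, parts.netloc)
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# e.g. guarding the fetch loop from the question:
# if allowed_by_robots(row['URL']):
#     content = get_site_content(row['URL'])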

You can set additional headers with the headers argument to requests.get():

headers = {'User-Agent': 'FooBar-Spider 1.0'}
response = requests.get(site, headers=headers)

but again, don't spoof browser user agent strings if a site is clearly asking you not to spider them.
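Putting these pieces together, one possible way to rework the fetching part of get_site_content() from the question; this is a sketch only, which assumes the DEBUG flag from the question, uses a descriptive placeholder spider User-Agent, and treats 4xx/5xx responses as failures:

import requests

HEADERS = {'User-Agent': 'FooBar-Spider 1.0'}  # placeholder spider name

def fetch_site(site):
    # Return the page text, or '' if the request failed or the server
    # answered with a 4xx/5xx status.
    try:
        response = requests.get(site, headers=HEADERS,
                                allow_redirects=True, timeout=10)
        response.raise_for_status()  # turn 403, 404, 500, ... into exceptions
    except requests.exceptions.RequestException as e:
        if DEBUG:
            print type(e)
            print e
        return ''
    return response.text

The BeautifulSoup parsing and tokenizing from the original function would then operate on the returned text as before.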
