
Retry loading page on timeout with urllib2?

I am trying to force Python to retry loading the page when I get a timeout error. Is there a way that I can make it retry a specific number of times, possibly after a specific time delay?

Any help would be appreciated.

Thank you.

urllib2 doesn't have anything built-in for that, but you can write it yourself.

The tricky part is that, as the urlopen docs say, no matter what goes wrong, you just get a URLError. So, how do you know whether it was a timeout or something else?

Well, if you look up URLError, it says it will have a reason which will be a socket.error for remote URLs. And if you look up socket.error, it tells you that it's a subclass of either IOError or OSError (depending on your Python version). And if you look up OSError, it tells you that it has an errno that represents the underlying error.

So, which errno value do you get for a timeout? I'm willing to bet it's EINPROGRESS, but let's find out for sure:

>>> import errno, urllib2
>>> urllib2.urlopen('http://127.0.0.1', timeout=0)
urllib2.URLError: <urlopen error [Errno 36] Operation now in progress>
>>> errno.errorcode[36]
'EINPROGRESS'

(You could just use the number 36, but that's not guaranteed to be the same across platforms; errno.EINPROGRESS should be more portable.)

So:

import errno
import urllib2

def retrying_urlopen(retries, *args, **kwargs):
    for i in range(retries):
        try:
            return urllib2.urlopen(*args, **kwargs)
        except urllib2.URLError as e:
            # Retry only on a timeout; re-raise any other error immediately.
            if e.reason.errno == errno.EINPROGRESS:
                continue
            raise
    # Every attempt timed out, so re-raise the last timeout error.
    raise e

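Since the question also asks about waiting between attempts, here is a minimal sketch of that variant with a fixed delay in seconds. The function name retrying_urlopen_with_delay and the delay parameter are hypothetical additions for illustration, not part of urllib2:

import errno
import time
import urllib2

def retrying_urlopen_with_delay(retries, delay, *args, **kwargs):
    # Like retrying_urlopen above, but sleeps `delay` seconds between timed-out attempts.
    for i in range(retries):
        try:
            return urllib2.urlopen(*args, **kwargs)
        except urllib2.URLError as e:
            # Only sleep and retry on a timeout, and only if attempts remain.
            if e.reason.errno == errno.EINPROGRESS and i + 1 < retries:
                time.sleep(delay)
                continue
            raise

# Example usage: up to 3 attempts, 5 seconds apart, 10-second timeout on each.
page = retrying_urlopen_with_delay(3, 5, 'http://www.google.com/ncr', timeout=10)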
If you think this sucks and should be a lot less clunky… well, I think everyone agrees. Exceptions have been radically improved twice, with another big one coming up, plus various small changes along the way. But if you stick with 2.7, you don't get the benefits of those improvements.

If moving to Python 3.4 isn't possible, maybe moving to a third-party module like requests or urllib3 is. Both of those libraries have a separate exception type for Timeout, instead of making you grub through the details of a generic URLError.

Check out the requests library. If you'd like to wait only for a specified amount of time (not for the entire download, just until you get a response from the server), just add the timeout argument to the standard URL request, in seconds:

r = requests.get(url, timeout=10)

If the timeout is exceeded, it raises a requests.exceptions.Timeout exception, which can be handled however you wish. For example, you could put the request in a try/except block, catch the exception if it's raised, and retry the connection a specified number of times before failing completely.
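A minimal sketch of that pattern (the URL, attempt count, and delay below are placeholder values of my own choosing, not from the answer above):

import time
import requests

url = 'http://www.google.com/ncr'   # placeholder URL
max_attempts = 4

for attempt in range(max_attempts):
    try:
        r = requests.get(url, timeout=10)
        break                        # got a response; stop retrying
    except requests.exceptions.Timeout:
        print "Timed out, attempt", attempt + 1, "of", max_attempts
        time.sleep(5)                # optional delay before the next attempt
else:
    raise RuntimeError("all %d attempts timed out" % max_attempts)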

You might also want to check out requests.adapters.HTTPAdapter, which has a max_retries argument. It's typically used within a Requests Session, and according to the docs, it provides a general-case interface for Requests sessions to contact HTTP and HTTPS URLs by implementing the Transport Adapter interface.
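As a rough sketch (not a definitive recipe), mounting such an adapter on a Session might look like the following; note that, per the requests docs, max_retries covers failed connection attempts, not requests where the server has already started responding:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(max_retries=3)     # retry failed connections up to 3 times
session.mount('http://', adapter)
session.mount('https://', adapter)

r = session.get('http://www.google.com/ncr', timeout=10)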

I am new to Python, but I think even a simple solution like this could do the trick.

Begin by setting stuff to None, where stuff will hold the page source. Also note that I have only handled the URLError exception; you may want to catch more as desired.

import urllib2
import time

stuff = None          # will hold the page source once a request succeeds
max_attempts = 4
r = 0                 # attempts made so far

while stuff is None and r < max_attempts:
    try:
        response = urllib2.urlopen('http://www.google.com/ncr', timeout=10)
        stuff = response.read()
    except urllib2.URLError:
        r += 1
        print "Re-trying, attempt -- ", r
        time.sleep(5)  # wait 5 seconds before the next attempt

print stuff

Hope that helps.

Regards,

Md. Mohsin
