
urllib2 urlopen read timeout/block

Recently I have been working on a tiny crawler for downloading images from a URL.

I use urlopen() from urllib2 together with open()/f.write():

Here is the code snippet:

import re
import urllib2

# the list of the images' URLs
imglist = re.findall(regImg, pageHtml)

# iterate to download the images
for index in xrange(1, len(imglist) + 1):
    img = urllib2.urlopen(imglist[index - 1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # read() may block for a long time here,
    # so I wonder whether there is any mechanism to retry once a time threshold is exceeded
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)

In the code above, img.read() can potentially block for a long time; I would like some way to retry or re-open the image URL when that happens.

I am also concerned about the efficiency of the code above: if the number of images to download is fairly large, using a thread pool to download them seems better.

Any suggestions? Thanks in advance.

PS: I found that it is the read() method on the img object that may block, so adding a timeout parameter to urlopen() alone seems useless. But the file-like object has no timeout version of read(). Any suggestions on this? Thanks very much.

urllib2.urlopen has a timeout parameter which is used for all blocking operations (connection setup, etc.).
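
For illustration, a minimal sketch of passing that timeout (the URL and the 30-second value below are placeholders, not part of the answer):

import socket
import urllib2

try:
    # the timeout covers blocking socket operations (connection attempt, socket reads)
    img = urllib2.urlopen('http://example.com/some.jpg', timeout=30)
    data = img.read()
except (urllib2.URLError, socket.timeout) as e:
    print('Download failed or timed out: %s' % e)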

This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same. The url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist in the filesystem.

import threading
import urllib
import Queue
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    threads = []
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()
        threads.append(t)

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            #grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            #signals to queue job is done
            self.queue.task_done()
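
For completeness, a hedged usage sketch of the function above; the (url, path) pairs below are made up:

if __name__ == '__main__':
    # hypothetical image URLs and local paths -- replace with the real list
    urls_and_paths = [
        ('http://example.com/img/1.jpg', r'E:\OK\1.jpg'),
        ('http://example.com/img/2.jpg', r'E:\OK\2.jpg'),
    ]
    # four worker threads, skip files that are already on disk
    download_urls(urls_and_paths, num_concurrent=4, skip_existing=True)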

When you create the connection with urllib2.urlopen(), you can give a timeout parameter.

As described in the docs:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

With this you will be able to enforce a maximum waiting duration and catch the exception that is raised.
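
A minimal retry sketch along these lines (the fetch_image name, timeout value, and retry count are assumptions, not part of the answer):

import socket
import urllib2

def fetch_image(url, timeout=20, retries=3):
    # try the download a few times before giving up
    for attempt in range(retries):
        try:
            response = urllib2.urlopen(url, timeout=timeout)
            return response.read()
        except (urllib2.URLError, socket.timeout) as e:
            print('Attempt %d for %s failed: %s' % (attempt + 1, url, e))
    return None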

The way I crawl a huge batch of documents is with a batch processor which crawls and dumps constant-sized chunks.

Suppose you need to crawl a pre-known batch of, say, 100K documents. You can have some logic to generate constant-sized chunks of, say, 1000 documents, which would be downloaded by a thread pool. Once a whole chunk is crawled, you can do a bulk insert into your database and then proceed with the next 1000 documents, and so on.

Advantages you get by following this approach:

  • You get the advantage of a thread pool speeding up your crawl rate.

  • It is fault tolerant in the sense that you can continue from the chunk where it last failed.

  • You can have chunks generated on the basis of priority, i.e. important documents to crawl first, so that if you are unable to complete the whole batch, the important documents are already processed and the less important ones can be picked up on the next run.
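
A rough sketch of this chunked approach; the download_one() worker, the bulk_insert() hook, and the use of multiprocessing.pool.ThreadPool are assumptions for illustration:

import socket
import urllib2
from multiprocessing.pool import ThreadPool

CHUNK_SIZE = 1000  # constant-sized chunk, e.g. 1000 documents per chunk

def download_one(url):
    # hypothetical worker: fetch a single document, return None on failure
    try:
        return url, urllib2.urlopen(url, timeout=30).read()
    except (urllib2.URLError, socket.timeout):
        return url, None

def bulk_insert(results):
    # hypothetical hook: replace with a real bulk insert into your database
    print('Inserting %d documents' % len(results))

def crawl_in_chunks(doc_urls, num_threads=8):
    pool = ThreadPool(num_threads)
    for start in xrange(0, len(doc_urls), CHUNK_SIZE):
        chunk = doc_urls[start:start + CHUNK_SIZE]
        # the thread pool downloads one whole chunk at a time
        results = pool.map(download_one, chunk)
        # once the chunk is crawled, bulk insert before moving on to the next chunk
        bulk_insert(results)
    pool.close()
    pool.join()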

An ugly hack that seems to work.

import os, socket, threading, errno

def timeout_http_body_read(response, timeout = 60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    try:
        t.start()
        body = response.read()
        t.cancel()
    except socket.error as se:
        if se.errno == errno.EBADF: # murha happened
            return (False, None)
        raise
    return (True, body)
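
A hedged usage sketch of the hack above combined with a simple retry loop (the retry count, timeout, and read_with_retries name are made up):

import urllib2

def read_with_retries(url, retries=3, timeout=60):
    # re-open the URL and retry whenever the blocking read() was cut short
    for attempt in xrange(retries):
        response = urllib2.urlopen(url, timeout=timeout)
        ok, body = timeout_http_body_read(response, timeout=timeout)
        if ok:
            return body
        print('read() of %s timed out, retrying (%d/%d)' % (url, attempt + 1, retries))
    return None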
