
urllib2 urlopen read timeout/block

Recently I have been working on a tiny crawler for downloading the images on a URL.

I use urlopen() from urllib2 together with open()/f.write():

Here is the code snippet:

# the list for the images' urls
imglist = re.findall(regImg,pageHtml)

# iterate to download images
for index in xrange(1,len(imglist)+1):
    img = urllib2.urlopen(imglist[index-1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)

In the code above, img.read() can potentially block for a long time. When that happens, I would like to retry or re-open the image URL.

I am also concerned about the efficiency of the code above: if the number of images to download is somewhat large, using a thread pool to download them seems better.

Any suggestions? Thanks in advance.

PS: I found that the read() method on the img object may block, so adding a timeout parameter to urlopen() alone seems useless. But the file-like object has no timeout version of read(). Any suggestions on this? Thanks very much.

urllib2.urlopen has a timeout parameter which is used for all blocking operations (connection setup etc.).
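For example, the urlopen() call in the question's loop could simply pass it (a minimal sketch; the 10-second value is an arbitrary assumption):

# pass a per-socket-operation timeout (in seconds) when opening the URL
img = urllib2.urlopen(imglist[index-1], timeout=10)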

This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same. The url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist in the filesystem.

import threading
import urllib
import Queue
from os.path import exists


def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    threads = []
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()
        threads.append(t)

    # block until every queued item has been processed
    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            #grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            #signals to queue job is done
            self.queue.task_done()
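A hypothetical call for the question's use case, building the (url, path) pairs from imglist:

pairs = [(url, r'E:\OK\%d.jpg' % (i + 1)) for i, url in enumerate(imglist)]
download_urls(pairs, num_concurrent=4, skip_existing=True)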

When you create the connection with urllib2.urlopen(), you can give a timeout parameter.

As described in the doc:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

With this you will be able to enforce a maximum waiting duration and catch the exception that is raised.
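A rough retry sketch along these lines (the retry count, the timeout value and the choice of exceptions to catch are assumptions, not taken from the answer):

import socket
import urllib2

def read_with_retries(url, timeout=30, retries=3):
    # retry the whole open + read a few times before giving up
    for attempt in range(retries):
        try:
            response = urllib2.urlopen(url, timeout=timeout)
            return response.read()
        except (socket.timeout, urllib2.URLError) as e:
            print 'Attempt %d for %s failed: %s' % (attempt + 1, url, e)
    return None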

The way I crawl a huge batch of documents is with a batch processor which crawls and dumps constant-sized chunks.

Suppose you are to crawl a pre-known batch of, say, 100K documents. You can have some logic that generates constant-size chunks of, say, 1000 documents, which are then downloaded by a thread pool. Once the whole chunk is crawled, you can do a bulk insert into your database, and then proceed with the next 1000 documents, and so on (a rough sketch follows the list below).

Advantages you get by following this approach:

  • You get the advantage of the thread pool speeding up your crawl rate.

  • It is fault tolerant in the sense that you can continue from the chunk where it last failed.

  • You can have chunks generated on the basis of priority, i.e. important documents to crawl first. So in case you are unable to complete the whole batch, the important documents are processed and the less important ones can be picked up on the next run.
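A rough sketch of that chunking loop, reusing the download_urls() thread pool from above (the chunk size and the path_for() / bulk_insert() helpers are hypothetical):

def crawl_in_chunks(all_urls, chunk_size=1000):
    # walk the pre-known batch in constant-sized chunks
    for start in range(0, len(all_urls), chunk_size):
        chunk = all_urls[start:start + chunk_size]
        # download one chunk with the thread pool (path_for() is a hypothetical helper)
        pairs = [(url, path_for(url)) for url in chunk]
        download_urls(pairs, num_concurrent=8, skip_existing=True)
        # once the whole chunk is crawled, bulk insert it into the database (hypothetical helper)
        bulk_insert(chunk)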

An ugly hack that seems to work.

import os, socket, threading, errno

def timeout_http_body_read(response, timeout = 60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    try:
        t.start()
        body = response.read()
        t.cancel()
    except socket.error as se:
        if se.errno == errno.EBADF: # murha happened
            return (False, None)
        raise
    return (True, body)
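A hypothetical way to plug this into the question's loop:

img = urllib2.urlopen(imglist[index-1])
ok, body = timeout_http_body_read(img, timeout=60)
if ok:
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    f.write(body)
    f.close()
else:
    print 'Reading image %d timed out, skipping.' % index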
