urllib2 urlopen read timeout/blocking

Question

Recently I have been working on a small crawler that downloads images from a URL.

I use urlopen() from urllib2 together with open() and f.write():

Here is the code snippet:

# the list for the images' urls
imglist = re.findall(regImg,pageHtml)

# iterate to download images
for index in xrange(1,len(imglist)+1):
    img = urllib2.urlopen(imglist[index-1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)

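In the code above, img.read() can block for a long time. When that happens, I would like to retry or reopen the image URL.

I am also concerned about the efficiency of the code above: if the number of images to download is fairly large, downloading them with a thread pool seems better.

Any suggestions? Thanks in advance.

P.S. I found that it is the read() method on the img object that may block, so adding a timeout parameter to urlopen() alone seems useless. But I found that file objects have no version of read() with a timeout. Any suggestions on this? Thanks very much.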

Answer 1

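urllib2.urlopen has a timeout parameter which is used for all blocking operations (connection setup, etc.).

This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same. url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist in the filesystem.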

import Queue
import threading
import urllib
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            #grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            #signals to queue job is done
            self.queue.task_done()

Answer 2

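When you create the connection with urllib2.urlopen(), you can pass a timeout parameter.

As described in the documentation:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

This way you can bound the maximum waiting time and catch the exception that is raised.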
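As a rough illustration of combining the timeout with the retry the question asks for, here is a minimal sketch. It assumes Python 3, where urllib2.urlopen lives at urllib.request.urlopen. The function name fetch_with_retry and its retries and opener parameters are made up for this example and are not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=10, retries=3, opener=urllib.request.urlopen):
    """Return the body at `url`, retrying when opening or reading fails.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = opener(url, timeout=timeout)
            try:
                # read() is a blocking socket operation too, so it stays
                # inside the try block and can trigger a retry as well.
                return response.read()
            finally:
                response.close()
        except (socket.timeout, urllib.error.URLError) as exc:
            last_error = exc
            print("Attempt %d for %s failed: %s" % (attempt, url, exc))
    # Every attempt failed: re-raise the last error for the caller.
    raise last_error
```

In the question's loop, a call such as fetch_with_retry(imglist[index - 1], timeout=10) would take the place of the urlopen() and img.read() pair.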

Answer 3

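My approach to crawling a huge number of documents is to have a batch processor that crawls and dumps constant-sized chunks.

Suppose you have to crawl a known batch of, say, 100K documents. You can have some logic generate constant-sized chunks of, say, 1000 documents, which are downloaded by a thread pool. Once a whole chunk has been crawled, you can do a bulk insert into your database, then proceed with the next 1000 documents, and so on.

Advantages of this approach:

You get the benefit of a thread pool speeding up your crawl.
It is fault tolerant: you can resume from the chunk where it last failed.
You can generate chunks by priority, i.e. important documents are crawled first. So if you cannot complete the whole batch, the important documents are processed and the less important ones can be picked up on the next run.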
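The chunked approach described above could be sketched roughly as follows, assuming Python 3 and its concurrent.futures thread pool. The names chunks, crawl_in_batches, fetch, store_batch and start_chunk are invented for this illustration. fetch stands in for whatever downloads one document, and store_batch for the bulk insert.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crawl_in_batches(urls, fetch, store_batch, chunk_size=1000,
                     num_workers=8, start_chunk=0):
    """Download `urls` chunk by chunk and hand each finished chunk to `store_batch`.

    `fetch(url)` and `store_batch(chunk_index, rows)` are caller-supplied
    (hypothetical) callables; `start_chunk` lets a later run skip chunks
    that were already stored.
    """
    for index, chunk in enumerate(chunks(urls, chunk_size)):
        if index < start_chunk:
            continue  # already handled by an earlier run
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            # map() returns results in the same order as `chunk`
            results = list(pool.map(fetch, chunk))
        # one bulk write per chunk, e.g. a single database insert
        store_batch(index, list(zip(chunk, results)))
```

To crawl important documents first, sort urls by priority before passing them in. To resume after a failure, record the last chunk_index that store_batch received and pass the following index as start_chunk.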

Answer 4

An ugly hack that seems to work.

import os, socket, threading, errno

def timeout_http_body_read(response, timeout = 60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    try:
        t.start()
        body = response.read()
        t.cancel()
    except socket.error as se:
        if se.errno == errno.EBADF: # murha happened
            return (False, None)
        raise
    return (True, body)
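For comparison, a less invasive way to bound the total read time is to read the body in pieces and check a deadline between pieces. The sketch below is a different technique from the timer hack above, not a rewrite of it. It assumes Python 3 and a response object that has read(n) and close(). The name read_with_deadline and its parameters are hypothetical. The deadline is only checked between chunks, so this is a best-effort bound. It works best when urlopen() was also given a timeout, so that no single read can hang indefinitely.

```python
import time


def read_with_deadline(response, total_timeout=60, chunk_size=16 * 1024):
    """Read the whole body, giving up once `total_timeout` seconds have passed.

    Returns (True, body) on success and (False, None) when the deadline
    is exceeded, the same shape as timeout_http_body_read above.
    """
    deadline = time.monotonic() + total_timeout
    parts = []
    while True:
        if time.monotonic() > deadline:
            # too slow overall: drop the connection and report failure
            response.close()
            return (False, None)
        chunk = response.read(chunk_size)
        if not chunk:
            # empty read means end of body
            return (True, b"".join(parts))
        parts.append(chunk)
```

A caller can treat a (False, None) result as the signal to reopen the URL and try again.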

4 answers

Answer 1: score 2, accepted, 2012-12-06 16:28:10
Answer 2: score 1, 2012-12-06 16:29:12
Answer 3: score 1, 2012-12-06 16:36:12
Answer 4: score 0, 2013-08-27 14:49:14
