Scrapy MemoryError (too many requests), Python 2.7

Question

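I have been running a crawler in Scrapy against a large site that I would rather not name. I used the tutorial spider as a template, then created a series of start requests and let it crawl from there, using something like this: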

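def start_requests(self):
    # Iterate over the file lazily instead of loading every line with readlines()
    with open('zipcodes.csv', 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines so int() does not fail
            zipcode = int(line)
            yield self.make_requests_from_url("http://www.example.com/directory/%05d" % zipcode)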

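To start with, there are more than 10,000 of these pages. Each one queues up a fairly large directory, from which there are more pages to queue, and so on. Scrapy appears to prefer staying "shallow", piling up requests in memory instead of drilling down through them and then backing up.

The result is a large, repeating exception that ends like this: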

  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
    yield next(it)

..... (many lines) ...

  File "C:\Python27\lib\site-packages\scrapy\selector\lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
exceptions.MemoryError:

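Fairly quickly, within an hour or so of a crawl that should take several days, the Python executable balloons to 1.8 GB and Scrapy stops functioning (while continuing to cost me many wasted dollars in proxy usage fees!).

Is there any way to make Scrapy dequeue, externalize, or iterate over (I don't even know the right word) its stored requests to prevent this kind of memory problem?

(I'm not very well versed in programming beyond piecing together what I see here or in the docs, so I'm not equipped to troubleshoot under the hood, so to speak. I was also unable to get a full 64-bit Python/Django/Scrapy installation working on Windows 7, despite days of trying and reading.)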

Answer 1

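You will never reach closure when recursively following links across the entire Internet. You will need to limit the recursion one way or another. Unfortunately, the part of your code that does this is not shown. The simplest approach is to give the list of pending links to crawl a fixed maximum size and simply not add anything more until the list drops below that cap. A more advanced solution would assign each pending link a priority based on its parent page, then add the sorted results to a sorted priority list of pending links with a fixed maximum size.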
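The capped, prioritized list of pending links described above can be sketched in plain Python as follows. This is only an illustration of the idea, not part of Scrapy: the class name BoundedFrontier, its methods, and the convention that a larger number means higher priority are all assumptions made for this example.

```python
import heapq
import itertools


class BoundedFrontier(object):
    """Hypothetical pending-link store that never holds more than max_size URLs."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []                  # min-heap of (priority, order, url)
        self._order = itertools.count()  # tie-breaker so equal priorities compare cleanly

    def add(self, url, priority):
        """Offer a URL. Returns False if it was dropped because the store is full."""
        entry = (priority, next(self._order), url)
        if len(self._heap) < self.max_size:
            heapq.heappush(self._heap, entry)
            return True
        # Store is full: accept the URL only if it beats the current worst entry,
        # which a min-heap keeps at index 0.
        if priority > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def pop_best(self):
        """Remove and return the highest-priority URL (O(n), fine for a sketch)."""
        best = max(self._heap)
        self._heap.remove(best)
        heapq.heapify(self._heap)
        return best[2]

    def __len__(self):
        return len(self._heap)
```

Because low-priority links are dropped once the cap is reached, memory use stays bounded at the cost of possibly never visiting some pages.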

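Rather than trying to edit or hack the existing code, though, you should first check whether the built-in settings can do what you need. See this documentation page for reference: http://doc.scrapy.org/en/latest/topics/settings.html. It looks like setting DEPTH_LIMIT to a small value such as 1 would limit how deep the crawl recurses from your start pages.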
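As a minimal sketch, the setting would go in the project's settings.py. The value 1 here is purely illustrative; a crawl that needs to follow a start page into a directory and then into its entries would presumably need a larger number.

```python
# settings.py (fragment)

# Maximum depth to crawl from the start URLs.
# 0 (the default) means no limit; the value below is an illustrative assumption.
DEPTH_LIMIT = 1
```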

Answer 2

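You can process your URLs in batches by queuing only a few at a time, each time the spider goes idle. This avoids having many requests queued up in memory. The example below reads only the next batch of URLs from your database/file, and queues them as requests only after all previous requests have finished processing.

More information on the spider_idle signal is in the Scrapy signals documentation.

More information on debugging memory leaks: http://doc.scrapy.org/en/latest/topics/leaks.html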

from scrapy import signals, Spider
from scrapy.xlib.pydispatch import dispatcher


class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # connect the function to the spider_idle signal
        dispatcher.connect(self.queue_more_requests, signals.spider_idle)

    def queue_more_requests(self, spider):
        # this function will run every time the spider is done processing
        # all requests/items (i.e. is idle)

        # get the next urls from your database/file
        urls = self.get_urls_from_somewhere()

        # if there are no more urls to process, do nothing and
        # the spider will finally close
        if not urls:
            return

        # iterate through the urls, create a request, then send them back to
        # the crawler, this will get the spider out of its idle state
        for url in urls:
            req = self.make_requests_from_url(url)
            self.crawler.engine.crawl(req, spider)

    def parse(self, response):
        pass
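The spider above leaves get_urls_from_somewhere undefined. One possible way to back it with a file is sketched below, assuming a text file with one zip code per line as in the question. The names iter_urls and UrlBatcher, the batch size, and the URL template are hypothetical and chosen only for this example.

```python
from itertools import islice


def iter_urls(path, template):
    """Lazily yield one URL per non-blank line of the file at path."""
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                yield template % int(line)


class UrlBatcher(object):
    """Hypothetical helper that hands out URLs a fixed number at a time."""

    def __init__(self, path, batch_size=100,
                 template="http://www.example.com/directory/%05d"):
        self._urls = iter_urls(path, template)  # nothing is read until a batch is requested
        self.batch_size = batch_size

    def next_batch(self):
        """Return up to batch_size URLs, or an empty list once the file is exhausted."""
        return list(islice(self._urls, self.batch_size))
```

In the spider, get_urls_from_somewhere could then simply return self.batcher.next_batch(), with self.batcher created in __init__. An empty list signals that there is nothing left, which matches the "if not urls: return" check in queue_more_requests.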

2 answers

Answer 1
1 2015-06-16 21:07:12

Answer 2
1 Accepted 2015-06-17 03:00:40