
Python Scrapy gets stuck after 40k requests

My scrapy spider gets stuck after 40k requests.

I'm new to scrapy. Having looked around, I wonder whether the problem is related to using the default parse method name and start_urls.

I'm using custom_settings to speed things up: if a URL doesn't resolve within a few seconds, move on and don't retry.

Spider code:

import logging
import os
import re

import pandas as pd
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.utils.log import configure_logging

from ..items import SfscrapeItem

# Load the URL list shipped with the project.
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, '../resources/scraping_urls_1.1.csv')
df = pd.read_csv(filename, index_col=0)

keywords = ['canidae', 'felidae', 'cat', 'cattle', 'dog', 'donkey', 'goat',
            'guinea pig', 'horse', 'pig', 'rabbit']

class sfSpider(scrapy.Spider):
    name = 'sfspider'

    # Fail fast: short DNS/download timeouts, no retries, at most 2 redirects.
    custom_settings = {
        'DNS_TIMEOUT': 10,
        'DOWNLOAD_TIMEOUT': 10,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 2,
    }

    # Send Scrapy's log output to a file instead of the root handler.
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='log.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO,
    )

    # Just the first 75k for now...
    start_urls = df.url.to_list()[:75000]

    def parse(self, response):
        response_body = response.body.decode('utf-8')

        url = response.url
        domain = url.split('/')[2]

        # Count occurrences of each keyword in the page body.
        item = SfscrapeItem()
        item['url'] = url
        item['domain'] = domain
        item['status'] = response.status
        item['matches'] = [str(len(re.findall(keyword, response_body, re.IGNORECASE)))
                           for keyword in keywords]

        yield item

Here is the log from when it got stuck:

INFO: Crawled 40940 pages (at 565 pages/min), scraped 16473 items (at 243 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)

Here is the output of stats.get_stats() and prefs() after it got stuck. [screenshot of stats output, not reproduced here]
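For reference, a minimal sketch (an illustrative assumption, not the asker's code) of pulling the same exception counters programmatically when the spider finishes: stats.get_stats() is the standard Scrapy StatsCollector API, and closed() is the hook Scrapy calls when the spider closes. (prefs() is the telnet console's live-references report and is not reproduced here.)

import scrapy

class StatsDumpSpider(scrapy.Spider):
    # Hypothetical minimal spider that logs exception counters on close.
    name = 'stats_dump'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        self.logger.info('Got %s (%d bytes)', response.url, len(response.body))

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; self.crawler.stats
        # is the StatsCollector behind stats.get_stats().
        for key, value in self.crawler.stats.get_stats().items():
            if key.startswith('downloader/exception_type_count'):
                self.logger.info('%s: %s', key, value)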

It looks like you started getting DNSLookupError, because the stats show downloader/exception_type_count/twisted.internet.DNSLookupError: 5109 during the minutes with ...pages (at 0 pages/min), scraped x items (at 0 items/min).

The LogStats extension (which prints the INFO: Crawled X pages (at X pages/min), scraped X items (at X items/min) lines) only counts responses received; it does not count exceptions.
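To make those failures visible instead of looking like a stall, one option is to attach an errback to each request. A minimal sketch, assuming the same kind of URL list; Request(..., errback=...) and failure.check() are standard Scrapy/Twisted APIs, while the spider name and handler below are illustrative:

import scrapy
from twisted.internet.error import DNSLookupError, TimeoutError

class ErrbackSpider(scrapy.Spider):
    # Hypothetical variant of the spider above that logs each failed request.
    name = 'sfspider_errback'
    start_urls = ['https://example.com']  # e.g. df.url.to_list()[:75000]

    def start_requests(self):
        for url in self.start_urls:
            # The errback fires on DNS errors, timeouts, etc., which the
            # normal callback (and the LogStats page counts) never see.
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info('OK %s (%d)', response.url, response.status)

    def on_error(self, failure):
        if failure.check(DNSLookupError):
            self.logger.info('DNS lookup failed: %s', failure.request.url)
        elif failure.check(TimeoutError):
            self.logger.info('Timed out: %s', failure.request.url)
        else:
            self.logger.info('Request failed: %r', failure)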

