
In Scrapy, how to set a time limit for each URL?

I am trying to crawl multiple websites using Scrapy's link extractor with follow=True (recursive crawling). I am looking for a way to set a time limit on crawling each URL in the start_urls list.

Thanks

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
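For reference, a minimal sketch of the kind of recursive spider described above (a CrawlSpider with a LinkExtractor rule and follow=True; the spider name, callback name and XPaths here are only illustrative) might look like:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DmozCrawlSpider(CrawlSpider):
    # illustrative name; same allowed_domains / start_urls as above
    name = "dmoz_crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    # follow=True makes the spider keep following extracted links recursively
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = DmozItem()
        item['title'] = response.xpath('//title/text()').extract()
        item['link'] = [response.url]
        item['desc'] = response.xpath('//meta[@name="description"]/@content').extract()
        yield item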

You need to use the download_timeout meta key of scrapy.Request.

To use it for the start URLs, you need to override the spider's start_requests() method, something like:

from scrapy import Request

def start_requests(self):
    # 10 seconds for the first url
    yield Request(self.start_urls[0], meta={'download_timeout': 10})
    # 60 seconds for the second url
    yield Request(self.start_urls[1], meta={'download_timeout': 60})

You can read more about Request special meta keys here: http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys
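Putting it together, a minimal sketch of a spider with a per-URL download_timeout and an errback to notice when a request timed out might look like this (the spider name, the on_error callback and the log messages are illustrative; download_timeout, errback and failure.check are standard Scrapy/Twisted APIs):

from scrapy import Spider, Request
from twisted.internet.error import TimeoutError

class TimedSpider(Spider):
    # hypothetical spider name, for illustration only
    name = "timed"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def start_requests(self):
        # give each start URL its own download timeout (in seconds)
        timeouts = [10, 60]
        for url, timeout in zip(self.start_urls, timeouts):
            yield Request(
                url,
                meta={'download_timeout': timeout},
                callback=self.parse,
                errback=self.on_error,
            )

    def parse(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # called when the request fails, e.g. because the timeout expired
        if failure.check(TimeoutError):
            self.logger.warning("Timed out: %s", failure.request.url)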

You can use the CLOSESPIDER_TIMEOUT setting. Note that it closes the whole spider after the given number of seconds; it does not limit each URL individually.

For example, call your spider like this:

scrapy crawl dmoz -s CLOSESPIDER_TIMEOUT=10
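Alternatively, the same setting can be put on the spider itself through the custom_settings class attribute; a minimal sketch for the dmoz spider above (assuming a 10-second overall limit is acceptable) would be:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # closes the whole crawl after 10 seconds; this is an overall
    # limit, not a per-URL one
    custom_settings = {
        'CLOSESPIDER_TIMEOUT': 10,
    }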

Use a Timeout object!

import signal

class Timeout(object):
    """Timeout class using ALARM signal."""
    class TimeoutError(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)  # disable alarm

    def raise_timeout(self, *args):
        raise Timeout.TimeoutError('TimeoutError')

Then you can call your extractor inside a with statement like this (note that SIGALRM only works on Unix and from the main thread):

with Timeout(10):  # 10 seconds
    try:
        do_what_you_need_to_do()
    except Timeout.TimeoutError:
        pass  # break, continue or whatever else you may need
