I am trying to crawl multiple websites using Scrapy's link extractor with follow=True (recursive). I'm looking for a way to set a time limit on the crawl for each URL in the start_urls list.
Thanks
import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
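You need to use the download_timeout meta key of scrapy.Request. Note that it limits how long each individual request may take to download, not the total time spent crawling from a given URL.
To use it for the start URLs, you need to override the spider's start_requests() method, something like: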
from scrapy import Request

# inside your spider class
def start_requests(self):
    # 10 seconds for the first URL
    yield Request(self.start_urls[0], meta={'download_timeout': 10})
    # 60 seconds for the second URL
    yield Request(self.start_urls[1], meta={'download_timeout': 60})
You can read more about Request special meta keys here: http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys
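Since download_timeout only bounds a single download, here is a rough sketch of one way to bound the whole crawl that starts from each URL: keep a deadline per start URL and stop following links once it has passed. StartUrlBudget, its methods, and the 'start_url' meta key mentioned below are hypothetical names made up for this example, not part of Scrapy.

```python
import time


class StartUrlBudget:
    """Track a crawl deadline for each start URL (hypothetical helper)."""

    def __init__(self, seconds, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock      # injectable so the logic can be tested
        self.deadlines = {}     # start URL -> absolute deadline

    def start(self, start_url):
        """Record the deadline the first time start_url is seen."""
        self.deadlines.setdefault(start_url, self.clock() + self.seconds)

    def expired(self, start_url):
        """Return True once the budget for start_url has run out."""
        deadline = self.deadlines.get(start_url)
        return deadline is not None and self.clock() >= deadline
```

In a spider you could call budget.start(url) in start_requests(), pass the originating URL along in meta (for example meta={'start_url': url}, copied onto every followed request), and skip yielding new requests in the callback when budget.expired(response.meta['start_url']) is true. Requests that are already scheduled will still be downloaded, so treat the limit as approximate.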
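You can use the CLOSESPIDER_TIMEOUT setting, which closes the spider once it has been open for the given number of seconds. It applies to the whole spider run, not to each start URL separately.
For example, call your spider (by its name attribute, "dmoz" here) like this:
    scrapy crawl dmoz -s CLOSESPIDER_TIMEOUT=10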
Use a Timeout object!
import signal


class Timeout(object):
    """Timeout class using ALARM signal."""

    class TimeoutError(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)  # disable alarm

    def raise_timeout(self, *args):
        raise Timeout.TimeoutError('TimeoutError')
Then you can call your extractor inside a with statement like this (SIGALRM is only available on Unix and only works in the main thread):
with Timeout(10):  # 10 seconds
    try:
        do_what_you_need_to_do()
    except Timeout.TimeoutError:
        pass  # break, continue or whatever else you may need
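If you want something you can run as-is, here is a variant of the same SIGALRM idea as a sketch. It is not the class above: it uses signal.setitimer so fractional seconds work, and it restores the previous signal handler on exit. time_limit, TimeLimitExceeded and run_with_limit are names I made up for the example. Keep in mind that in Scrapy this would only bound the synchronous code inside a callback, not the asynchronous downloads.

```python
import signal
import time
from contextlib import contextmanager


class TimeLimitExceeded(Exception):
    """Raised when the guarded block runs longer than allowed."""


@contextmanager
def time_limit(seconds):
    """Interrupt the enclosed block after `seconds` (Unix, main thread only)."""
    def on_alarm(signum, frame):
        raise TimeLimitExceeded(f"exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # accepts fractions of a second
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)    # restore the old handler


def run_with_limit(func, seconds):
    """Return (finished, result); finished is False if the limit was hit."""
    try:
        with time_limit(seconds):
            return True, func()
    except TimeLimitExceeded:
        return False, None
```

For example, run_with_limit(lambda: time.sleep(2), 0.2) gives up after roughly 0.2 seconds and reports that the call did not finish.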