
scrapy CrawlSpider: modify deny rules list while crawling

I need to update the deny list in a crawl rule while crawling a website (i.e. I want to dynamically modify the deny rules list while my spider is running).

What I tried is:

deny = ['a','b','c']
rules = (
    Rule(LinkExtractor(allow=('/r/', '/p/'), deny=deny), callback='parse_item', follow=True),
)

and then called self.deny.append(unique_category) in the parse_item() function, but it did not work as I expected: the updated deny list was ignored (the crawler still visited the same category again and again).

I would appreciate any suggestions. Thanks.

There are two ways to do this:

  1. Poke around in the internal APIs of the link extractor (internal attributes such as deny_res aren't guaranteed to survive version changes). The Rule object holds the extractor in its link_extractor attribute, and the extractor keeps its deny patterns as a list of compiled regexes:

     self.rules[0].link_extractor.deny_res.append(re.compile('d'))
  2. Make your own version (or subclass) of CrawlSpider that does exactly what you want.
