I need to update the deny list in a crawl rule while the spider is running (i.e. I want to dynamically modify the deny rules list while my spider is working).
What I tried:

    deny = ['a', 'b', 'c']
    rules = (Rule(LinkExtractor(allow=('/r/', '/p/'), deny=deny), callback='parse_item', follow=True),)
and then I called self.deny.append(unique_category) in my parse_item() function, but it did not work as I expected: the updated deny list was ignored (the crawler still visited the same category again and again). I would appreciate any suggestions. Thanks.
There are two ways to do this:
Poke around in the internal APIs of the link extractor, which isn't guaranteed to survive version changes. The deny patterns are compiled into the extractor's deny_res list when the spider is constructed (which is why appending a string to your original deny list has no effect later), so append a compiled pattern to that list instead:

    self.rules[0].link_extractor.deny_res.append(re.compile('d'))
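To see why this works, note that the link extractor stores its deny rules as a list of compiled regexes and drops a URL when any of them matches. A minimal stdlib-only sketch of that filtering logic (no Scrapy dependency; deny_res and is_denied here are illustrative stand-ins, and the URLs are made up):

```python
import re

# Stand-in for a link extractor's deny list: the deny strings are
# compiled into regexes once, at construction time.
deny_res = [re.compile(p) for p in ['a', 'b', 'c']]

def is_denied(url):
    # A URL is dropped if any deny regex matches it anywhere (search, not fullmatch).
    return any(r.search(url) for r in deny_res)

print(is_denied('http://x.io/1'))    # False: no deny pattern matches yet

# Appending a *compiled* pattern at runtime changes all future filtering:
deny_res.append(re.compile('d'))
print(is_denied('http://x.io/d/1'))  # True: the new pattern now matches
```

This is also why appending a plain string to the original deny list does nothing: the extractor only ever consults the compiled deny_res list, not the list you passed in.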
Make your own version (or subclass) of CrawlSpider that does exactly what you want.
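For the second option, the idea is to keep the deny list mutable and re-check it every time the spider decides whether to follow a link, rather than compiling it once up front. A plain-Python sketch of that design (not a real Scrapy subclass; the class, method names, and URLs are illustrative assumptions):

```python
import re

class DynamicDenySpider:
    """Sketch of a spider that re-checks a mutable deny list
    each time it decides whether to follow a link."""

    def __init__(self, deny=()):
        # Store raw patterns and match on the fly, so runtime appends take effect.
        self.deny = list(deny)

    def should_follow(self, url):
        # Follow only if no deny pattern matches the URL.
        return not any(re.search(pattern, url) for pattern in self.deny)

    def parse_item(self, url, category):
        # Illustrative callback: once a category has been scraped, deny it.
        if category not in self.deny:
            self.deny.append(category)

spider = DynamicDenySpider(deny=['/login'])
print(spider.should_follow('http://x.io/news/1'))  # True: not denied yet
spider.parse_item('http://x.io/news/1', '/news/')
print(spider.should_follow('http://x.io/news/2'))  # False: category now denied
```

In a real Scrapy spider you would apply the same check where links are turned into requests (e.g. in your callback before yielding follow-up requests), which avoids touching any internal link-extractor attributes.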