
Scrapy - Change rules after spider starts crawling

My question is about the CrawlSpider.

I understand that the link extractor rules attribute is a static (class-level) variable.

Can I change the rules at runtime, say, like:

@classmethod
def set_rules(cls, rules):
    cls.rules = rules

by calling

self.set_rules(rules)

Is this an acceptable practice for the CrawlSpider? If not, please suggest an appropriate method.

My use case:

I'm using Scrapy to crawl certain categories A, B, C ... Z of a particular website. Each category has 1000 links spread over 10 pages,

and when Scrapy hits a link in some category that is "too old", I'd like the crawler to stop following/crawling the remainder of the 10 pages ONLY for that category, hence my requirement for dynamic rule changes.

Please point me in the right direction.

Thanks!

I would suggest that you write your own custom downloader middleware. This would allow you to filter out those requests that you no longer want to make.

You can find further details about Scrapy's architecture here: http://doc.scrapy.org/en/master/topics/architecture.html

And about downloader middleware and how to write your custom one: http://doc.scrapy.org/en/master/topics/downloader-middleware.html
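
For illustration, a minimal sketch of such a middleware might look like the following (the class name, the skip_url_fragments spider attribute and the matching logic are assumptions for this example; only process_request and IgnoreRequest come from Scrapy itself):

# middlewares.py -- minimal sketch, names are placeholders
from scrapy.exceptions import IgnoreRequest


class DropUnwantedRequestsMiddleware:
    def process_request(self, request, spider):
        # Returning None lets the request continue down the middleware chain;
        # raising IgnoreRequest drops it before it is ever downloaded.
        for fragment in getattr(spider, "skip_url_fragments", []):
            if fragment in request.url:
                raise IgnoreRequest("dropping %s" % request.url)
        return None

It would then be enabled in settings.py, for example (module path is hypothetical):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DropUnwantedRequestsMiddleware": 543,
}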

The rules in a spider aren't meant to be changed dynamically. They are compiled when the CrawlSpider is instantiated. You could always change your spider.rules and re-run spider._compile_rules(), but I advise against it.
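
For completeness, that discouraged approach would look roughly like this (new_rules is just an illustrative example, and _compile_rules() is a private method that may change between Scrapy versions):

# Discouraged: mutating the rules on a running CrawlSpider instance
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

new_rules = (
    Rule(LinkExtractor(allow=r"/category-a/"), callback="parse_item", follow=True),
)
spider.rules = new_rules      # replaces the class-level rules on this instance
spider._compile_rules()       # rebuilds the internal rule list from spider.rules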

The rules create a set of instructions for the crawler on what to queue up to crawl (i.e., it queues Requests). These requests aren't revisited and re-evaluated before they are dispatched, as the rules weren't "designed" to change. So even if you did change the rules dynamically, you may still end up making a bunch of requests you didn't intend to, and still crawl a bunch of content you didn't mean to.

For instance, if your target site is set up so that the page for "Category A" contains links to pages 1 to 10 of "Category A", then Scrapy will queue up requests for all 10 of these pages. If page 2 turns out to have entries that are "too old", changing the rules will do nothing, because requests for pages 3-10 are already queued to go.

As @imx51 said, it would be much better to write a downloader middleware. It would be able to drop each request that you no longer want to make, since it triggers for every request going through it before the request is downloaded.
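
Applied to the use case in the question, one possible wiring (all names here are hypothetical) is to have the spider mark a category as exhausted in its callback, and the middleware drop any still-queued request belonging to that category:

# Sketch only: assumes category page URLs look like "/<category>/page-<n>"
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class SkipExhaustedCategoriesMiddleware:
    def process_request(self, request, spider):
        # Treat the first path segment as the category name (assumption).
        category = urlparse(request.url).path.strip("/").split("/")[0]
        if category in getattr(spider, "exhausted_categories", set()):
            raise IgnoreRequest("'%s' already marked as too old" % category)
        return None

In the spider's parse callback, as soon as an entry older than your cutoff shows up, you would add the category to self.exhausted_categories; requests for the remaining pages of that category are then dropped here, even though they were already queued when the rules fired.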
