
python/scrapy question: How to avoid endless loops

I am using the web-scraping framework Scrapy to mine data from some sites. I am trying to use CrawlSpider, and the pages have 'back' and 'next' buttons. The URLs are in the format

www.qwerty.com/###

where ### is a number that increments every time the next button is pressed. How do I format the rules so that an endless loop doesn't occur?

Here is my rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
        follow=True,
        callback='parse_item',
    ),
)

An endless loop shouldn't happen: Scrapy filters out duplicate URLs by default.
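As a minimal sketch of that behaviour (the spider name and URL are placeholders, not from the original question): the scheduler's default RFPDupeFilter drops any request whose URL it has already seen, unless you explicitly pass dont_filter=True.

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    def parse(self, response):
        # Re-yielding an already-crawled URL is dropped by the default
        # RFPDupeFilter, so this request is scheduled at most once and
        # the spider does not recurse forever.
        yield scrapy.Request(response.url, callback=self.parse)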

What makes you think the program will go into an infinite loop? How have you tested it? Scrapy won't download a URL it has already crawled. Did you try going through all the pages? What happens when you click next on the last page?

You can get into an infinite loop if the site generates a new number every time the next link is pressed. That would be broken site code, but you can put a limit on the maximum number of pages in your own code to avoid crawling indefinitely, as in the sketch below.
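For instance, a sketch of such a cap (the page-number parsing and the limit of 500 are assumptions for illustration, not part of the original question):

MAX_PAGE = 500  # assumed cap; pick a value above the site's real page count

def parse_item(self, response):
    # URL looks like http://.../trunk-framework/791 - the last path
    # segment is the page number.
    page = int(response.url.rstrip('/').rsplit('/', 1)[-1])
    if page >= MAX_PAGE:
        return  # stop here instead of following 'next' forever
    # ... extract items and follow the next link as usual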

You can set a limit on the number of links to follow with the DEPTH_LIMIT setting.
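For example, in settings.py (the value 100 is arbitrary):

# settings.py - stop scheduling requests more than 100 links
# away from the start URLs (the default of 0 means unlimited depth)
DEPTH_LIMIT = 100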

Alternatively, you can check the current depth in a parse callback function:

def parse(self, response):
    # 'depth' is set on response.meta by Scrapy's DepthMiddleware
    if response.meta['depth'] > 100:
        print('Loop?')  # log it and return without yielding further requests
        return
