
Stop Scrapy crawling the same URLs

I've written a basic Scrapy spider to crawl a website. It seems to run fine, except that it doesn't want to stop: it keeps revisiting the same URLs and returning the same content, and I always end up having to kill it manually. I suspect it's going over the same URLs over and over again. Is there a rule that will stop this, or is there something else I have to do, maybe middleware?

The spider is as follows:

# Imports for the pre-1.0 Scrapy API this spider uses; the LsbuItem import
# path is an assumption based on the project layout in settings.py.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import Join
from scrapy.selector import Selector

from lsbu.items import LsbuItem


class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            # query the selector, not the Item (Items have no .xpath())
            scraped_bit['title'] = bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)

        return scraped_bits

My settings.py file looks like this:

BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'

Any help/guidance/instruction on stopping it from running continuously would be greatly appreciated...

As I'm a newbie to this, any comments on tidying the code up would also be helpful (or links to good instruction).

Thanks...

The dupefilter is enabled by default (http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class), and it's based on the request URL.
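To illustrate, here's a minimal sketch of the idea (the real filter lives in scrapy.dupefilters.RFPDupeFilter; the seen set and request_seen helper below are simplified stand-ins): the fingerprint is derived from the request's method, canonicalized URL and body, so two requests for the same URL count as duplicates.

from scrapy import Request
from scrapy.utils.request import request_fingerprint

seen = set()

def request_seen(request):
    # Return True if an equivalent request was already scheduled.
    fp = request_fingerprint(request)  # hash over method + canonical URL + body
    if fp in seen:
        return True
    seen.add(fp)
    return False

r1 = Request("http://www.lsbu.ac.uk/business-and-partners/business")
r2 = Request("http://www.lsbu.ac.uk/business-and-partners/business")
print(request_seen(r1))  # False - first time this URL is seen
print(request_seen(r2))  # True  - duplicate, would be dropped

Note that requests created with dont_filter=True bypass the filter, but the requests a CrawlSpider generates from its rules don't set that.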

I tried a simplified version of your spider in a new vanilla Scrapy project without any custom configuration. The dupefilter worked, and the crawl stopped after a few requests. I'd say something is wrong in your settings or your Scrapy version. I'd suggest you upgrade to Scrapy 1.0, just to be sure :)

$ pip install scrapy --pre
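One thing to double-check after upgrading: Scrapy 1.0 renamed the scrapy.dupefilter module to scrapy.dupefilters, so if you keep the DUPEFILTER_CLASS line in settings.py at all (it's the default anyway), it needs the new path:

# settings.py - explicit, but identical to the Scrapy 1.0 default
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = True  # log every duplicate request that gets filtered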

The simplified spider I tested:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class LsbuItem(Item):
    title = Field()
    url = Field()

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]

    start_urls = [
        "http://www.lsbu.ac.uk"
    ]    

    rules = [
        Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]    

    def parse_item(self, response):
        scraped_bit = LsbuItem()
        scraped_bit['url'] = response.url
        yield scraped_bit
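I ran it with runspider, which doesn't need a project; the file name here is just whatever you saved the spider as, and the -s flag lets you switch on dupefilter debug logging from the command line:

$ scrapy runspider lsbu_spider.py -s DUPEFILTER_DEBUG=True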

Your design makes the crawl go in circles. For example, the page http://www.lsbu.ac.uk/business-and-partners/business contains a link to http://www.lsbu.ac.uk/business-and-partners/partners, which in turn links back to the first page. So the crawl goes round in circles indefinitely.

To overcome this, you need to create better rules that eliminate the circular references. Also, you have two identical rules defined, which is not needed: if you want follow=True, you can put it on the same rule; you don't need a second one, as sketched below.
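For example, something like this (a sketch only — the deny pattern is an illustration, adjust it to whichever sub-sections actually loop):

rules = [
    Rule(
        LinkExtractor(
            allow=[r'lsbu\.ac\.uk/business-and-partners/.+'],
            # illustrative: cut out the section that links back in a circle
            deny=[r'/business-and-partners/partners'],
        ),
        callback='parse_item',
        follow=True,  # one rule both parses and follows matching links
    ),
]

With follow=True on that single rule, the dupefilter stops each URL from being fetched twice, so the crawl terminates once the section is exhausted.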
