Issues on following links in scrapy

Question

I want to crawl a blog that lists websites under several categories. Starting from the first category page, my goal is to collect every website by following the category links. I have collected the websites from the first category, but the spider stops there and never reaches the second category.

My code:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem

class my_spider(CrawlSpider):
    name = 'heart'
    allowed_domains = ['greek-sites.gr']
    start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)


    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
        for category in categories:
            item = DmozItem()
            item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract() 
            item['category'] = response.xpath('//div/strong/text()').extract() 
        return item

Answer 1

The problem is simple: the callback must be named something other than parse. I suggest renaming your method to, for example, parse_site; after that the spider can continue scraping.

If you make the change below it will work:

rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)

def parse_site(self, response):
    # body unchanged from the original parse method

The reason for this is described in the docs:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
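
To make the name clash concrete, here is a small pure-Python sketch of the pattern. It is not Scrapy's real implementation: MiniCrawlSpider, BrokenSpider, FixedSpider, the (prefix, callback name) rule format and the page dictionary are all hypothetical stand-ins, chosen only to show what happens when a subclass overrides the method the base class relies on.

```python
# Simplified illustration only; none of these classes exist in Scrapy.

class MiniCrawlSpider:
    """Hypothetical base class whose parse() drives the link following."""

    rules = ()  # sequence of (link_prefix, callback_name) pairs

    def parse(self, page):
        # Base-class logic: run each rule's callback on the page, then
        # list the links that match the rule and would be followed next.
        output = []
        for prefix, callback_name in self.rules:
            callback = getattr(self, callback_name)
            output.append(("item", callback(page)))
            output.extend(
                ("follow", link)
                for link in page["links"]
                if link.startswith(prefix)
            )
        return output


class BrokenSpider(MiniCrawlSpider):
    rules = (("/categories/", "parse"),)

    def parse(self, page):
        # This replaces the base-class parse(), so the rules never run.
        return {"title": page["title"]}


class FixedSpider(MiniCrawlSpider):
    rules = (("/categories/", "parse_site"),)

    def parse_site(self, page):
        # A differently named callback leaves the base-class parse() intact.
        return {"title": page["title"]}


page = {"title": "Sports", "links": ["/categories/news", "/about"]}

broken_output = BrokenSpider().parse(page)  # only the item, no links followed
fixed_output = FixedSpider().parse(page)    # the item plus the matching link

print(broken_output)
print(fixed_output)
```

In the broken variant the "framework" call to parse lands in the subclass method, so no links are ever queued. That matches the symptom in the question: the first page is scraped and the crawl stops.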
