简体   繁体   中英

Issues on following links in scrapy

I want to crawl a blog which has several categories of websites . Starting navigating the page from the first category, my goal is to collect every webpage by following the categories . I have collected the websites from the 1st category but the spider stops there , can't reach the 2nd category .

An example draft :

草稿范例

my code :

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem

    class my_spider(CrawlSpider):
    name = 'heart'
    allowed_domains = ['greek-sites.gr']
    start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)


    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
        for category in categories:
            item = DmozItem()
            item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract() 
            item['category'] = response.xpath('//div/strong/text()').extract() 
        return item

The problem is simple: the callback has to be different than parse , so I suggest you name your method parse_site for example and then you are ready to continue your scraping.

If you make the change below it will work:

rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)

def parse_site(self, response):

The reason for this is described in the docs :

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM