
Issues on following links in scrapy

I want to crawl a blog that lists several categories of websites. Starting from the first category page, my goal is to collect every webpage by following the category links. I have collected the websites from the first category, but the spider stops there and can't reach the second category.

An example draft:

[image: example draft of the category structure]

My code:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem

class my_spider(CrawlSpider):
    name = 'heart'
    allowed_domains = ['greek-sites.gr']
    start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)

    def parse(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
        for category in categories:
            item = DmozItem()
            item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
            item['category'] = response.xpath('//div/strong/text()').extract()
        return item

The problem is simple: the callback has to be different from parse, so I suggest you name your method parse_site, for example, and then you are ready to continue your scraping.

If you make the change below, it will work:

rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)

def parse_site(self, response):

The reason for this is described in the docs:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
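
Putting the accepted change together with the original spider, a minimal sketch could look like the following. The final.items.DmozItem import and its title/category fields come from the question's project and are assumed to exist; the extraction XPaths are kept from the question, with the item simply yielded once per crawled page:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem


class my_spider(CrawlSpider):
    name = 'heart'
    allowed_domains = ['greek-sites.gr']
    start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

    # Follow every link whose URL contains "categories/" and hand the response
    # to parse_site, so CrawlSpider's built-in parse method stays untouched.
    rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )),
                  callback='parse_site', follow=True),)

    def parse_site(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # Same extraction as in the question, yielded once per page.
        item = DmozItem()
        item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
        item['category'] = response.xpath('//div/strong/text()').extract()
        yield item

The scrapy.contrib imports match the question's Scrapy version; in Scrapy 1.0 and later the same classes are importable from scrapy.spiders and scrapy.linkextractors instead.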
