Issues on following links in Scrapy
I want to crawl a blog that lists websites in several categories. Starting from the first category page, my goal is to collect every web page by following the category links. I have collected the websites from the first category, but the spider stops there and never reaches the second category.

My code:
    import scrapy
    from scrapy.contrib.spiders import Rule, CrawlSpider
    from scrapy.contrib.linkextractors import LinkExtractor
    from final.items import DmozItem

    class my_spider(CrawlSpider):
        name = 'heart'
        allowed_domains = ['greek-sites.gr']
        start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']

        rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)

        def parse(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
            for category in categories:
                item = DmozItem()
                item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
                item['category'] = response.xpath('//div/strong/text()').extract()
            return item

The problem is simple: the callback has to be something other than parse. I suggest you rename your method to parse_site, for example, and then you are ready to continue your scraping.

If you make the change below, it will work:

        rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)

        def parse_site(self, response):

The reason for this is described in the docs:

"When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."
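
To make the docs' warning concrete, here is a rough pure-Python model of the situation. It is only a sketch and not Scrapy's actual implementation: ToyCrawlSpider, Request, BrokenSpider, FixedSpider and the SITE dictionary are hypothetical names invented for this illustration, and the "rules" are plain substring matches rather than real LinkExtractor objects. The one assumption it encodes is the one from the docs: the base class relies on its own parse method to apply the rules to the start page.

```python
from collections import deque


class Request:
    """A link to visit, plus the name of the method that should handle it."""

    def __init__(self, url, callback_name):
        self.url = url
        self.callback_name = callback_name


class ToyCrawlSpider:
    """Toy stand-in for CrawlSpider; an illustration, not Scrapy's real code."""

    # Each rule: (substring a link must contain, name of the callback method).
    rules = ()

    def __init__(self, site):
        self.site = site  # maps a URL to the list of links on that page

    def parse(self, url):
        # Built-in entry point: no user callback, only rule-based link following.
        return self._handle(url, None)

    def _handle(self, url, callback_name):
        # Run the user's callback (if any), then queue links matched by the rules.
        results = list(getattr(self, callback_name)(url)) if callback_name else []
        for pattern, rule_callback in self.rules:
            for link in self.site.get(url, []):
                if pattern in link:
                    results.append(Request(link, rule_callback))
        return results

    def crawl(self, start_url):
        items, seen = [], {start_url}
        queue = deque(self.parse(start_url))  # the start page always goes to parse()
        while queue:
            result = queue.popleft()
            if not isinstance(result, Request):
                items.append(result)  # anything that is not a Request is an item
            elif result.url not in seen:
                seen.add(result.url)
                queue.extend(self._handle(result.url, result.callback_name))
        return items


class BrokenSpider(ToyCrawlSpider):
    rules = (("categories/", "parse"),)

    def parse(self, url):
        # Replaces the built-in parse(), so the rules are never applied.
        return [{"url": url}]


class FixedSpider(ToyCrawlSpider):
    rules = (("categories/", "parse_site"),)

    def parse_site(self, url):
        # Separate callback; the inherited parse() keeps following links.
        return [{"url": url}]


# Hypothetical three-page site used only for this demonstration.
SITE = {
    "/categories/sports": ["/categories/news", "/about"],
    "/categories/news": ["/categories/travel", "/categories/sports"],
    "/categories/travel": [],
}

broken_items = BrokenSpider(SITE).crawl("/categories/sports")
fixed_items = FixedSpider(SITE).crawl("/categories/sports")
print(len(broken_items), "item(s) from BrokenSpider")
print(len(fixed_items), "item(s) from FixedSpider")
```

In this model BrokenSpider scrapes the start page and then stops, which matches the behaviour described in the question, while FixedSpider goes on to the other category pages. Note that in the toy, as far as I understand the real CrawlSpider too, the start page itself is not passed to parse_site; only the pages reached through the rules are.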