I want to crawl a blog which has several categories of websites . Starting navigating the page from the first category, my goal is to collect every webpage by following the categories . I have collected the websites from the 1st category but the spider stops there , can't reach the 2nd category .
An example draft :
my code :
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem
class my_spider(CrawlSpider):
name = 'heart'
allowed_domains = ['greek-sites.gr']
start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']
rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)
def parse(self, response):
self.logger.info('Hi, this is an item page! %s', response.url)
categories = response.xpath('//a[contains(@href, "categories")]/text()').extract()
for category in categories:
item = DmozItem()
item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
item['category'] = response.xpath('//div/strong/text()').extract()
return item
The problem is simple: the callback
has to be different than parse
, so I suggest you name your method parse_site
for example and then you are ready to continue your scraping.
If you make the change below it will work:
rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)
def parse_site(self, response):
The reason for this is described in the docs :
When writing crawl spider rules, avoid using
parse
as callback, since theCrawlSpider
uses theparse
method itself to implement its logic. So if you override theparse
method, the crawl spider will no longer work.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.