
Scrapy CrawlSpider not crawling: is it the regex?

I'm trying to navigate to each county, and then to each city in each county, starting from here: http://www.accountant-finder.com/CA/California-accountants.html

My code opens the main page listed above, scrapes the title per the parser function, but does not seem to apply the rule to follow the county links (relative paths) starting with "/CA/" (like CA/Alameda/Alameda_county-California-accountants.html).

I've tried modifying the rule with various regexes, to no avail. What am I missing?

import scrapy
from scrapy.spiders import CrawlSpider,Rule
from acctfinder.items import Accountant
from scrapy.linkextractors import LinkExtractor


class AccountantSpider(CrawlSpider):
    name = "Accountant"
    allowed_domains = ["accountant-finder.com"]
    start_urls = ["http://www.accountant-finder.com/CA/California-accountants.html"]
    rules =(Rule(LinkExtractor(allow=('\/CA\/.*',)),callback="parse_item",follow=True),)

    def parse(self,response):
        item = Accountant()
        title = response.xpath('//h1/text()')[0].extract()
        print("title is: "+title)
        item['title'] = title
        return item

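This is a common mistake when using CrawlSpider. As the documentation warns, you shouldn't override the parse method: CrawlSpider uses parse itself to implement the rule logic, so a spider that defines its own parse never follows its rules.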
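The regex itself does not look like the culprit. Here is a quick standard-library check, assuming (as I understand LinkExtractor to work) that relative hrefs are resolved to absolute URLs first and that allow patterns are then applied with a search rather than a full match. The href below is the county link from the question, written with a leading slash as an assumption.

```python
import re
from urllib.parse import urljoin

start_url = "http://www.accountant-finder.com/CA/California-accountants.html"
# County link quoted in the question; the leading slash is an assumption.
href = "/CA/Alameda/Alameda_county-California-accountants.html"

# Resolve the relative link against the page it was found on.
absolute_url = urljoin(start_url, href)

# Same pattern as '\/CA\/.*' in the question; "/" needs no escaping in Python.
pattern = re.compile(r"/CA/.*")

# search() looks for the pattern anywhere in the URL, not only at the start.
matched = pattern.search(absolute_url) is not None
print(absolute_url, matched)
```

Since the pattern finds a match in the absolute URL, the rule would accept the county links once the rules are actually being applied.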

Also, your rule names parse_item as the callback for every followed link, but the spider defines no such method. Rename parse to parse_item and it should start working.
