Scrapy CrawlSpider not crawling: is it the regex?

Question

I'm trying to navigate to each county, and then to each city in each county, starting from here: http://www.accountant-finder.com/CA/California-accountants.html

My code opens the main page listed above, scrapes the title per the parser function, but does not seem to apply the rule to follow the county links (relative paths) starting with "/CA/" (like CA/Alameda/Alameda_county-California-accountants.html).

I've tried modifying the rule with various regexes, to no avail. What am I missing?

import scrapy
from scrapy.spiders import CrawlSpider,Rule
from acctfinder.items import Accountant
from scrapy.linkextractors import LinkExtractor


class AccountantSpider(CrawlSpider):
    name = "Accountant"
    allowed_domains = ["accountant-finder.com"]
    start_urls = ["http://www.accountant-finder.com/CA/California-accountants.html"]
    rules =(Rule(LinkExtractor(allow=('\/CA\/.*',)),callback="parse_item",follow=True),)

    def parse(self,response):
        item = Accountant()
        title = response.xpath('//h1/text()')[0].extract()
        print("title is: "+title)
        item['title'] = title
        return item

Answer 1

This is a common mistake when using CrawlSpider. The documentation warns that you shouldn't override the parse method: CrawlSpider uses parse itself to implement its rule logic, so if you define your own parse, the rules are never applied.
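The regex itself is unlikely to be the culprit. A quick standard-library check, assuming (as I understand LinkExtractor's behaviour) that relative hrefs are resolved against the page URL and each `allow` pattern is then applied as a regex search on the absolute URL. The leading slash on `county_href` is also an assumption based on the question's description:

```python
import re
from urllib.parse import urljoin

start_url = "http://www.accountant-finder.com/CA/California-accountants.html"
# Example county link from the question; the leading slash is assumed.
county_href = "/CA/Alameda/Alameda_county-California-accountants.html"

# Resolve the relative href against the page it was found on.
absolute_url = urljoin(start_url, county_href)

# Same pattern as in the rule, written as a raw string so the
# backslashes reach the regex engine without escape warnings.
pattern = re.compile(r"\/CA\/.*")
matches = pattern.search(absolute_url) is not None

print(absolute_url)
print("pattern matches:", matches)
```

Since the pattern matches, the links would be followed if the rules were being applied at all.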

Also, your rule says that each followed page should be processed by the parse_item callback, but the spider never defines one. Just rename the parse method to parse_item and it should start working.

solution1
2 2019-12-09 03:25:20