I'm trying to navigate to each county, and then to each city in each county, starting from here: http://www.accountant-finder.com/CA/California-accountants.html
My code opens the main page listed above and scrapes the title per the parser function, but it does not seem to apply the rule to follow the county links (relative paths) starting with "/CA/" (like CA/Alameda/Alameda_county-California-accountants.html).
I've tried modifying the rule using various regexes, to no avail. What am I missing?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from acctfinder.items import Accountant
from scrapy.linkextractors import LinkExtractor


class AccountantSpider(CrawlSpider):
    name = "Accountant"
    allowed_domains = ["accountant-finder.com"]
    start_urls = ["http://www.accountant-finder.com/CA/California-accountants.html"]
    rules = (Rule(LinkExtractor(allow=('\/CA\/.*',)), callback="parse_item", follow=True),)

    def parse(self, response):
        item = Accountant()
        title = response.xpath('//h1/text()')[0].extract()
        print("title is: " + title)
        item['title'] = title
        return item
This is a common mistake when using CrawlSpider. The documentation specifies that you shouldn't override the parse method: CrawlSpider uses parse itself to implement its rule logic, so once you override it the rules are never applied.
Another thing about your spider: the rule specifies that each matched page should be processed by the parse_item method, which your spider doesn't define. Just rename the parse method to parse_item and it should start working.