

2 functions in scrapy spider and the second one not running

I am using scrapy to get the content inside some urls on a page, similar to this question here: Use scrapy to get list of urls, and then scrape content inside those urls

I am able to get the sub-URLs from my start URLs (first def); however, my second def doesn't seem to be running, and the result file is empty. I have tested the content of the second function in the scrapy shell and it gets the info I want, but not when I run the spider.

    import scrapy
    from scrapy.selector import Selector
    #from scrapy import Spider
    from WheelsOnlineScrapper.items import Dealer
    from WheelsOnlineScrapper.url_list import urls
    import logging
    from urlparse import urljoin

    logger = logging.getLogger(__name__)

    class WheelsonlinespiderSpider(scrapy.Spider):
        logger.info('Spider starting')
        name = 'wheelsonlinespider'
        rotate_user_agent = True  # lives in middleware.py and settings.py
        allowed_domains = ["https://wheelsonline.ca"]
        start_urls = urls  # this list is created in url_list.py
        logger.info('URLs retrieved')

        def parse(self, response):
            subURLs = []
            partialURLs = response.css('.directory_name::attr(href)').extract()
            for i in partialURLs:
                subURLs = urljoin('https://wheelsonline.ca/', i)
                yield scrapy.Request(subURLs, callback=self.parse_dealers)
                logger.info('Dealer ' + subURLs + ' fetched')

        def parse_dealers(self, response):
            logger.info('Beginning of page')
            dlr = Dealer()

            # Extracting the content using css selectors
            try:
                dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
            except TypeError:
                dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
            dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
            dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()

            logger.info('Dealer fetched ' + dlr['DealerName'])
            yield dlr
            logger.info('End of page')
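As an aside on the URL-building step above: the spider imports urljoin from urlparse (Python 2); in Python 3 the same function lives in urllib.parse. A minimal sketch of how the join behaves, using hypothetical hrefs (not taken from the actual site):

```python
# Sketch of the URL-joining step in parse(), with made-up href values.
# In Python 3, urljoin is in urllib.parse (urlparse is Python 2 only).
from urllib.parse import urljoin

partial_urls = ["/dealers/example-motors", "dealers/acme-auto"]  # hypothetical
full_urls = [urljoin('https://wheelsonline.ca/', p) for p in partial_urls]
print(full_urls)
# ['https://wheelsonline.ca/dealers/example-motors',
#  'https://wheelsonline.ca/dealers/acme-auto']
```

Both absolute paths (leading slash) and relative paths are resolved against the base URL, so either form of href extracted by the CSS selector should produce a full request URL.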

Your allowed_domains list contains the protocol (https). It should contain only the domain name, as per the documentation:

allowed_domains = ["wheelsonline.ca"]

Also, you should have received a message in your log:

URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains
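If you ever need to derive the bare domain from a full URL programmatically, a small sketch using the standard library (to_allowed_domain is a hypothetical helper, not part of Scrapy's API):

```python
# Derive the bare host from a URL, which is the form allowed_domains
# expects. to_allowed_domain is a hypothetical helper for illustration.
from urllib.parse import urlparse

def to_allowed_domain(url):
    """Return just the host part of a URL, or the string itself
    if it already lacks a scheme."""
    parsed = urlparse(url)
    return parsed.netloc or parsed.path

print(to_allowed_domain("https://wheelsonline.ca"))  # wheelsonline.ca
print(to_allowed_domain("wheelsonline.ca"))          # wheelsonline.ca
```

With the bare domain in allowed_domains, the offsite filtering middleware will no longer drop the requests yielded by parse(), so parse_dealers should start receiving responses.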
