
2 functions in scrapy spider and the second one not running

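I am using scrapy to get the content inside some URLs on a page, similar to the question here: Use scrapy to get list of urls, and then scrape content inside those urls

I can get the sub-URLs from the start URLs (the first def), but the second def does not seem to be called, and the result file is empty. I have tested the body of that function in the scrapy shell and it returns the information I want, but it does not when I run the spider.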

    import scrapy
    from scrapy.selector import Selector
    #from scrapy import Spider
    from WheelsOnlineScrapper.items import Dealer
    from WheelsOnlineScrapper.url_list import urls
    import logging
    from urlparse import urljoin

    logger = logging.getLogger(__name__)

    class WheelsonlinespiderSpider(scrapy.Spider):
        logger.info('Spider starting')
        name = 'wheelsonlinespider'
        rotate_user_agent = True  # lives in middleware.py and settings.py
        allowed_domains = ["https://wheelsonline.ca"]
        start_urls = urls  # this list is created in url_list.py
        logger.info('URLs retrieved')

        def parse(self, response):
            subURLs = []
            partialURLs = response.css('.directory_name::attr(href)').extract()
            for i in partialURLs:
                subURLs = urljoin('https://wheelsonline.ca/', i)
                yield scrapy.Request(subURLs, callback=self.parse_dealers)
                logger.info('Dealer ' + subURLs + ' fetched')

        def parse_dealers(self, response):
            logger.info('Beginning of page')
            dlr = Dealer()
            #Extracting the content using css selectors
            try:
                dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
            except TypeError:
                dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
            dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
            dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()
            logger.info('Dealer fetched ' + dlr['DealerName'])
            yield dlr
            logger.info('End of page')

Your allowed_domains list contains the protocol (https). According to the documentation, it should contain only the domain name:

allowed_domains = ["wheelsonline.ca"]
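If the domain list is built from full URLs (for example, from the same list that feeds start_urls), the scheme can be stripped programmatically instead of by hand. Here is a minimal sketch using only the standard library. It assumes Python 3 (the question's code imports the Python 2 urlparse module), and to_domain is a hypothetical helper name, not part of Scrapy:

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Return the bare host name for a URL or a plain domain (hypothetical helper)."""
    # urlparse only recognises the host part when a "//" prefix is present,
    # so plain domains such as "wheelsonline.ca" get one added first.
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    return parsed.hostname


allowed_domains = [to_domain("https://wheelsonline.ca")]
print(allowed_domains)  # ['wheelsonline.ca']
```

The helper leaves entries that are already plain domains unchanged, so it should be safe to apply to a mixed list.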

Also, you should have received a message in the log:

URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains.
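This also explains why parse runs while parse_dealers never does. As far as I understand it, requests generated from start_urls are sent with dont_filter=True and so skip the offsite check, whereas the requests yielded from parse are compared against allowed_domains and dropped when the host does not match. The sketch below is a simplified stand-in for that comparison, not Scrapy's actual code. is_onsite and the dealer URL are made up for illustration:

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Simplified stand-in for an offsite check (not Scrapy's actual code)."""
    host = urlparse(url).hostname or ""
    # A host passes if it equals an allowed domain or is a subdomain of one.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


dealer_url = "https://wheelsonline.ca/some-dealer"  # hypothetical dealer page

print(is_onsite(dealer_url, ["https://wheelsonline.ca"]))  # False
print(is_onsite(dealer_url, ["wheelsonline.ca"]))          # True
```

With the scheme included, the entry can never equal a bare host name, so every dealer request looks offsite and is filtered before its callback is scheduled.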
