

2 functions in scrapy spider and the second one not running

I am using scrapy to get the content inside some urls on a page, similar to this question here: Use scrapy to get list of urls, and then scrape content inside those urls

I am able to get the sub-URLs from my start URLs (first def); however, my second def doesn't seem to be running, and the result file is empty. I have tested the content of the second function in the scrapy shell and it gets the info I want, but not when I run the spider.

    import scrapy
    from scrapy.selector import Selector
    #from scrapy import Spider
    from WheelsOnlineScrapper.items import Dealer
    from WheelsOnlineScrapper.url_list import urls
    import logging
    from urlparse import urljoin

    logger = logging.getLogger(__name__)

    class WheelsonlinespiderSpider(scrapy.Spider):
        logger.info('Spider starting')
        name = 'wheelsonlinespider'
        rotate_user_agent = True  # lives in middleware.py and settings.py
        allowed_domains = ["https://wheelsonline.ca"]
        start_urls = urls  # this list is created in url_list.py
        logger.info('URLs retrieved')

        def parse(self, response):
            subURLs = []
            partialURLs = response.css('.directory_name::attr(href)').extract()
            for i in partialURLs:
                subURLs = urljoin('https://wheelsonline.ca/', i)
                yield scrapy.Request(subURLs, callback=self.parse_dealers)
                logger.info('Dealer ' + subURLs + ' fetched')

        def parse_dealers(self, response):
            logger.info('Beginning of page')
            dlr = Dealer()

            # Extracting the content using css selectors
            try:
                dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
            except TypeError:
                dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
            dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
            dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()

            logger.info('Dealer fetched ' + dlr['DealerName'])
            yield dlr
            logger.info('End of page')
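As an aside on the URL-building step above: the spider imports urljoin from urlparse (Python 2); in Python 3 the same function lives in urllib.parse. A minimal sketch of how the join behaves, using hypothetical hrefs (not taken from the actual site):

```python
# Sketch of the URL-joining step in parse(), with made-up href values.
# In Python 3, urljoin is in urllib.parse (urlparse is Python 2 only).
from urllib.parse import urljoin

partial_urls = ["/dealers/example-motors", "dealers/acme-auto"]  # hypothetical
full_urls = [urljoin('https://wheelsonline.ca/', p) for p in partial_urls]
print(full_urls)
# ['https://wheelsonline.ca/dealers/example-motors',
#  'https://wheelsonline.ca/dealers/acme-auto']
```

Both absolute paths (leading slash) and relative paths are resolved against the base URL, so either form of href extracted by the CSS selector should produce a full request URL.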

Your allowed_domains list contains the protocol (https). It should contain only the domain name, as per the documentation:

allowed_domains = ["wheelsonline.ca"]

Also, you should have received a message in your log:

URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains
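If you ever need to derive the bare domain from a full URL programmatically, a small sketch using the standard library (to_allowed_domain is a hypothetical helper, not part of Scrapy's API):

```python
# Derive the bare host from a URL, which is the form allowed_domains
# expects. to_allowed_domain is a hypothetical helper for illustration.
from urllib.parse import urlparse

def to_allowed_domain(url):
    """Return just the host part of a URL, or the string itself
    if it already lacks a scheme."""
    parsed = urlparse(url)
    return parsed.netloc or parsed.path

print(to_allowed_domain("https://wheelsonline.ca"))  # wheelsonline.ca
print(to_allowed_domain("wheelsonline.ca"))          # wheelsonline.ca
```

With the bare domain in allowed_domains, the offsite filtering middleware will no longer drop the requests yielded by parse(), so parse_dealers should start receiving responses.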
