
scrapy.Request appears unable to callback a url

I have modified my code to pinpoint where the error occurs. I am using Scrapy; in the first `def parse` I try to yield a request for a URL, and in the next `def` I try to scrape that URL.

However, I can't seem to get scrapy.Request to work; it doesn't scrape the URL.

import scrapy
#from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]
    start_urls = ["https://www.charleschurch.com/sitemap"]

    def parse(self, response):
        # for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
        #     url = urljoin('https://www.charleschurch.com/',href.extract())
        #     yield scrapy.Request(url, callback=self.parse_dir_contents)
        url = 'https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923'
        yield scrapy.Request(url, self.parse_dir_contents)

            
    def parse_dir_contents(self, response):
#    def parse(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()
            item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item

The error I get is:

2020-09-08 22:12:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-08 22:12:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.charleschurch.com/sitemap> (referer: None)
2020-09-08 22:12:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923> (referer: https://www.charleschurch.com/sitemap)
2020-09-08 22:12:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923> (referer: https://www.charleschurch.com/sitemap)
Traceback (most recent call last):
  File "C:\Users\andre\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <200 https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 60, in process_spider_input
    return scrape_func(response, request, spider)
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\core\scraper.py", line 152, in call_spider
    warn_on_generator_with_return_value(spider, callback)
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 218, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 203, in is_generator_with_return_value
    tree = ast.parse(dedent(inspect.getsource(callable)))
  File "C:\Users\andre\Anaconda3\lib\ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    def parse_dir_contents(self, response):
    ^
IndentationError: unexpected indent

It seems that the line yield scrapy.Request(url, self.parse_dir_contents) doesn't work, and I don't know why.

From your log:

    def parse_dir_contents(self, response):
    ^
IndentationError: unexpected indent

You have an indentation error:

    def parse_dir_contents(self, response):
#    def parse(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()

Delete the commented-out line or fix its indentation so it matches the method body.
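To see why this surfaces as an IndentationError rather than a normal syntax error in your editor: per the traceback, Scrapy's warn_on_generator_with_return_value runs ast.parse(dedent(inspect.getsource(callable))) on your callback. Because the comment #    def parse(self, response): sits at column 0 inside the otherwise-indented method, textwrap.dedent finds no common leading whitespace to strip, and ast.parse then chokes on the still-indented def line. A minimal reproduction, without Scrapy (the method names here are just placeholders):

```python
import ast
import textwrap

# Method source as inspect.getsource would return it: indented body,
# but with a stray comment at column 0.
broken = (
    "    def parse_dir_contents(self, response):\n"
    "#    def parse(self, response):\n"
    "        pass\n"
)

# The column-0 comment means the common indent is "", so dedent is a no-op
# and ast.parse fails exactly as in the Scrapy traceback.
try:
    ast.parse(textwrap.dedent(broken))
    raised = False
except IndentationError:
    raised = True
print(raised)  # True

# Indenting the comment to match the body (or deleting it) fixes the parse.
fixed = (
    "    def parse_dir_contents(self, response):\n"
    "    #    def parse(self, response):\n"
    "        pass\n"
)
ast.parse(textwrap.dedent(fixed))  # parses cleanly
print("ok")
```

So in your spider, either remove the #    def parse(self, response): line entirely or indent it to the same level as the for loop below it; the request and callback themselves are fine, as the 200 responses in your log show.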
