
scrapy.Request appears unable to call back a URL

I have modified my code to narrow down where the error arises. I am using Scrapy: in the first `def parse` I request a URL, and in the next method I try to crawl that URL.

But I can't seem to make `scrapy.Request` work; it won't crawl the URL.

import scrapy
#from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]
    start_urls = ["https://www.charleschurch.com/sitemap"]

    def parse(self, response):
        # for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
        #     url = urljoin('https://www.charleschurch.com/',href.extract())
        #     yield scrapy.Request(url, callback=self.parse_dir_contents)
        url = 'https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923'
        yield scrapy.Request(url, self.parse_dir_contents)

            
    def parse_dir_contents(self, response):
#    def parse(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()
            item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item

The error I get is:

2020-09-08 22:12:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-08 22:12:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.charleschurch.com/sitemap> (referer: None)
2020-09-08 22:12:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923> (referer: https://www.charleschurch.com/sitemap)
2020-09-08 22:12:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923> (referer: https://www.charleschurch.com/sitemap)
Traceback (most recent call last):
  File "C:\Users\andre\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <200 https://www.charleschurch.com/north-yorkshire_harrogate/kingsley-park-10923>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 60, in process_spider_input
    return scrape_func(response, request, spider)
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\core\scraper.py", line 152, in call_spider
    warn_on_generator_with_return_value(spider, callback)
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 218, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
  File "C:\Users\andre\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 203, in is_generator_with_return_value
    tree = ast.parse(dedent(inspect.getsource(callable)))
  File "C:\Users\andre\Anaconda3\lib\ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    def parse_dir_contents(self, response):
    ^
IndentationError: unexpected indent

It seems that `yield scrapy.Request(url, self.parse_dir_contents)` is not working, and I am not sure why.

From your logs:

    def parse_dir_contents(self, response):
    ^
IndentationError: unexpected indent

You have an indentation error:

    def parse_dir_contents(self, response):
#    def parse(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()

Remove the commented-out line or fix its indentation.
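The traceback shows why that stray comment breaks the spider: before invoking your callback, Scrapy's `warn_on_generator_with_return_value` re-parses the callback's source with `ast.parse(dedent(inspect.getsource(callable)))`. `textwrap.dedent` only strips whitespace common to every line, and the flush-left comment has no leading whitespace, so the indented `def` line is left as-is and `ast.parse` raises `IndentationError`. A minimal sketch of the failure (the string below mimics what `inspect.getsource` returns for your method; it is a simulation, not Scrapy's actual code path):

```python
import ast
from textwrap import dedent

# Source as inspect.getsource() would return it for the method in the
# question: the def is indented 4 spaces, but the commented-out line
# sits at column 0, so dedent() finds no common prefix to strip.
src = (
    "    def parse_dir_contents(self, response):\n"
    "#    def parse(self, response):\n"
    "        pass\n"
)

try:
    ast.parse(dedent(src))
except IndentationError as err:
    # Same failure Scrapy's warn helper hits in the traceback above
    print("IndentationError:", err)
```

Indenting the commented line to the same level as the `def` (or deleting it) gives `dedent` a common prefix to strip, and the parse succeeds.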
