Scrapy: XPath works in the shell but not in code

Question

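I'm trying to crawl a website (I have their authorization). My code returns what I want in the Scrapy shell, but my spider gets nothing.

I also checked all the previous questions similar to this one, without success; for example, the website does not use JavaScript on its home page to load the elements I need.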

import scrapy


class MySpider(scrapy.Spider):
    name = 'MySpider'

    start_urls = [  # WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')

        # This works in scrapy shell, but returns an empty list in the spider
        risultati = response.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # risultati = response.css('li a>href').extract()
        # risultati = response.xpath('//*[@id="nav"]/ol/li[1]/a').extract()
        print(risultati)

        print('NEXT PAGE')
        # extract() returns a list, so issue one request per link.
        # dont_filter=True keeps the duplicate filter from dropping the request.
        for next_page in risultati:
            yield scrapy.Request(url=next_page, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
        # Placeholder: a real callback should yield items or requests, not an int
        return 1

The website I want to scrape is https://shop.app4health.it/

The XPath command I use is this:

response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()

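I know there are some problems with the prodotti function etc., but that is not the point. I want to understand why the XPath selector works in the Scrapy shell (I get exactly the links I need), but when I run it in my spider I always get an empty list.

If it helps, when I use CSS selectors in my spider it works fine and finds the elements, but I want to use XPath (I need it for the future development of my application).

Thanks for the help :)

EDIT: I tried printing the body of the first response (from start_urls) and it is correct; I get the page I want. When I use selectors in my code (even the suggested ones), they all work fine in the shell, but in my code I get nothing!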

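EDIT 2: I have since gained more experience with Scrapy and web crawling, and I realized that the HTML page you get in your browser can sometimes differ from the one you get through a Scrapy request! In my experience, some websites respond with different HTML than what you see in the browser. That is why a "correct" XPath/CSS query taken from the browser may return nothing when used in your Scrapy code. Always check that the content of your response is what you expect!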
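To illustrate the point about the same query returning nothing on different HTML, here is a minimal sketch. It does not use Scrapy: it swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and both HTML snippets are made-up stand-ins (only the two shop URLs come from this thread). The helper name nav_links is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippets standing in for two different responses.
SHOP_HTML = """<html><body>
<div id="nav"><ol>
  <li><a href="https://shop.app4health.it/sonno">Sonno</a></li>
  <li><a href="https://shop.app4health.it/terapia">Terapia</a></li>
</ol></div>
</body></html>"""

OTHER_HTML = """<html><body>
<div id="menu"><ul>
  <li><a href="/about">About</a></li>
</ul></div>
</body></html>"""


def nav_links(html):
    """Return the href of every link under the element with id="nav"."""
    root = ET.fromstring(html)
    # ElementTree cannot select attributes with /@href, so read them afterwards.
    anchors = root.findall(".//*[@id='nav']/ol/li/a")
    return [a.get("href") for a in anchors]


shop_links = nav_links(SHOP_HTML)    # page that has the expected structure
other_links = nav_links(OTHER_HTML)  # different page: same query, no matches
print(shop_links)
print(other_links)
```

An empty list here does not mean the query is wrong; it means the document it ran against is not the one it was written for. In a spider, logging response.url and inspecting response.text gives the same kind of check.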

SOLVED: The XPath was correct. I had written the wrong start_urls!
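A quick way to catch this kind of mismatch is to compare the host the spider starts from with the host used in the shell session. The sketch below uses only urllib.parse from the standard library; the constant and function names are hypothetical, and the two URLs are the ones mentioned in this question.

```python
from urllib.parse import urlsplit

SPIDER_START_URL = "https://www.app4health.it/"  # what the spider requested
SHELL_URL = "https://shop.app4health.it/"        # what was opened in scrapy shell


def host_of(url):
    """Return the lower-cased network location (host) of a URL."""
    return urlsplit(url).netloc.lower()


def same_host(url_a, url_b):
    """True when both URLs point at the same host."""
    return host_of(url_a) == host_of(url_b)


if not same_host(SPIDER_START_URL, SHELL_URL):
    print("Different hosts:", host_of(SPIDER_START_URL), "vs", host_of(SHELL_URL))
```

With the URLs above the check reports a mismatch, which matches the fix: the selector was tested against the shop subdomain while the spider was fetching the www one.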

Answer 1

    //nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href

Use this XPath. Also, check the page's "view source" before writing an XPath.

Answer 2

In addition to Desperado's answer, you can also use a CSS selector, which is much simpler but sufficient for your use case:

$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]: 
['https://shop.app4health.it/sonno',
 'https://shop.app4health.it/monitoraggio-e-diagnostica',
 'https://shop.app4health.it/terapia',
 'https://shop.app4health.it/integratori-alimentari',
 'https://shop.app4health.it/fitness',
 'https://shop.app4health.it/benessere',
 'https://shop.app4health.it/ausili',
 'https://shop.app4health.it/prodotti-in-offerta',
 'https://shop.app4health.it/kit-regalo']

The scrapy shell command is great for debugging this kind of problem.

Answer 1: 0, 2018-04-25 07:16:30

Answer 2: 0, accepted, 2018-04-25 07:21:27
