
Scrapy / XPath not working to get href element?

I am trying to scrape some data from this site, working in the scrapy shell: https://www.tripadvisor.co.uk/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html

On the site there is the following snippet of code, and I want to get the href information from all three of these a elements:

<div class="fvqxY f dlzPP">
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"
      href="http://www.blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Visit website</span><svg viewBox="0 0 24 24"
        width="16px" height="16px" class="fecdL d Vb wQMPa">
        <path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path>
      </svg></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="tel:%2B44%201253%20830830"><span class="WlYyy cacGK Wb">Call</span></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="mailto:info@blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Email</span></a></div>
</div>

I tried it with this XPath (which works fine for me in the Chrome inspector), but I only get an empty result:

>>> response.xpath("//div[@class='Lvkmj']//ancestor::a/@href") 
[] 

I also checked the first div with class="Lvkmj" and got this result:

>>> response.xpath("//div[@class='Lvkmj']").get()
'<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"><span class="WlYyy cacGK Wb">Visit website</span><svg viewbox="0 0 24 24" width="16px" height="16px" class="fecdL d Vb wQMPa"><path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path></svg></a></div>'
>>>

At first glance I thought that was the whole div element, but then I saw that it looks exactly the same as in the inspector, except that for whatever reason the href attribute is missing.

Why is the href attribute missing when using the scrapy shell in this case?

You can find the full code below:

import scrapy

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html",
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
              ]

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield response.follow(link, callback=self.parseDetails)             

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()
    tmpErg = response.xpath("//div[@class='dlzPP']//ancestor::div[@class='WlYyy diXIH dDKKM']/text()").getall()
    
    yield {
      "cat": tmpErg[1],
      "link": tmpLink,
      "name": tmpName ,
    }

Your XPath:

//div[@class='Lvkmj']//ancestor::a/@href

gives results... because your second // tells the XPath engine: find any descendant node of the current node, and ancestor::a then tells the engine to find any ancestor element named a. And because the a does have descendants, your XPath gives results... But there is a better way; just use:

//div[@class='Lvkmj']/a/@href

/a means: give me the direct children of div[@class='Lvkmj'] that are named a.

But that does not solve your problem.

Your question: Why is the href attribute missing when using the scrapy shell in this case?

Because, I think, it only uses the source of the document and not the DOM as updated by JavaScript.

If it did use the updated DOM, your line

tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()

would return an array of strings. So you would have to loop through the results, or, if you are only interested in the first result, use:

tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href ").get()

Following the answer from @Siebe Jongebloed (no results, because some JavaScript DOM changes seem to happen), I tried to get the data with scrapy_selenium instead.

So I changed the code to:

import scrapy
from shutil import which

from scrapy_selenium import SeleniumRequest

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe"
SELENIUM_DRIVER_ARGUMENTS=['--headless', "--no-sandbox", "--disable-gpu"]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
                ]  
  existList = []  

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield SeleniumRequest(url=link, callback=self.parseDetails)  

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()    
    
    yield {
      "name": tmpName ,
      "HREFs": tmpLink
    }

But the HREFs result list is still empty...

@Rapid1898 This is a working solution so far using SeleniumRequest:

import scrapy

from scrapy_selenium import SeleniumRequest


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html",
            wait_time=3,
            callback=self.parse)

    def parse(self, response):
        tmpSEC = response.xpath( "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(url=link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()

        yield {
            "name": tmpName,
            "HREFs": tmpLink
        }

The settings.py file:

#Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

Output:

{'name': 'Sandown Park Racecourse', 'HREFs': ['http://www.sandown.co.uk/', 'tel:%2B44%201372%20464348', 'mailto:sandown.ticketing@thejockeyclub.co.uk']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g656899-d13201486-Reviews-Gala_Bingo-Cramlington_Northumberland_England.html>
{'name': 'Gala Bingo', 'HREFs': ['https://www.galabingoclubs.co.uk/club/cramlington.html', 
'tel:%2B44%201670%20739739', 'mailto:Cramlington.club@galaleisure.com']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580427-d2663587-Reviews-Romford_Greyhound_Stadium-Romford_Greater_London_England.html>
{'name': 'Romford Greyhound Stadium', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190818-d7364032-Reviews-Wetherby_Racecourse-Wetherby_Leeds_West_Yorkshire_England.html>
{'name': 'Wetherby Racecourse', 'HREFs': ['http://www.wetherbyracing.co.uk', 'tel:%2B44%201937%20582035']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580423-d3427163-Reviews-Pontefract_Races-Pontefract_West_Yorkshire_England.html>
{'name': 'Pontefract Races', 'HREFs': ['http://www.pontefract-races.co.uk/', 'tel:%2B44%201977%20781307', 'mailto:info@pontefract-races.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html>
{'name': 'Cartmel Racecourse', 'HREFs': ['http://www.cartmel-racecourse.co.uk', 'tel:%2B44%2015395%2036340', 'mailto:info@cartmel-racecourse.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html>
{'name': 'Coral Island Blackpool', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60913/session/8821f802ba0aeaa844dec796ad9187b3 {}
2021-11-18 00:55:53 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60913 "DELETE /session/8821f802ba0aeaa844dec796ad9187b3 HTTP/1.1" 200 14
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 00:55:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 19882232,
 'downloader/response_count': 31,
 'downloader/response_status_count/200': 31

...and so on
