
Scrapy / Xpath not working to get href-element?

I am trying to scrape some data from this site and it works in the scrapy shell: https://www.tripadvisor.co.uk/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html

On the site there is the following piece of markup, and I want to get the href attribute of all three a elements:

<div class="fvqxY f dlzPP">
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"
      href="http://www.blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Visit website</span><svg viewBox="0 0 24 24"
        width="16px" height="16px" class="fecdL d Vb wQMPa">
        <path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path>
      </svg></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="tel:%2B44%201253%20830830"><span class="WlYyy cacGK Wb">Call</span></a></div>
  <div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_self"
      href="mailto:info@blackpoolzoo.org.uk"><span class="WlYyy cacGK Wb">Email</span></a></div>
</div>

I tried it with this XPath, which works fine for me in the Chrome inspector, but I only get an empty result:

>>> response.xpath("//div[@class='Lvkmj']//ancestor::a/@href") 
[] 

I also checked the first div with class="Lvkmj" and got this result:

>>> response.xpath("//div[@class='Lvkmj']").get()
'<div class="Lvkmj"><a class="bfQwA _G B- _S _T c G_ P0 ddFHE cnvzr bTBvn" rel="nofollow" target="_blank"><span class="WlYyy cacGK Wb">Visit website</span><svg viewbox="0 0 24 24" width="16px" height="16px" class="fecdL d Vb wQMPa"><path d="M7.561 15.854l-1.415-1.415 8.293-8.293H7.854v-2h10v10h-2V7.561z"></path></svg></a></div>'
>>>

At first glance this looks like the whole div element, but on closer inspection it is exactly what the inspector shows, except that for whatever reason the href attribute is missing.

Why is the href attribute missing in this case when using the scrapy shell?
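The situation can be reproduced offline: the same selector is run once against the markup as the browser inspector shows it and once against the bare server response. The two HTML strings below are hypothetical stand-ins, and the stdlib's limited XPath support in ElementTree stands in for Scrapy's parsel:

```python
import xml.etree.ElementTree as ET

# Markup as the browser inspector renders it (href present)
rendered = ('<html><body><div class="Lvkmj">'
            '<a href="http://www.blackpoolzoo.org.uk">Visit website</a>'
            '</div></body></html>')
# Markup as it arrives over plain HTTP, before any JavaScript runs
# (hypothetical, matching the shell output above: href is absent)
raw = ('<html><body><div class="Lvkmj">'
       '<a>Visit website</a>'
       '</div></body></html>')

def hrefs(html):
    root = ET.fromstring(html)
    return [a.get('href')
            for a in root.findall(".//div[@class='Lvkmj']/a")
            if a.get('href') is not None]

print(hrefs(rendered))  # ['http://www.blackpoolzoo.org.uk']
print(hrefs(raw))       # []
```

The selector is fine in both cases; the attribute simply is not there in the second document, which is what the scrapy shell receives.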

You can find the full code below:

import scrapy

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html",
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
              ]

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield response.follow(link, callback=self.parseDetails)             

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()
    tmpErg = response.xpath("//div[@class='dlzPP']//ancestor::div[@class='WlYyy diXIH dDKKM']/text()").getall()
    
    yield {
      "cat": tmpErg[1],
      "link": tmpLink,
      "name": tmpName ,
    }

Your XPath:

//div[@class='Lvkmj']//ancestor::a/@href

does show results, because your second // tells the XPath engine: find any descendant node of the current node, and then ancestor::a tells the engine to find any ancestor element named a. And because the a elements do have descendants, your XPath yields results. But there is a better way; just use:

//div[@class='Lvkmj']/a/@href

/a means: give me the direct child of div[@class='Lvkmj'] that is named a.
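The child-vs-descendant distinction can be sketched with a made-up snippet, again using ElementTree's XPath subset in place of parsel: a bare element name selects only direct children, while `.//` selects descendants at any depth.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div class='Lvkmj'>"
    "<span><a href='deep'>nested</a></span>"  # an a inside a span: descendant only
    "<a href='direct'>child</a>"              # an a directly under the div
    "</div>")

# 'a' matches only direct children of the div
print([a.get('href') for a in div.findall('a')])     # ['direct']
# './/a' matches a elements at any depth below the div
print([a.get('href') for a in div.findall('.//a')])  # ['deep', 'direct']
```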

But this does not solve your problem.

Your question: Why is the href attribute missing in this case when using the scrapy shell?

Because, I think, it only uses the source of the document and not the (JavaScript-)updated DOM.

If it did use the updated DOM, your line

tmpLink = response.xpath("//div[@class='Lvkmj']//ancestor::a/@href ").getall()

would return an array of strings. So you would have to loop through the results, or, if you are only interested in the first result, use:

tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").get()
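Looping through a getall() result works with plain iteration, since parsel's .getall() returns an ordinary Python list of strings while .get() returns only the first match (or None). A sketch, with the list hard-coded to the three hypothetical links from the snippet above:

```python
# What getall() would return once the hrefs are present in the response
hrefs = ["http://www.blackpoolzoo.org.uk",
         "tel:%2B44%201253%20830830",
         "mailto:info@blackpoolzoo.org.uk"]

# Sort the links into named fields by their URL scheme
contact = {}
for href in hrefs:
    if href.startswith("tel:"):
        contact["phone"] = href
    elif href.startswith("mailto:"):
        contact["email"] = href
    else:
        contact["website"] = href

print(contact["website"])  # http://www.blackpoolzoo.org.uk
```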

Following the answer from @Siebe Jongebloed (no results, because some JavaScript DOM changes seem to happen), I tried to fetch the data with scrapy_selenium instead.

So I changed the code to:

import scrapy
from shutil import which

from scrapy_selenium import SeleniumRequest

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = r"C:\Users\Polzi\Documents\DEV\Python-Private\chromedriver.exe"
SELENIUM_DRIVER_ARGUMENTS=['--headless', "--no-sandbox", "--disable-gpu"]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

class ZoosSpider(scrapy.Spider):
  name = 'zoos'
  allowed_domains = ['www.tripadvisor.co.uk']
  start_urls = [
                "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
                ]  
  existList = []  

  def parse(self, response):
    tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
    for elem in tmpSEC:
      link = response.urljoin(elem.xpath(".//a/@href").get())   
      yield SeleniumRequest(url=link, callback=self.parseDetails)  

  def parseDetails(self, response):
    tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
    tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()    
    
    yield {
      "name": tmpName ,
      "HREFs": tmpLink
    }

But the HREFs result list is still empty...
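One likely reason the list stayed empty here: the SELENIUM_* variables and DOWNLOADER_MIDDLEWARES in the code above are defined at module level inside the spider file, and Scrapy does not read settings from there, so the Selenium middleware was never activated. Per-spider overrides belong in the spider's `custom_settings` class attribute (or in settings.py). A sketch of the dict, shown standalone; resolving the driver path via `shutil.which` is an assumption about where chromedriver is installed:

```python
from shutil import which

# Set this as a class attribute on the spider, e.g.
# class ZoosSpider(scrapy.Spider):
#     custom_settings = {...}
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {"scrapy_selenium.SeleniumMiddleware": 800},
    "SELENIUM_DRIVER_NAME": "chrome",
    "SELENIUM_DRIVER_EXECUTABLE_PATH": which("chromedriver"),  # None if not on PATH
    "SELENIUM_DRIVER_ARGUMENTS": ["--headless"],
}
```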

@Rapid1898 Here is the working solution so far using SeleniumRequest:

import scrapy

from scrapy_selenium import SeleniumRequest


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html",
            wait_time=3,
            callback=self.parse)

    def parse(self, response):
        tmpSEC = response.xpath( "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(url=link, callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()  
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()    
    
        yield {
        "name": tmpName ,
        "HREFs": tmpLink
        }

The settings.py file:

#Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

Output:

{'name': 'Sandown Park Racecourse', 'HREFs': ['http://www.sandown.co.uk/', 'tel:%2B44%201372%20464348', 'mailto:sandown.ticketing@thejockeyclub.co.uk']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g656899-d13201486-Reviews-Gala_Bingo-Cramlington_Northumberland_England.html>
{'name': 'Gala Bingo', 'HREFs': ['https://www.galabingoclubs.co.uk/club/cramlington.html', 
'tel:%2B44%201670%20739739', 'mailto:Cramlington.club@galaleisure.com']}
2021-11-18 00:55:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580427-d2663587-Reviews-Romford_Greyhound_Stadium-Romford_Greater_London_England.html>
{'name': 'Romford Greyhound Stadium', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190818-d7364032-Reviews-Wetherby_Racecourse-Wetherby_Leeds_West_Yorkshire_England.html>
{'name': 'Wetherby Racecourse', 'HREFs': ['http://www.wetherbyracing.co.uk', 'tel:%2B44%201937%20582035']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g580423-d3427163-Reviews-Pontefract_Races-Pontefract_West_Yorkshire_England.html>
{'name': 'Pontefract Races', 'HREFs': ['http://www.pontefract-races.co.uk/', 'tel:%2B44%201977%20781307', 'mailto:info@pontefract-races.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g190792-d3250215-Reviews-Cartmel_Racecourse-Grange_over_Sands_Lake_District_Cumbria_England.html>
{'name': 'Cartmel Racecourse', 'HREFs': ['http://www.cartmel-racecourse.co.uk', 'tel:%2B44%2015395%2036340', 'mailto:info@cartmel-racecourse.co.uk']}
2021-11-18 00:55:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Attraction_Review-g186332-d7692268-Reviews-Coral_Island_Blackpool-Blackpool_Lancashire_England.html>
{'name': 'Coral Island Blackpool', 'HREFs': []}
2021-11-18 00:55:53 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60913/session/8821f802ba0aeaa844dec796ad9187b3 {}
2021-11-18 00:55:53 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60913 "DELETE /session/8821f802ba0aeaa844dec796ad9187b3 HTTP/1.1" 200 14
2021-11-18 00:55:53 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-18 00:55:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 19882232,
 'downloader/response_count': 31,
 'downloader/response_status_count/200': 31

..and so on
