Scrapy xpath not extracting div containing special characters <%=

I am new to Scrapy. I am trying to extract the h2 text from the following URL: 'https://www.tysonprop.co.za/agents/'

I have 2 problems:

  1. My XPath reaches the script element, but it cannot find the h2 or div elements inside the script tag. I even tried saving the HTML file to my machine and scraping that file, but the same problem occurs. I have triple-checked my XPath code and everything seems to be in order.

  2. When the website is displayed in my browser, branch.branch_name resolves to "Tysen Properties Head Office". How would one get the value (i.e. "Tysen Properties Head Office") instead of the variable name (branch.branch_name)?

My Python code:

import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]')
        div = script.xpath('./div[contains(@class,"branch-container")]')
        h2 = div.xpath('/h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}

HTML extract below:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>

Does branch.branch_name look like an address in JSON format? Is there a call that loads the data you are looking for? Maybe; let's see.

If you open your browser's developer tools and look through the requests in the Network tab, you will find this AJAX call, which loads exactly the data you are looking for. So:

import json

import scrapy


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        # Request the AJAX endpoint found in the Network tab instead of the HTML page.
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The endpoint returns JSON, so decode the body rather than using XPath.
        json_data = json.loads(response.text)
        branch_name = json_data['branch']['branch_name']
        yield {'branchName': branch_name}
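The JSON lookup in parse can be tried offline before running the spider. The following is a minimal sketch using only the standard library. The sample body and the helper name extract_branch_name are made up for illustration; only the branch -> branch_name nesting comes from the code above, and the real response may contain other keys.

```python
import json

# Hypothetical response body. Only the "branch" -> "branch_name" nesting is
# taken from the spider above; the other key and all values are invented.
SAMPLE_BODY = '{"branch": {"id": 25, "branch_name": "Example Branch"}}'


def extract_branch_name(body):
    """Return branch.branch_name from a JSON body, or None if it is missing."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # The body was not valid JSON (for example, an HTML error page).
        return None
    branch = data.get("branch") if isinstance(data, dict) else None
    if not isinstance(branch, dict):
        return None
    return branch.get("branch_name")


print(extract_branch_name(SAMPLE_BODY))
```

Returning None instead of raising KeyError is a design choice for this sketch: it lets a spider skip a malformed response rather than stop on it.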

The div inside the script tag is plain text, not parsed HTML. To parse it as HTML, you can do the following:

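from scrapy.selector import Selector

....
    def parse(self, response):
        # The script body is plain text, so parse it again as its own HTML document.
        template = response.xpath('//script[@id="id_branch_template"]/text()').get()
        script = Selector(text=template)
        # Selector(text=...) wraps the fragment in <html><body>, so search descendants.
        div = script.xpath('.//div[contains(@class,"branch-container")]')
        h2 = div.xpath('.//h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}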

But please note: the h2 contains only the template placeholder, not the rendered branch name, so your result will be an empty list.
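To see why the original XPath finds nothing under the script element, it may help to look at how an HTML parser treats script content. The sketch below uses the standard library's html.parser rather than Scrapy's own selectors, so it illustrates the idea and does not reproduce Scrapy's behaviour exactly. The HTML is shortened from the question, and the names TagRecorder and inspect are made up for this example.

```python
import re
from html.parser import HTMLParser

# Shortened copy of the template from the question.
TEMPLATE_PAGE = '''
<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
  </div>
</script>
'''


class TagRecorder(HTMLParser):
    """Record every start tag seen and the raw text inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_text.append(data)


def inspect(html):
    """Return (start tags seen, raw script text) for an HTML string."""
    parser = TagRecorder()
    parser.feed(html)
    parser.close()
    return parser.start_tags, "".join(parser.script_text)


start_tags, template = inspect(TEMPLATE_PAGE)
# Collect the expressions inside <%= ... %> placeholders.
placeholders = re.findall(r"<%=\s*(.*?)\s*%>", template)

print(start_tags)     # only the script tag is reported as an element
print(placeholders)   # the variable names the browser fills in later
```

The parser reports script as the only element: the div and h2 arrive as one block of text, and the h2 holds nothing but a placeholder. The branch name itself has to come from the data the page loads separately.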
