Scrapy XPath does not extract a div containing the special characters <%=

Question

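I'm new to Scrapy. I'm trying to extract the h2 text from the following URL: 'https://www.tysonprop.co.za/agents/'

I have two problems:

1. My XPath finds the script element, but it cannot find the h2 or div elements inside the script tag. I even tried saving the HTML file to my machine and scraping that file, but the same problem occurred. I have triple-checked my XPath code and everything seems to be in order.
2. When the site is displayed in my browser, branch.branch_name resolves to "Tysen Properties Head Office". How do I get the value (i.e. "Tysen Properties Head Office") instead of the variable name (branch.branch_name)?

My Python code: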

import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]')
        div = script.xpath('./div[contains(@class,"branch-container")]')
        h2 = div.xpath('/h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}

An excerpt of the HTML:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>

Answer 1

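branch.branch_name looks like a path into a JSON object, doesn't it? Is there a call that loads the data you are looking for? Probably, so let's take a look.

Open your browser's developer tools and look through the requests in the Network tab. Searching through them, you will come across this AJAX call, which loads the data you are looking for. So: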

import json
import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_name = json_data['branch']['branch_name']
        yield {'branchName': branch_name}
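
The JSON lookup above can also be tried without touching the network. The following is a minimal sketch, assuming the response has the shape implied by the keys used above (a "branch" object with a "branch_name" field). The sample payload and the helper name extract_branch_name are hypothetical and only serve the illustration.

```python
import json

# Hypothetical payload: only the keys 'branch' and 'branch_name' come from
# the answer above; the value is made up for illustration.
SAMPLE_BODY = '{"branch": {"branch_name": "Example Branch"}}'


def extract_branch_name(body):
    """Return the branch name from a JSON response body, or None if absent."""
    data = json.loads(body)
    # .get() avoids a KeyError if the endpoint ever omits a key.
    return data.get("branch", {}).get("branch_name")


if __name__ == "__main__":
    print(extract_branch_name(SAMPLE_BODY))
```

Inside the spider, the same helper could be called as extract_branch_name(response.text), which keeps the parse callback from raising if the payload changes shape.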

Answer 2

The div inside the script tag is plain text, not an element. To treat it as HTML, you can do the following:

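from scrapy.selector import Selector

....
    def parse(self, response):
        # The template inside <script> is plain text, so re-parse it as HTML
        template = response.xpath('//script[@id="id_branch_template"]/text()').get()
        script = Selector(text=template)
        # Selector(text=...) wraps the fragment in <html><body>, so search descendants
        div = script.xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('.//h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}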

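Note, however, that the h2 contains no real text, only the <%= branch.branch_name %> template placeholder, so your result is likely to be an empty list rather than the branch name.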
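
To see why the original XPath finds the script but nothing inside it, it can help to watch a parser walk the excerpt. The sketch below uses the standard library's html.parser instead of Scrapy (a deliberate substitution so it runs without extra packages); the class name TagLogger is hypothetical. It records which start tags the parser reports and collects the script body as text, then pulls out the <%= ... %> placeholders with a regular expression.

```python
import re
from html.parser import HTMLParser

# Shortened copy of the HTML excerpt from the question.
PAGE = '''<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
  </div>
</script>'''


class TagLogger(HTMLParser):
    """Record every start tag seen and collect the text inside <script>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_text.append(data)


parser = TagLogger()
parser.feed(PAGE)
parser.close()

# The script body arrives as text; the div and h2 never show up as tags.
template = "".join(parser.script_text)
# Names referenced by the template placeholders, e.g. branch.branch_name.
placeholders = re.findall(r"<%=\s*(.*?)\s*%>", template)

if __name__ == "__main__":
    print(parser.tags)
    print(placeholders)
```

The only start tag reported is script, which matches the behaviour described in the question: the markup inside the template is text until something re-parses it, and the placeholders are filled in by JavaScript in the browser, not by the server.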

Answer 1: 1 vote, accepted, 2020-09-23 11:01:58
Answer 2: 0 votes, 2020-09-23 08:54:47
