Scrapy: Extracting data from a script tag

Question

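I am new to Scrapy. I am trying to scrape content from "https://www.tysonprop.co.za/agents/" for work purposes.

In particular, the information I am looking for appears to be generated by a script tag.

The line <%= branch.branch_name %> resolves at runtime to: Tyson Properties Head Office.

I am trying to access the text that is generated at runtime inside the h2 element.

However, the Scrapy response object appears to contain the raw source code. That is, the data I want shows up as <%= branch.branch_name %> instead of "Tyson Properties Head Office".

Any help would be greatly appreciated.

Extract from the HTML response object: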

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>
<script type="text/javascript">

Current Scrapy spider code:

import scrapy
from scrapy.crawler import CrawlerProcess

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')

Related to this question: Scrapy xpath not extracting div contains special characters <%=

Answer 1

As the accepted answer to the related question suggests, consider requesting the AJAX endpoint directly.
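A minimal sketch of what handling such an endpoint's response might look like. The payload shape is an assumption: the thread does not show the endpoint URL or its JSON, so the list-of-objects layout and the `branch_name` key are hypothetical, modelled on the `branch.branch_name` field in the template.

```python
import json


def branch_names(payload_text):
    """Return the branch names found in a JSON payload.

    Assumed (hypothetical) shape: a JSON list of objects, each with a
    "branch_name" key, mirroring `branch.branch_name` in the page template.
    """
    branches = json.loads(payload_text)
    # Skip entries that do not carry the key instead of raising KeyError.
    return [b["branch_name"] for b in branches if "branch_name" in b]


# Hypothetical sample payload; the real endpoint may look different.
sample = '[{"id": 1, "branch_name": "Tyson Properties Head Office"}]'
print(branch_names(sample))  # ['Tyson Properties Head Office']

# Inside a spider, the same helper could be fed from a callback, e.g.:
#     def parse_branches(self, response):
#         for name in branch_names(response.text):
#             yield {"branch_name": name}
```

The endpoint URL itself would have to be found in the browser's network tab; it is deliberately not guessed here.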

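If that does not work for you, consider using Splash. The data appears to be downloaded via AJAX and then added to the page with JavaScript. Scrapy can use Splash to execute the JS on the page.

For example, once the page has been rendered, the following should work as expected: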

h2 = div.xpath('./h2[contains(@class,"branch-name")]')

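The documentation has instructions for installing Splash. Once it is up and running, the code changes needed in the actual spider are minimal.

Install scrapy-splash: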

pip install scrapy-splash

Add some configuration to settings.py (listed on the GitHub page), and finally use SplashRequest instead of scrapy.Request.
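As a rough guide, the settings.py additions described in the scrapy-splash README look roughly like the fragment below. The `SPLASH_URL` value is an assumption (a Splash instance listening locally on port 8050, for example one started with Docker); adjust it to wherever Splash actually runs.

```python
# settings.py fragment for scrapy-splash (sketch based on the project README).

# Assumption: Splash is reachable locally on port 8050.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep response compression handling.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Deduplicate Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With this in place, the spider would import `SplashRequest` from `scrapy_splash` and yield something like `SplashRequest(url, self.parse, args={'wait': 0.5})` where it currently yields `scrapy.Request`. The wait value is only a starting point and may need tuning for this page.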

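If that does not work for you, look into Selenium or Pyppeteer.

Also note that before the JS is executed, the HTML response does not contain "Tyson Properties Head Office" anywhere in the script template. It appears only as a drop-down menu item, which is probably not that useful, so the rendered h2 text cannot be extracted from the raw response.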
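To make the point concrete, here is a small standard-library sketch. It takes the template text quoted in the question, lists its `<%= ... %>` placeholders, and shows that the branch name only appears once data is substituted in. The `render` helper is a rough, hypothetical stand-in for the browser-side templating step, and the context values (including the id `1`) are made up for illustration.

```python
import re

# Template markup as it appears in the raw response (shortened from the question).
RAW_TEMPLATE = """
<div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container">
  <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
</div>
"""

# Matches <%= expression %> and captures the expression without padding.
PLACEHOLDER = re.compile(r"<%=\s*(.*?)\s*%>")


def find_placeholders(template):
    """Return the expressions inside every <%= ... %> placeholder."""
    return PLACEHOLDER.findall(template)


def render(template, context):
    """Hypothetical stand-in for client-side rendering (dotted lookups only)."""
    def substitute(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return PLACEHOLDER.sub(substitute, template)


print(find_placeholders(RAW_TEMPLATE))  # ['branch.id', 'branch.branch_name']

# Illustrative data only; in the real page this arrives via AJAX.
context = {"branch": {"id": 1, "branch_name": "Tyson Properties Head Office"}}
rendered = render(RAW_TEMPLATE, context)
print("Tyson Properties Head Office" in RAW_TEMPLATE)  # False
print("Tyson Properties Head Office" in rendered)      # True
```

Scrapy only ever sees the equivalent of `RAW_TEMPLATE`, which is why the XPath in the question returns the placeholder text.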

Answer 1 posted: 2020-09-23 12:50:06