Scrapy: extracting data from script tag

I am new to Scrapy. I am trying to scrape contents from 'https://www.tysonprop.co.za/agents/' for work purposes.

In particular, the information I am looking for seems to be generated by a script tag.

The line: <%= branch.branch_name %> resolves to: Tyson Properties Head Office at run time.

I am trying to access the text generated inside the h2 element at run time.

However, the Scrapy response object seems to grab the raw source code, i.e. the data I want appears as <%= branch.branch_name %> and not "Tyson Properties Head Office".

Any help would be appreciated.

HTML response object extract:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>
<script type="text/javascript">

Current Scrapy spider code:

import scrapy
from scrapy.crawler import CrawlerProcess

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')

Related to this question: Scrapy xpath not extracting div containing special characters <%=

As the accepted answer on the related question suggests, consider using the AJAX endpoint.
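As a rough illustration of the AJAX route, here is a minimal sketch of pulling branch names out of a JSON response. The endpoint URL is not given here, and the payload shape (a top-level "branches" list whose items carry "branch_name") is purely an assumption based on the template's variable names; check the real response in the browser's network tab and adjust accordingly.

```python
import json


def branch_names(json_text):
    """Return the branch names found in a JSON payload.

    Assumes (hypothetically) a payload shaped like:
        {"branches": [{"id": 1, "branch_name": "..."}, ...]}
    """
    data = json.loads(json_text)
    # Fall back to an empty list if the assumed "branches" key is missing.
    return [branch["branch_name"] for branch in data.get("branches", [])]


# Inside a spider callback for the AJAX request this could be used roughly as:
#     def parse_api(self, response):
#         for name in branch_names(response.text):
#             yield {"branch_name": name}
```

Because the endpoint returns data rather than a template, no JavaScript rendering is needed with this approach.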

If that doesn't work for you, consider using Splash. The data seems to be downloaded with AJAX and added to the page with JavaScript. Scrapy can use Splash to execute JS on the page.

For example, once the page has been rendered by Splash, the original XPath should work as-is:

h2 = div.xpath('./h2[contains(@class,"branch-name")]')

The docs have instructions for installing Splash, but once it is up and running, the code changes to the actual crawler are pretty minimal.

Install scrapy-splash with:

pip install scrapy-splash

Add some configuration to settings.py (listed on the GitHub page) and finally use SplashRequest instead of scrapy.Request.
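For reference, the settings.py additions look roughly like the sketch below, which follows the scrapy-splash README. The SPLASH_URL value assumes a Splash instance running locally on its default port (for example via Docker); treat the README as the authoritative source.

```python
# settings.py additions for scrapy-splash (sketch based on the project README).

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoid storing duplicate Splash arguments with every queued request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make request deduplication aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Only relevant if Scrapy's HTTP cache is enabled.
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

With these settings in place, start_requests in the spider would yield SplashRequest(url, callback=self.parse, args={'wait': 2}) (imported from scrapy_splash) instead of scrapy.Request. The wait value of 2 seconds is a guess and may need tuning for this page.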

If that doesn't work for you, maybe check out Selenium or Pyppeteer.

Also, before any JS is executed, the HTML response doesn't contain "Tyson Properties Head Office" inside the script template; it appears only as a dropdown menu item, which probably isn't that useful. That is why it can't be extracted from the h2 in the raw response.
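A quick way to confirm this is to list the placeholders left in the template text that Scrapy returns. The sketch below uses only the standard library; TEMPLATE is the h2 line copied from the question, and the helper name is made up for illustration.

```python
import re

# The h2 line from the question's script template, as Scrapy receives it.
TEMPLATE = '<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>'


def template_placeholders(html):
    """Return the expressions found inside <%= ... %> placeholders."""
    return re.findall(r'<%=\s*(.*?)\s*%>', html)


print(template_placeholders(TEMPLATE))
```

If this prints a non-empty list for the text extracted from the script tag, the content is still an unrendered template and the real values have to come from the AJAX data or from a rendered page.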
