
Scrapy: extracting data from script tag

I am new to Scrapy. I am trying to scrape contents from 'https://www.tysonprop.co.za/agents/' for work purposes.

In particular, the information I am looking for seems to be generated by a script tag.

The line: <%= branch.branch_name %> resolves to: Tyson Properties Head Office at run time.

I am trying to access the text generated inside the h2 element at run time.

However, the Scrapy response object seems to contain only the raw source code, i.e. the data I want appears as <%= branch.branch_name %> and not "Tyson Properties Head Office".

Any help would be appreciated.

HTML response object extract:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>
<script type="text/javascript">

Current Scrapy spider code:

import scrapy
from scrapy.crawler import CrawlerProcess

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')

Related to this question: Scrapy xpath not extracting div containing special characters <%=

As the accepted answer on the related question suggests, consider using the AJAX endpoint.
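
As a rough sketch of what consuming such an endpoint could look like: the endpoint URL and the shape of its JSON are not given here, so the "branches" key below is purely hypothetical, and the field names are only borrowed from the template (branch.id, branch.branch_name). Check the real request in the browser's network tab and adjust accordingly.

```python
import json
from typing import List


def extract_branch_names(payload: str) -> List[str]:
    """Return branch names from a JSON payload.

    Assumption: the payload looks like {"branches": [{"branch_name": ...}, ...]}.
    This shape is hypothetical; adapt it to the actual AJAX response.
    """
    data = json.loads(payload)
    branches = data.get("branches", [])
    # Skip entries that do not carry a branch_name field.
    return [b["branch_name"] for b in branches if "branch_name" in b]


# Made-up sample payload, for illustration only.
sample = '{"branches": [{"id": 1, "branch_name": "Tyson Properties Head Office"}]}'
print(extract_branch_names(sample))
```

In a spider callback for the endpoint request, this would be called as extract_branch_names(response.text).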

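If that doesn't work for you, consider using Splash. The data seems to be downloaded with AJAX and added to the page by JavaScript, which Scrapy does not execute on its own. With Splash, Scrapy can execute the JS on the page and receive the rendered HTML.

For example, once the response contains the rendered HTML, a selector like this one from the question should work: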

h2 = div.xpath('./h2[contains(@class,"branch-name")]')

The docs have instructions for installing Splash. Once it is up and running, the code changes to the actual crawler are pretty minimal.

Install scrapy-splash with

pip install scrapy-splash

Add some configuration to settings.py (listed on the GitHub page) and finally use SplashRequest instead of scrapy.Request.
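
For reference, the settings in question look roughly like the fragment below. This is reproduced from memory of the scrapy-splash README, so verify it against the current version (newer releases may list extra settings). SPLASH_URL assumes a Splash instance running locally on its default port.

```python
# settings.py (fragment), based on the scrapy-splash README

# Assumption: Splash is reachable locally on the default port 8050.
SPLASH_URL = "http://localhost:8050"

# Enable the Splash middlewares and adjust HttpCompressionMiddleware priority.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Saves disk space by not storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

In the spider, the request then becomes something like SplashRequest(url, self.parse, args={'wait': 0.5}), with SplashRequest imported from scrapy_splash. The wait value is only an illustrative guess at how long the page needs to finish rendering.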

If that doesn't work for you, maybe check out Selenium or Pyppeteer.

Also, before any JS is executed, the HTML response contains "Tyson Properties Head Office" only as a dropdown menu item, which probably isn't that useful. Inside the script template there is just the <%= branch.branch_name %> placeholder, so the rendered h2 text can't be extracted from the raw response.
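
A quick way to confirm this is to look for unrendered placeholders in the template text. The sketch below uses only the standard library; the helper name find_placeholders is made up for illustration, and the sample string is the h2 line from the question's HTML extract.

```python
import re

# Matches template expressions of the form <%= something %>.
PLACEHOLDER = re.compile(r"<%=\s*(.*?)\s*%>")


def find_placeholders(template: str) -> list:
    """Return the expressions inside <%= ... %> markers, in order of appearance."""
    return PLACEHOLDER.findall(template)


raw = '<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>'
print(find_placeholders(raw))  # ['branch.branch_name']
```

If this returns anything for the text Scrapy downloaded, the template has not been rendered and the real values have to come from somewhere else (the AJAX endpoint or a JS-rendering tool).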

