Python parse HTML with escape characters

Question

I am trying to scrape data from a website, but the data table is rendered by JavaScript. Instead of using a tool like Selenium to render the page and run the script, I found the script tag where the data is stored and am trying to pull the data directly from there.

Here is the code:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.etf.com/SPY'

result = requests.get(url)

c = result.content
html = BeautifulSoup(c, 'html.parser')

script = html.find_all('script')[-22]  # this is the script tag that has the data

script = script.contents

js = script[0]
data = js[31:-2]  #data is the json/dict which has the data

This is a snippet of what the contents of data looks like:

s = json.loads(data)

s = s['etf_report_from_api']['modalInfoToActive']['top10Holdings']['data']

s = s[13:-2]

Here is a snippet of what s looks like:

At this point the content looks more like HTML, but it seems the escape characters have not been unescaped properly.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed(s)  # feed the extracted string to the parser to produce the output

Here is the output of the parser. It seems to recognize certain tags but identifies others as data because of the formatting issue.

This data is essentially an HTML table, but how can I properly decode/parse it to extract the data contents?

Answer 1

It looks to me like you simply need to unescape the \" and \/ sequences in your string s, and then you can successfully parse the markup with bs4:

soup = BeautifulSoup(s.replace(r"\"", '"').replace(r"\/", "/"), "html.parser")

for row in soup.find_all("tr"):
    name, value = row.find_all("td")
    print(f"{name.text}\t{value.text}")

Result:

Microsoft Corporation   3.55%
Apple Inc.  3.31%
Amazon.com, Inc.    3.11%
Facebook, Inc. Class A  1.76%
Berkshire Hathaway Inc. Class B 1.76%
...
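
The two replace calls above cover only \" and \/. If the string also carried other JSON-style escapes (for example \n or \u00e9), one alternative, swapped in here and not part of the original answer, is to wrap the fragment in double quotes and let json.loads decode every escape in one pass. The sketch below is a minimal illustration under assumptions: the sample string is hypothetical and only modeled on the output shown above, the names unescape_json_fragment, TableRowParser and parse_rows are made up for this example, and the standard-library html.parser stands in for bs4 so the snippet runs without extra packages.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample modeled on the escaped markup described in the question;
# the real string would come from the page's script tag.
escaped = (
    r'<table><tr><td class=\"name\">Microsoft Corporation<\/td><td>3.55%<\/td><\/tr>'
    r'<tr><td class=\"name\">Apple Inc.<\/td><td>3.31%<\/td><\/tr><\/table>'
)


def unescape_json_fragment(fragment):
    """Decode JSON string escapes by wrapping the fragment in double quotes.

    Assumes every backslash and double quote in the fragment belongs to a
    valid JSON escape sequence; otherwise json.loads raises an error.
    """
    return json.loads('"' + fragment + '"')


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


def parse_rows(fragment):
    """Unescape the fragment and return its table rows as lists of cell text."""
    parser = TableRowParser()
    parser.feed(unescape_json_fragment(fragment))
    parser.close()
    return parser.rows


rows = parse_rows(escaped)
for name, value in rows:
    print(f"{name}\t{value}")
```

If the fragment contains a bare double quote or a stray backslash, json.loads will reject it, and the targeted replace calls shown earlier are the safer choice.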

solution1
1 ACCEPTED 2019-02-02 20:38:56