
Python parse HTML with escape characters

I am trying to scrape data from a website, but the data table is rendered by JavaScript. Rather than using a tool like Selenium to load the page and run the script, I found the script tag where the data is stored and am trying to pull the data directly from there.

Here is the code:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.etf.com/SPY'

result = requests.get(url)

c = result.content
html = BeautifulSoup(c, 'html.parser')

script = html.find_all('script')[-22]   #this is the script tag that has the data

script = script.contents

js = script[0]
data = js[31:-2]  #data is the json/dict which has the data
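
The fixed offsets in js[31:-2] depend on the exact wrapper around the JSON, so they break if the page changes by even one character. A minimal sketch of a less brittle alternative, assuming the script text contains one JSON object that starts at the first "{": json.JSONDecoder().raw_decode parses from a given index and ignores whatever trails the object. The helper name extract_json and the wrapper someFunction(...) in the sample are hypothetical, not taken from the site.

```python
import json


def extract_json(js_text):
    """Decode the first JSON object embedded in a JavaScript snippet."""
    # Assumption: the first "{" in the script opens the JSON object.
    start = js_text.index("{")  # raises ValueError if there is no "{"
    # raw_decode parses one JSON value starting at `start` and returns
    # (value, end_index); trailing text such as ");" is left alone.
    obj, _end = json.JSONDecoder().raw_decode(js_text, start)
    return obj


# Hypothetical stand-in for the real script contents.
sample_js = 'someFunction(settings, {"etf_report_from_api": {"value": 1}});'
data = extract_json(sample_js)
print(data["etf_report_from_api"]["value"])
```

With this approach the result is already a dict, so the later json.loads(data) call would not be needed.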

This is a snippet of what the contents of data look like:

[Image: snippet of the contents of data]

s = json.loads(data)

s = s['etf_report_from_api']['modalInfoToActive']['top10Holdings']['data']

s = s[13:-2]

Here is a snippet of what s looks like:

[Image: snippet of s]

At this point the content looks more like HTML, but it seems the escape characters have not been unescaped properly.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed(s)  # run the parser over the extracted string

Here is the output of the parser. It recognizes certain tags but treats others as data because of the formatting issue.

[Image: output of the parser]

This data is essentially an HTML table, but how can I properly decode/parse it to extract the data contents?

It looks to me like you simply need to unescape the \" and \/ sequences in your string s, and then you can successfully parse the markup with bs4:

soup = BeautifulSoup(s.replace(r"\"", '"').replace(r"\/", "/"), "html.parser")

for row in soup.find_all("tr"):
    name, value = row.find_all("td")
    print(f"{name.text}\t{value.text}")

Result:

Microsoft Corporation   3.55%
Apple Inc.  3.31%
Amazon.com, Inc.    3.11%
Facebook, Inc. Class A  1.76%
Berkshire Hathaway Inc. Class B 1.76%
...
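
For anyone who prefers to stay with the standard-library HTMLParser from the question, here is a rough sketch of the same idea: unescape \" and \/ first, then collect the td cells row by row. The class name RowCollector and the sample string are made up for illustration; the real s comes from the page and may be shaped differently.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the current row, None outside <tr>
        self._cell = None  # text pieces of the current cell, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample shaped like the escaped string in the question.
s = (
    r'<tr><td class=\"name\">Microsoft Corporation<\/td><td>3.55%<\/td><\/tr>'
    r'<tr><td class=\"name\">Apple Inc.<\/td><td>3.31%<\/td><\/tr>'
)

# Same unescaping step as in the answer above.
unescaped = s.replace(r'\"', '"').replace(r"\/", "/")

collector = RowCollector()
collector.feed(unescaped)
collector.close()

for name, value in collector.rows:
    print(f"{name}\t{value}")
```

Keeping the rows in a list rather than printing from inside the handlers makes it easier to pass the holdings on to other code.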
