BS4: How do I get a value from invalid JSON in a <script> tag?

I am trying to get hlsUrl from this tag.

import json
import re

from bs4 import BeautifulSoup

html_doc = """
    <script>
      var modelData = {
        "hlsUrl": "null",
        "account": "1V2FO4K7ME78RV09VXNEC",
        "packageName": "null",
        isActive: false
      }
  </script>
"""

soup = BeautifulSoup(html_doc, "html.parser")
script_text = soup.select_one("script").string
model_data = re.search(r"modelData = ({.*?})", script_text, re.S).group(1)
print(json.loads(model_data)["account"])

But I am getting this error:

obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 5 column 9 (char 111)

I know this happens because the object is not valid JSON: the isActive key is not enclosed in double quotes.

How do I turn it into valid JSON, or otherwise get hlsUrl?

import json
import re

from bs4 import BeautifulSoup

html_doc = """
    <script>
      var modelData = {
        "hlsUrl": "null",
        "account": "1V2FO4K7ME78RV09VXNEC",
        "packageName": "null",
        isActive: false
      }
  </script>
"""

soup = BeautifulSoup(html_doc, "html.parser")
script_text = soup.select_one("script").string
model_data = re.search(r"modelData = ({.*?})", script_text, re.S).group(1)

model_data = re.sub(r"^\s*([^\s\"]+):", r'"\1":', model_data, flags=re.M)  # <-- fix `isActive:` to `"isActive":`

print(json.loads(model_data)["account"])

Prints:

1V2FO4K7ME78RV09VXNEC
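
Since the question asks for hlsUrl specifically, here is a minimal sketch of the same idea wrapped in a helper. The name quote_bare_keys is hypothetical (it is not part of bs4 or json), and the sketch assumes each key sits on its own line and the object is flat, as in the sample. It starts from the script text directly, which is what soup.select_one("script").string returns above. Note that in this sample hlsUrl holds the string "null", not JSON null, so it has to be compared as a string.

```python
import json
import re


def quote_bare_keys(js_object: str) -> str:
    """Hypothetical helper: wrap unquoted keys that start a line in double quotes."""
    # Group 1 keeps the indentation, group 2 is the bare identifier.
    # Keys that are already quoted do not match, because they start with a quote.
    return re.sub(r'^(\s*)([A-Za-z_$][\w$]*)\s*:', r'\1"\2":', js_object, flags=re.M)


# Same script body as in the question (assumption: one key per line, no nested objects).
script_text = """
      var modelData = {
        "hlsUrl": "null",
        "account": "1V2FO4K7ME78RV09VXNEC",
        "packageName": "null",
        isActive: false
      }
"""

match = re.search(r"modelData\s*=\s*(\{.*?\})", script_text, re.S)
model_data = json.loads(quote_bare_keys(match.group(1)))

hls_url = model_data["hlsUrl"]
# The sample stores the literal string "null", so normalise it to None by hand.
if hls_url == "null":
    hls_url = None

print(hls_url)                  # None
print(model_data["isActive"])   # False
```

If the real page contains nested objects, single-quoted strings, or trailing commas, a regex like this will probably not be enough, and a parser that accepts JavaScript object literals would be a safer choice.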
