How to extract JSON from script with Python?

I am parsing a scraped HTML page that contains a script with JSON inside. This JSON contains all the information I am looking for, but I can't figure out how to extract it as valid JSON.

Minimal example:

my_string = '''
        (function(){
          window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
          window.__PRELOADED_STATE__.push(
        
           { *placeholder representing valid JSON inside* }
        );
        })()
'''

The JSON inside is valid according to jsonlinter.

The result should be loaded into a dictionary:

import json
import re
my_json = re.findall(r'.*(?={\").*', my_string)[0]  # extract json
data = json.loads(my_json)
# print(data)

regex: https://regex101.com/r/r0OYZ0/1

This attempt results in:

>>> data = json.loads(my_json)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)

How can the JSON be extracted and loaded from the string with Python 3.7.x?

You can try extracting it with the regex below. It handles only a very simple case and might not cover all possible JSON variations:

import re

my_string = '''
        (function(){
          window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
          window.__PRELOADED_STATE__.push(
        
            {"tst":{"f":3}}
        );
        })()
'''
# \s* allows for the whitespace and newlines between push( ... ) and the JSON
result = re.findall(r"push\(\s*([{\[].*:.*[}\]])\s*\)", my_string)[0]
print(result)  # {"tst":{"f":3}}

To parse it into a dictionary:

import json

dictionary = json.loads(result)
print(type(dictionary))  # <class 'dict'>
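As an alternative to the regex above, here is a minimal sketch that relies on the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value from a given position and ignores whatever follows it. The helper name `extract_pushed_json` and the `marker` parameter are hypothetical, and the sketch assumes the JSON starts right after the first `push(` in the script.

```python
import json


def extract_pushed_json(script_text, marker="push("):
    """Return the first JSON value that follows `marker` in `script_text`."""
    # str.index raises ValueError if the marker is missing.
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so advance past it manually.
    while script_text[start].isspace():
        start += 1
    # Parse exactly one JSON value; the trailing ");" and the rest are ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value


sample = '''
        (function(){
          window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
          window.__PRELOADED_STATE__.push(

            {"tst":{"f":3}, "note": "a ) or } inside a string is fine"}
        );
        })()
'''
data = extract_pushed_json(sample)
print(data["tst"]["f"])
```

Because the JSON decoder itself decides where the value ends, brackets or parentheses inside string values should not confuse it the way they can confuse a regex.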

The my_string provided here is not valid JSON as a whole, because the JSON is wrapped in JavaScript. For valid JSON, you can use json.loads(JSON_STRING):

import json

d = json.loads('{"test":2}')
print(d) # Prints the dictionary `{'test': 2}`
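To see why the whole script is rejected, one option is to catch `json.JSONDecodeError` and look at its `pos` attribute. The following is a rough sketch: the `script_text` value is a made-up one-line example, and slicing between the first `{` and the last `}` is a crude assumption that only holds when the script contains a single JSON object.

```python
import json

# Hypothetical one-line script, shortened for illustration.
script_text = 'window.__PRELOADED_STATE__.push({"test": 2});'

try:
    json.loads(script_text)
except json.JSONDecodeError as err:
    # err.pos is the index at which the decoder gave up.
    failed_at = err.pos
    print(f"Not JSON: {err.msg} at char {failed_at}")

# Only the object literal inside the call is JSON, so slice it out first.
start = script_text.index("{")
end = script_text.rindex("}") + 1
parsed = json.loads(script_text[start:end])
print(parsed)
```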

Have a look at the code below. Note that { *placeholder representing valid JSON inside* } has to be valid JSON.

my_string = '''
        <script>
            (function(){
              window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
              window.__PRELOADED_STATE__.push(
            
               {"foo":["bar1", "bar2"]}
            );
            })()
         </script>
'''

import re, json

my_json = re.findall(r'.*(?={\").*', my_string)[0].strip()
data = json.loads(my_json)
print(data)

Output:

{'foo': ['bar1', 'bar2']}
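If the input is a full HTML page rather than a single script, a possible variation is to collect the contents of each `<script>` element first and only then look for the JSON. The sketch below uses only the standard library (`html.parser` and `json`). The class name `ScriptCollector`, the sample `html` string, and the `marker` value are assumptions made for illustration, and the JSON is assumed to follow the `push(` call directly.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


html = '''
<html><body><p>Hello</p>
<script>
    (function(){
      window.__PRELOADED_STATE__ = window.__PRELOADED_STATE__ || [];
      window.__PRELOADED_STATE__.push(

         {"foo":["bar1", "bar2"]}
      );
    })()
</script>
</body></html>
'''

collector = ScriptCollector()
collector.feed(html)
collector.close()

marker = "window.__PRELOADED_STATE__.push("
states = []
for script in collector.scripts:
    pos = script.find(marker)
    if pos == -1:
        continue  # this script does not push any state
    body = script[pos + len(marker):].lstrip()
    # raw_decode parses one JSON value and ignores the trailing JavaScript.
    value, _ = json.JSONDecoder().raw_decode(body)
    states.append(value)

print(states)
```

Parsing the HTML first keeps the search limited to script contents, so text elsewhere on the page that happens to contain `{"` is not picked up.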
