
How to parse JavaScript JSON into a Python dict efficiently

I am looking for a way to read the JavaScript JSON data loaded into one of the script tags of this page. I have tried various re patterns posted on Google and Stack Overflow, but none of them worked.

The JSON Formatter reports the data as invalid (RFC 8259).

Here is the code:

import json

import requests
from scrapy.selector import Selector

headers = {'Content-Type': 'application/json', 'Accept-Language': 'en-US,en;q=0.5', 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'}

url = 'https://www.zocdoc.com/doctor/andrew-fagelman-md-7363?insuranceCarrier=-1&insurancePlan=-1'

response = requests.get(url, headers=headers)

sel = Selector(text=response.text)

# Take the text of the script that mentions APOLLO_STATE, then keep only the
# argument of JSON.parse(...), i.e. everything up to the window.ZD assignment.
profile_data = sel.css('script:contains(APOLLO_STATE)::text').get('{}').split('__REDUX_STATE__ = JSON.parse(')[-1].split(');\n          window.ZD = {')[0]

profile_json = json.loads(profile_data)

# Prints <class 'str'> rather than <class 'dict'>
print(type(profile_json))

The problem seems to be an invalid JSON format. With the code above, profile_json comes out as a string rather than a dict, and a small amendment to the code (splitting inside the quotes and stripping the backslashes) produces the error below:

>>> profile_data = sel.css('script:contains(APOLLO_STATE)::text').get('{}').split('__REDUX_STATE__ = JSON.parse("')[-1].split('");\n          window.ZD = {')[0].replace("\\","")
>>> profile_json = json.loads(profile_data)
Traceback (most recent call last):
  File "/usr/lib/python3.6/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 41316 (char 41315)

The errors in the output are highlighted here:

[image: json]

The original HTML contains this (heavily trimmed):

<script>
   ...
   window.__REDUX_STATE__ = JSON.parse("{\"routing\": ...
   \"awards\":[\"Journal of Urology - \\\"Efficacy, Safety, and Use of Viagra in Clinical Practice.\\\"\",\"Critical Care Resident of the Year - 2003\"],
   ...

The same string, as extracted by the scrapy code above, is this:

"awards":[
               "Journal of Urology - ""Efficacy",
               "Safety",
               "and Use of Viagra in Clinical Practice.""",
               "Critical Care Resident of the Year - 2003"
            ],

It appears the backslashes have been removed from it, which makes the JSON invalid.
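The effect can be reproduced without the page itself. The sketch below builds a small stand-in for the JSON.parse(...) argument (the payload and the variable names are made up for illustration, not taken from the site) and assumes the JavaScript string literal is also a valid JSON string literal, which is not guaranteed for every page. Under that assumption, decoding once yields a str, decoding twice yields the dict, and stripping the backslashes breaks the nested quotes.

```python
import json

# Hypothetical stand-in for the real state object on the page.
payload = {
    "awards": [
        'Journal of Urology - "Efficacy, Safety, and Use of Viagra in Clinical Practice."',
        "Critical Care Resident of the Year - 2003",
    ]
}

# The page embeds the JSON text inside a quoted string literal, so the quotes
# of the JSON are escaped once and the quotes inside values are escaped twice.
js_literal = json.dumps(json.dumps(payload))

# First decode: string literal -> JSON text (this is why the type was str).
inner = json.loads(js_literal)

# Second decode: JSON text -> dict.
data = json.loads(inner)

# Dropping the outer quotes and deleting every backslash also deletes the
# escapes that protect the quotes inside the values, so parsing fails.
broken = js_literal[1:-1].replace("\\", "")
error = None
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    error = exc

print(type(inner), type(data), type(error))
```

If the literal on a given page uses JavaScript-only escapes (for example \x41 or \'), the first json.loads would fail and a JavaScript-aware parser would be the safer choice.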

I don't know whether this is an efficient way of handling the problem, but the following js2xml code resolved it for me.

>>> import js2xml
>>> profile_data = sel.css('script:contains(APOLLO_STATE)::text').get('{}')
>>> parsed = js2xml.parse(profile_data)
>>> js = json.loads(parsed.xpath("//string[contains(text(),'routing')]/text()")[0])
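For comparison, here is a rough sketch of a standard-library-only alternative. It is not the approach used above: extract_redux_state and STATE_RE are hypothetical names, the sample script text is invented, and the sketch assumes the argument of JSON.parse is a double-quoted literal that is also valid as a JSON string. It locates the literal with a regular expression that tolerates escaped quotes, then decodes twice.

```python
import json
import re

# Matches: __REDUX_STATE__ = JSON.parse("<double-quoted string literal>")
# Inside the literal, any backslash-escaped character is consumed as a pair,
# so an escaped quote does not end the match early.
STATE_RE = re.compile(
    r'__REDUX_STATE__\s*=\s*JSON\.parse\(\s*("[^"\\]*(?:\\.[^"\\]*)*")\s*\)',
    re.DOTALL,
)


def extract_redux_state(script_text):
    """Return the decoded state, or None if the pattern is not found."""
    match = STATE_RE.search(script_text)
    if match is None:
        return None
    # Decode the string literal first, then the JSON text it contains.
    return json.loads(json.loads(match.group(1)))


# Invented sample that mimics the layout of the script tag.
sample_state = {"routing": {"path": "/doctor"}, "awards": ['Paper - "Quoted title"']}
sample_script = (
    "window.__REDUX_STATE__ = JSON.parse("
    + json.dumps(json.dumps(sample_state))
    + ");\n          window.ZD = {};"
)

state = extract_redux_state(sample_script)
print(type(state))
```

In the scrapy session this would be called on the text returned by sel.css('script:contains(APOLLO_STATE)::text').get(''). js2xml remains the more robust option when the script uses JavaScript syntax that JSON does not accept.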
