
Parse json file from website with python

I am looking to parse and save the contents of a JSON file that is embedded in the HTML code of a page. However, when I isolate the relevant string and try to load it with the json package, I receive the error JSONDecodeError: Extra data, and I am unsure what is causing this.

It was suggested that the relevant code could actually contain multiple dictionaries and that this might be the problem, but I'm not clear on how to proceed if that is true. My code is provided below. Any suggestions are much appreciated!

import json
from urllib.request import HTTPError, Request, urlopen

from bs4 import BeautifulSoup

def left(s, amount):
    return s[:amount]

def right(s, amount):
    return s[-amount:]

def mid(s, offset, amount):
    return s[offset:offset+amount]

url = "url"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    print(str(e))

soup = BeautifulSoup(s, "lxml")
tables = soup.find_all("script")
# Keep the line of the <script> tag that contains the TimeLine.init call
for i in range(0, len(tables)):
    if str(tables[i]).find("TimeLine.init") > -1:
        dat = str(tables[i]).splitlines()
        for tbl in dat:
            if str(tbl).find("TimeLine.init") > -1:
                s = str(tbl).strip()
# This is the call that raises JSONDecodeError
j = json.loads(s)

You're trying to parse a string that looks like this:

FieldView.TimeLine.init( <first parameter - a json array>, <second parameter - a json array>, <third parameter, a json object>, true, 4, "29:58", 1798);

The angle brackets, < and >, only serve to group the descriptions here; they have no special meaning and are not actually present in the string.

You won't be able to parse that directly, because it is not valid JSON. Instead, strip the function call and wrap the function's parameters in, e.g., square brackets so that they form a JSON array.

json.loads("[{:s}]".format(str(dat[4]).strip()[24:-2]))
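
As a self-contained illustration of the same idea, here is a minimal sketch that runs on a made-up sample line rather than the real page data. It locates the parentheses with find and rfind instead of fixed slice positions; extract_call_args is a hypothetical helper name, and the sketch assumes every argument of the call is itself valid JSON.

```python
import json

def extract_call_args(line):
    """Return the arguments of a single JS call on `line` as a Python list.

    Assumption: every argument is valid JSON (array, object, number,
    string, true/false/null) and the line holds exactly one call.
    """
    start = line.find("(") + 1   # first character after the opening parenthesis
    end = line.rfind(")")        # closing parenthesis of the call
    return json.loads("[" + line[start:end] + "]")

# Made-up sample in the same shape as the real call (not the actual data)
sample = 'FieldView.TimeLine.init([1, 2], [{"a": 1}], {"b": [3]}, true, 4, "29:58", 1798);'
args = extract_call_args(sample)
print(args[2])   # the third parameter, decoded to a dict
```

Searching for the parentheses keeps the slice correct if the function name or the trailing characters change, which the hard-coded [24:-2] would not.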

You could use the json module's own exception reporting to help with parsing: the exception gives the location where loads() failed, for example:

Extra data: line 1 column 1977 (char 1976)
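
The same location is also available as attributes on the exception, so the message does not have to be parsed. A small sketch on made-up input (two JSON arrays back to back), which may help to see where the "Extra data" error comes from:

```python
import json

text = '[1, 2], [3, 4]'   # made-up input: two JSON values, not one document

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # msg is the short description; pos is the 0-based index where decoding stopped
    msg, pos = e.msg, e.pos
    print(msg, pos, e.lineno, e.colno)

# Everything before the failing position is a complete JSON value
first = json.loads(text[:pos])
print(first)
```

Here the decoder reads the first array, then finds a comma where it expected the end of the input, which is what "Extra data" reports.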

The following script first locates all the JavaScript <script> tags and looks for the function call inside each. It then finds the outer start and end of the JSON text. It then attempts to decode that text, notes the failing offset, skips that character and tries again. When the final block is reached, it decodes successfully. The script then calls loads() on each valid block, storing the results in json_decoded:

from bs4 import BeautifulSoup
from urllib.request import HTTPError, Request, urlopen
import json
import re

url = "url"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    print(str(e))

json_decoded = []
soup = BeautifulSoup(s, "lxml")

for script in soup.find_all("script", attrs={"type" : "text/javascript"}):
    text = script.text
    search = 'FieldView.TimeLine.init('
    field_start = text.find(search)

    if field_start != -1:
        # Find the start and end of the JSON in the function
        json_offsets = []
        json_start = field_start + len(search)
        json_end = text.rfind('}', 0, text.find(');', json_start)) + 1

        # Extract JSON
        json_text = text[json_start : json_end]

        # Attempt to decode, and record the offsets of where the decode fails
        offset = 0

        while True:
            try:
                dat = json.loads(json_text[offset:])
                break
            except json.decoder.JSONDecodeError as e:
                # Extract failed location from the exception report
                failed_at = int(re.search(r'char\s*(\d+)', str(e)).group(1))
                offset = offset + failed_at + 1
                json_offsets.append(offset)

        # Extract each valid block and decode it to a list
        cur_offset = 0

        for offset in json_offsets:
            json_block = json_text[cur_offset : offset - 1]
            json_decoded.append(json.loads(json_block))
            cur_offset = offset

print(json_decoded)

This results in json_decoded holding two JSON entries. The final block, which decodes successfully inside the while loop, is left in dat and is not appended to the list.
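
An alternative that avoids reading offsets out of error messages is json.JSONDecoder.raw_decode, which decodes one value and returns the index where it stopped. The following is only a sketch on made-up data, not the original page; decode_concatenated is a hypothetical helper name, and it assumes the values are separated by commas and/or whitespace. Unlike the loop above, it collects every value, including the last one.

```python
import json

def decode_concatenated(text):
    """Decode consecutive JSON values separated by commas and/or whitespace."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading separators, so step over them here
        if text[pos] in ", \t\r\n":
            pos += 1
            continue
        # Returns the decoded value and the index just past it
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

# Made-up sample: two arrays followed by an object
sample = '[1, 2], [{"a": 1}], {"b": [3]}'
blocks = decode_concatenated(sample)
print(blocks)
```

In the script above, the same function could be applied to json_text in place of the retry loop.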


 