Parsing a JSON document from a website using Python

Question

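I want to parse and save the contents of a JSON document that is embedded in a page's HTML code. However, when I isolate the relevant string and try to load it with the json package, I get the error JSONDecodeError: Extra data, and I'm not sure what is causing it.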

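It was suggested that the relevant code may actually contain multiple dictionaries, and that this could be the problem, but if that is true I'm not clear on how to handle it. My code is provided below. Any suggestions would be much appreciated!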

from bs4 import BeautifulSoup
import urllib.request 
from urllib.request import HTTPError
import csv
import json
import re

def left(s, amount):
    return s[:amount]

def right(s, amount):
    return s[-amount:]

def mid(s, offset, amount):
    return s[offset:offset+amount]
url= "url"
from urllib.request import Request, urlopen
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req,timeout=200).read()
except urllib.request.HTTPError as e:
    print(str(e))  
soup = BeautifulSoup(s, "lxml")
tables=soup.find_all("script")
for i in range(0,len(tables)):
    if str(tables[i]).find("TimeLine.init")>-1:
        dat=str(tables[i]).splitlines()
        for tbl in dat:
            if str(tbl).find("TimeLine.init")>-1:
                s=str(tbl).strip()
j=json.loads(s)

Answer 1

You are trying to parse a string that looks like this:

FieldView.TimeLine.init( <first parameter - a json array>, <second parameter - a json array>, <third parameter, a json object>, true, 4, "29:58", 1798);

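The angle brackets < and > are used here only for grouping; they have no special meaning and do not actually appear in the string.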

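You will not be able to parse this directly, because it is not valid JSON. Instead, strip off the function call and add square brackets so that the function's arguments are wrapped in a JSON array.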

json.loads("[{:s}]".format(str(dat[4]).strip()[24:-2]))
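
The hard-coded slice [24:-2] relies on the exact length of the FieldView.TimeLine.init( prefix and of the trailing );. A slightly more tolerant sketch of the same idea locates the parentheses instead. The helper name parse_call_arguments and the sample line are made up for illustration, and the sketch assumes every argument of the call is itself valid JSON:

```python
import json

def parse_call_arguments(line, func_name="FieldView.TimeLine.init"):
    """Return the arguments of a JavaScript call as a Python list.

    Assumes each argument is valid JSON (arrays, objects, numbers,
    strings, true/false/null), as in the call shown above.
    """
    # Index of the first character after "func_name(".
    start = line.index(func_name + "(") + len(func_name) + 1
    # Index of the last ")" on the line, i.e. the end of the call.
    end = line.rindex(")")
    # Wrapping the comma-separated arguments in [ ] gives one JSON array.
    return json.loads("[" + line[start:end] + "]")

# Hypothetical sample line, shaped like the call described above.
sample = 'FieldView.TimeLine.init([1, 2], [{"a": 1}], {"b": [3]}, true, 4, "29:58", 1798);'
args = parse_call_arguments(sample)
print(args[2])  # the third parameter, a JSON object
```

Each element of args then corresponds to one parameter of the original call, so the first three entries are the two arrays and the object.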

Answer 2

You can use the json module's own exception reporting to help with the parsing. It gives the position at which loads() failed, for example:

Extra data: line 1 column 1977 (char 1976)
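
The same position is also available as attributes on the exception, so it does not have to be parsed out of the message text. A minimal sketch, assuming Python 3.5 or later (where json.JSONDecodeError exists) and using a made-up input string:

```python
import json

# Two JSON values separated by a comma do not form one valid JSON document.
text = '[1, 2], {"a": 3}'

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # msg, pos, lineno and colno are attributes of JSONDecodeError.
    msg, pos = e.msg, e.pos
    print(msg, pos, e.lineno, e.colno)

# Everything before the failure position is the first complete value.
first = json.loads(text[:pos])
print(first)
```

Here pos is the zero-based character offset, which is the number shown after "char" in the message above.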

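The following script first finds all the JavaScript <script> tags and then looks for the function call inside each one. It then finds the outer start and end of the JSON text. Next it attempts to decode that text, notes the offset at which decoding fails, skips that character, and tries again. Once it reaches the last block, decoding succeeds. Finally, it calls loads() on each valid block and stores the results in json_decoded: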

from bs4 import BeautifulSoup
from urllib.request import HTTPError, Request, urlopen
import csv
import json
import re

url = "url"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    print(str(e))  

json_decoded = []
soup = BeautifulSoup(s, "lxml")

for script in soup.find_all("script", attrs={"type" : "text/javascript"}):
    text = script.text
    search = 'FieldView.TimeLine.init('
    field_start = text.find(search)

    if field_start != -1:
        # Find the start and end of the JSON in the function
        json_offsets = []
        json_start = field_start + len(search)
        json_end = text.rfind('}', 0, text.find(');', json_start)) + 1

        # Extract JSON
        json_text = text[json_start : json_end]

        # Attempt to decode, and record the offsets of where the decode fails
        offset = 0

        while True:
            try:
                dat = json.loads(json_text[offset:])
                break
            except json.decoder.JSONDecodeError as e:
                # Extract failed location from the exception report
                failed_at = int(re.search(r'char\s*(\d+)', str(e)).group(1))
                offset = offset + failed_at + 1
                json_offsets.append(offset)

        # Extract each valid block and decode it to a list
        cur_offset = 0

        for offset in json_offsets:
            json_block = json_text[cur_offset : offset - 1]
            json_decoded.append(json.loads(json_block))
            cur_offset = offset

print(json_decoded)

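As a result, json_decoded holds two JSON entries. The last block, which decodes without error, ends up in dat rather than in the list.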
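
As an alternative to retrying after each failure, the standard library's json.JSONDecoder.raw_decode() returns both the decoded value and the index where it stopped, so the blocks can be walked one by one. This is not what the script above does; it is a sketch of a different technique, and the function name decode_concatenated and the sample text are hypothetical:

```python
import json

def decode_concatenated(text):
    """Decode JSON values that are separated by commas and/or whitespace."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # Skip separators; raw_decode() expects a value to start at pos.
        if text[pos] in " \t\r\n,":
            pos += 1
            continue
        # raw_decode() returns the value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

# Hypothetical text, shaped like the JSON part of the function call.
sample = '[1, 2], [{"a": 1}], {"b": [3]}'
print(decode_concatenated(sample))
```

With this approach every block, including the last one, ends up in the returned list. Text that is not valid JSON at some position still raises json.JSONDecodeError.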
