
Parse a JSON file from a website with Python

I am looking to parse and save the contents of a JSON file that is embedded in the HTML code of a page. However, when I isolate the relevant string and try to load it with the json package, I receive the error JSONDecodeError: Extra data, and I am unsure what is causing it.

It was suggested that the relevant code might actually contain multiple dictionaries and that this could be the problem, but I'm not clear on how to proceed if that is true. My code is provided below. Any suggestions are much appreciated!

from bs4 import BeautifulSoup
import urllib.request 
from urllib.request import HTTPError
import csv
import json
import re

def left(s, amount):
    return s[:amount]

def right(s, amount):
    return s[-amount:]

def mid(s, offset, amount):
    return s[offset:offset+amount]
url= "url"
from urllib.request import Request, urlopen
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req,timeout=200).read()
except urllib.request.HTTPError as e:
    print(str(e))  
soup = BeautifulSoup(s, "lxml")
tables=soup.find_all("script")
for i in range(0,len(tables)):
    if str(tables[i]).find("TimeLine.init")>-1:
        dat=str(tables[i]).splitlines()
        for tbl in dat:
            if str(tbl).find("TimeLine.init")>-1:
                s=str(tbl).strip()
j=json.loads(s)

You're trying to parse a string that looks like this:

FieldView.TimeLine.init( <first parameter - a json array>, <second parameter - a json array>, <third parameter, a json object>, true, 4, "29:58", 1798);

The angle brackets, < and >, only serve to group things here; they have no special meaning and are not actually present in the string.

You won't be able to parse that as is, because it is not valid JSON. Instead, strip the function call and wrap the function's parameters in, e.g., square brackets so that they form a single JSON array.

json.loads("[{:s}]".format(str(dat[4]).strip()[24:-2]))
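As a minimal sketch of the same idea that does not rely on fixed slice offsets, the parameters can be cut out between the first "(" and the last ")". The sample line below is a made-up stand-in for the real script content, and the approach assumes every parameter is itself a valid JSON value (double-quoted strings, true/false, numbers):

```python
import json

# Hypothetical stand-in for the line extracted from the <script> tag.
line = ('FieldView.TimeLine.init([{"id": 1}], [{"id": 2}], '
        '{"home": "A"}, true, 4, "29:58", 1798);')

# Keep only the text between the first "(" and the last ")".
args_text = line[line.index("(") + 1 : line.rindex(")")]

# Wrapping the parameters in brackets turns them into one JSON array.
params = json.loads("[" + args_text + "]")

first_array, second_array, settings = params[0], params[1], params[2]
print(len(params), settings)
```

Each function parameter then becomes one element of params, so the two arrays and the object can be picked out by index.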

You could use the json module's own exception reporting to help with parsing; it gives the location at which loads() failed, for example:

Extra data: line 1 column 1977 (char 1976)
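The offset shown in that message is also available directly on the exception as the pos attribute, which avoids parsing the message text. A small sketch with an invented input string (two JSON values separated by a comma) illustrates this:

```python
import json

# Invented example: two JSON values back to back, which is not one document.
text = '[1, 2], {"a": 3}'

error_message = None
error_pos = None
first_value = None

try:
    json.loads(text)
except json.JSONDecodeError as e:
    error_message = e.msg   # short description without the location
    error_pos = e.pos       # index in text where decoding stopped
    # Everything before the failing index is the first complete value.
    first_value = json.loads(text[:e.pos])

print(error_message, error_pos, first_value)
```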

The following script first locates all the JavaScript <script> tags and looks for the function call inside each. It then finds the outer start and end of the JSON text. Next it attempts to decode that text, notes the failing offset, skips that character and tries again. When the final block is reached, it decodes successfully. The script then calls loads() on each block delimited by the recorded offsets, storing the results in json_decoded:

from bs4 import BeautifulSoup
from urllib.request import HTTPError, Request, urlopen
import json
import re

url = "url"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    print(str(e))

json_decoded = []
soup = BeautifulSoup(s, "lxml")

for script in soup.find_all("script", attrs={"type" : "text/javascript"}):
    text = script.text
    search = 'FieldView.TimeLine.init('
    field_start = text.find(search)

    if field_start != -1:
        # Find the start and end of the JSON in the function
        json_offsets = []
        json_start = field_start + len(search)
        json_end = text.rfind('}', 0, text.find(');', json_start)) + 1

        # Extract JSON
        json_text = text[json_start : json_end]

        # Attempt to decode, and record the offsets of where the decode fails
        offset = 0

        while True:
            try:
                dat = json.loads(json_text[offset:])
                break
            except json.decoder.JSONDecodeError as e:
                # Extract failed location from the exception report
                failed_at = int(re.search(r'char\s*(\d+)', str(e)).group(1))
                offset = offset + failed_at + 1
                json_offsets.append(offset)

        # Extract each valid block and decode it to a list
        cur_offset = 0

        for offset in json_offsets:
            json_block = json_text[cur_offset : offset - 1]
            json_decoded.append(json.loads(json_block))
            cur_offset = offset

print(json_decoded)

This results in json_decoded holding two JSON entries; the final block, which decoded without error, is left in dat.
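A different technique from the one used in the script above is json.JSONDecoder.raw_decode(), which decodes one value and returns the index where it ended, so no exception handling is needed to find the block boundaries. This is only a sketch: decode_all is a hypothetical helper name, and the sample string stands in for the text between the parentheses of the function call.

```python
import json

def decode_all(text):
    """Decode every comma-separated JSON value found in text."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace or separators itself.
        while pos < len(text) and text[pos] in ", \t\r\n":
            pos += 1
        if pos >= len(text):
            break
        # Returns the decoded value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

# Invented sample mimicking the parameter list of the function call.
sample = '[1, 2], [3], {"a": 4}, true, 4, "29:58", 1798'
decoded = decode_all(sample)
print(decoded)
```

Unlike the offset-based loop, this collects every value, including the last one, in a single list.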

