
Parsing an incomplete JSON array

I have downloaded the first 5 MB of a very large JSON file, and I need to load that fragment to generate a preview of the file. However, the fragment will probably be incomplete. Here's an example of what it may look like:

[{
    "first": "bob",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {
    "first": "sarah",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {"first" : "tom"

From here, I'd like to "rebuild it" so that it can parse the first two objects (and ignore the third).

Is there a json parser that can infer or cut off the end of the string to make it parsable? Or perhaps to 'stream' the parsing of the json array, so that when it fails on the last object, I can exit the loop? If not, how could the above be accomplished?

If your data will always look somewhat similar, you could do something like this:

import json

json_string = """[{
    "first": "bob",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {
    "first": "sarah",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {"first" : "tom"
"""

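# Drop one trailing character at a time until closing the list yields valid JSON.
while True:
    if not json_string:
        raise ValueError("Couldn't fix JSON")
    try:
        data = json.loads(json_string + "]")
    except json.JSONDecodeError:
        # Still invalid: remove the last character and retry.
        json_string = json_string[:-1]
        continue
    break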

print(data)

This assumes that the data is a list of dicts. On each iteration, the loop appends the missing ] and tries to parse the result. If the string parses as JSON, the loop breaks. Otherwise the last character is removed and the attempt is repeated. If no characters are left, ValueError("Couldn't fix JSON") is raised. Because every attempt reparses the whole string, this can be slow when the incomplete tail is long.

For the above example, it prints:

[{'first': 'bob', 'address': {'street': 13301, 'zip': 1920}}, {'first': 'sarah', 'address': {'street': 13301, 'zip': 1920}}]
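
A variation on the same idea, as a rough sketch: since an object can only end at a `}`, it should be enough to retry at each `}` from the right instead of removing one character at a time. The helper name `load_truncated_array` and the shortened sample string are made up for illustration. Unlike the loop above, this version raises `ValueError` when not even one complete object is present.

```python
import json


def load_truncated_array(text):
    """Parse a cut-off JSON array, dropping the incomplete tail.

    Hypothetical helper: it retries only at positions where an object
    could end (a '}'), instead of removing one character at a time.
    """
    end = len(text)
    while True:
        # Find the last '}' strictly before the current cut point.
        end = text.rfind("}", 0, end)
        if end == -1:
            raise ValueError("Couldn't fix JSON")
        try:
            return json.loads(text[:end + 1] + "]")
        except json.JSONDecodeError:
            # This '}' closed a nested object (or sat inside a string);
            # try the previous one.
            continue


# Shortened stand-in for the truncated download (assumed sample data).
truncated = (
    '[{"first": "bob", "address": {"zip": 1920}}, '
    '{"first": "sarah"}, '
    '{"first": "tom", "address": {"zip": 1'
)
preview = load_truncated_array(truncated)
```

Each retry still reparses the kept prefix, but the number of retries is bounded by the number of closing braces in the incomplete tail rather than by its length in characters.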

For the specific structure in the example, we can walk through the string and track opening curly brackets and their closing counterparts. If one or more curly brackets remain unmatched at the end, the last object is incomplete. We can then cut the string at the first unmatched bracket, strip any trailing characters such as commas or whitespace, and close the resulting string with a square bracket. Note that this assumes no curly brackets occur inside string values.

This method ensures that the text is scanned only twice, once manually and once by the JSON parser, which can be advantageous for large files whose incomplete trailing object is many characters long.

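import json

# Truncated input (a shortened version of the example in the question).
string = '[{"first": "bob"}, {"first": "sarah"}, {"first" : "tom"'

brackets = []  # indices of '{' characters that are still open
for i, c in enumerate(string):
    if c == '{':
        brackets.append(i)
    elif c == '}':
        brackets.pop()

# Cut off the incomplete trailing object, if there is one.
if brackets:
    string = string[:brackets[0]]

# Drop a dangling comma or whitespace, then close the array if needed.
string = string.rstrip(', \n')
if not string.endswith(']'):
    string += ']'

data = json.loads(string)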
