
Processing a Very Ugly Multi-line JSON Object with Python

Okay, so another depressing day of JSON beating me up pretty badly. If this isn't scary to someone, then you are my new role model. I'm sorry, but I don't even have a reasonable attempt at this. I have thousands of files with the structure below (and the below is just a sample, so imagine many more lines in between) which I need to convert to CSV so I can load it into a database and query it. Yes, each line is technically a JSON object, but the lines have variable structures: some have nested keys and others do not. If someone can point me in the right direction, I would be tremendously grateful.

To make things slightly more terrible, the number of lines in any particular portion of a file is never consistent. When I tried to write a program that just read, say, the top 20 lines (so that I could at least process the top portion separately), the count was off.

This is what the top part of the file looks like:

{
"key":[
{"key":["val"],"key":{"key":"val","key":"val", "key":{"key":"val", "key":"val"}, "key":{"key":"val"}, "key":"val"}, "key":"val"},
{"key":["val","val","val","val"],"key":{"key":"val","key":"val"},"key":"val"},
{"key":["val"],"key":{"key":"val","key":"val", "key":{"key":"val", "key":"val"}, "key":{"key":"val"}, "key":"val"}, "key":"val"},
{"key":["val","val","val","val"],"key":{"key":"val","key":"val"},"key":"val"}
],

And this is what the bottom of the file looks like:

"key":[
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]},
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]},
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]}
]
}

Your given test data is kind of nasty because:

  1. you've replaced every key with "key", which makes json.load() return single-entry dictionaries with most of the data stomped on (duplicate keys in a JSON object overwrite each other);

  2. it doesn't actually match your description; it's a perfectly valid single JSON object, not a JSON object every few lines.

So I made up the following test data instead:

{"a": 35, "c": 16, "b": 98,
"e": 47, "d": 98, "f": 82}
{"a": 41, "c": 18, "b": 32, "e": 76, "d": 66, "f": 92}
{"a": 43, "c": 79, "b": 62, "e": 55,
"d": 86, "f": 61}
{"a": 47, "c": 49, "b": 87,
"e": 85, "d": 14, "f": 46}
{"a": 60, "c": 17, "b": 36, "e": 55, "d": 25, "f": 84}
{"a": 61, "c": 38, "b": 93, "e": 26, "d": 12, "f": 82}

Then I found the following:

import json

def iload_json(buff, decoder=None, _w=json.decoder.WHITESPACE.match):
    # found at http://www.benweaver.com/blog/decode-multiple-json-objects-in-python.html
    """Generate a sequence of top-level JSON values declared in the
    buffer.

    >>> list(iload_json('[1, 2] "a" { "c": 3 }'))
    [[1, 2], 'a', {'c': 3}]
    """
    decoder = decoder or json._default_decoder
    idx = _w(buff, 0).end()
    end = len(buff)
    try:
        while idx != end:
            (val, idx) = decoder.raw_decode(buff, idx=idx)
            yield val
            idx = _w(buff, idx).end()
    except ValueError as exc:
        raise ValueError('%s (%r at position %d).' % (exc, buff[idx:], idx))
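For what it's worth, iload_json leans on the private json.decoder.WHITESPACE and json._default_decoder helpers, which are CPython implementation details. The same streaming idea can be sketched against only the public JSONDecoder.raw_decode API:

```python
import json

def iter_json(buff):
    # Yield each top-level JSON value in the string, using only the
    # public JSONDecoder.raw_decode API (no private json internals).
    decoder = json.JSONDecoder()
    idx, end = 0, len(buff)
    while idx < end:
        while idx < end and buff[idx].isspace():
            idx += 1  # skip whitespace between values
        if idx == end:
            break
        val, idx = decoder.raw_decode(buff, idx)
        yield val

print(list(iter_json('[1, 2] "a" { "c": 3 }')))
```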

which can be used as

import glob
from itertools import chain

def gen_json_from_file(fname):
    with open(fname) as inf:
        try:
            for obj in iload_json(inf.read()):
                yield obj
        except ValueError as e:
            print("Error parsing file '{}': {}".format(fname, e))

def gen_json_from_files(filespec):
    return chain(*(gen_json_from_file(fname) for fname in glob.glob(filespec)))

for obj in gen_json_from_files("*.json"):
    try:
        print(obj["a"])
    except KeyError:
        pass

which (run against the above test data saved twice as "a.json" and "b.json") results in

35
41
43
47
60
61
35
41
43
47
60
61

as expected.

So - parsing this is not that difficult, although, given your samples, it is a little easier than what you describe.

If "each line is a JSON object" - all you'd have to do is feed each line into the JSON parser and collect the resulting objects in a list:

import json
import os

path = <path_to_thousands_of_json_files>
for filename in os.listdir(path):
    data = []
    with open(os.path.join(path, filename)) as jsonfile:
        for line in jsonfile:
            if not line.strip():
                continue  # skip empty lines and the trailing newline at end of file
            data.append(json.loads(line))
    # do your CSV output processing here.
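As a sketch of the CSV step (assuming you simply want one column per top-level key, with blanks where an object lacks a key - csv.DictWriter's restval handles that), something like this would cope with the variable structures:

```python
import csv

def write_csv(rows, outpath):
    # Collect the union of all keys, in first-seen order, so objects
    # with different structures still map onto one set of columns.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(outpath, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)

write_csv([{"a": 1, "b": 2}, {"a": 3, "c": 4}], "out.csv")
```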

But, in the samples above, each line is not a complete JSON document - it is more like the whole file is one valid JSON object, as is the norm, so just doing:

import json
import os

path = <path_to_thousands_of_json_files>
for filename in os.listdir(path):
    with open(os.path.join(path, filename)) as jsonfile:
        data = json.load(jsonfile)
    # do CSV output

should do the job for you.
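Since some of your objects have nested keys, one common approach (a sketch, not the only option) is to flatten them into dotted column names before writing the CSV:

```python
def flatten(obj, prefix=""):
    # Turn nested dicts into a single flat dict with dotted keys,
    # e.g. {"a": {"b": 1}} -> {"a.b": 1}.  Lists are left as-is.
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

print(flatten({"a": {"b": 1, "c": {"d": 2}}, "e": [3, 4]}))
```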

Now, that covers the parsing - and if your question is only about that, these should suffice as an answer. I suspect that picking up the meaning of the data and selecting the fields and titles to output in each resulting CSV file will be the bigger problem - but then, maybe you could work on it until you get the parsing working, and post more questions with more specific examples of what you are trying to get.

Note that for processing thousands of files, it is wise to use Python's iterator "pattern", so that you can keep the parsing logic above separate from the part where you process the data and create the output, and have only a single JSON file parsed in memory at a time:

import json
import os

def get_json_data(path_to_files):
    for filename in os.listdir(path_to_files):
        with open(os.path.join(path_to_files, filename)) as jsonfile:
            yield json.load(jsonfile)

def main():
    for data in get_json_data(<path_to_files>):
        # implement CSV logic here.
        ...
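As a sketch of wiring that generator to a single output CSV - the fieldnames argument and the assumption that each file holds a flat list of objects are mine, purely for illustration, so adapt both to your real structure:

```python
import csv
import json
import os

def get_json_data(path_to_files):
    # Yield one parsed JSON document per file; only one file
    # is held in memory at a time.
    for filename in os.listdir(path_to_files):
        with open(os.path.join(path_to_files, filename)) as jsonfile:
            yield json.load(jsonfile)

def main(path_to_files, outpath, fieldnames):
    with open(outpath, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames,
                                restval="", extrasaction="ignore")
        writer.writeheader()
        for data in get_json_data(path_to_files):
            # Assumed here: each document is a list of flat objects.
            for row in data:
                writer.writerow(row)
```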
