
Processing Very Ugly Multi-line JSON Object with Python

Okay, so another depressing day due to JSON beating me up pretty badly. If this is not scary to you, then you are my new role model. I'm sorry, but I don't even have a reasonable attempt at this. I have thousands of files with the structure below, and the excerpt below is just a sample from one file, so imagine the same thing for many more lines, all of which I need to format into CSV to load into a database and query. Yes, each line is technically a JSON object, but each line has a variable structure: some have nested keys and others do not. If someone can point me in the right direction, I would be tremendously grateful. To make things slightly more terrible, the number of lines in a particular portion of the file is never consistent, so when I tried to write a program that just read, say, the top 20 lines (because at least then I could process the top portion separately), I ran into an issue where the count was off.

This is what the top part of the file looks like:

{
"key":[
{"key":["val"],"key":{"key":"val","key":"val", "key":{"key":"val", "key":"val"}, "key":{"key":"val"}, "key":"val"}, "key":"val"},
{"key":["val","val","val","val"],"key":{"key":"val","key":"val"},"key":"val"},
{"key":["val"],"key":{"key":"val","key":"val", "key":{"key":"val", "key":"val"}, "key":{"key":"val"}, "key":"val"}, "key":"val"},
{"key":["val","val","val","val"],"key":{"key":"val","key":"val"},"key":"val"}
],

And this is what the bottom of the file looks like:

"key":[
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]},
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]},
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]}
]
}

The test data you've given is kind of nasty because:

  1. you've replaced every key with "key", which makes json.load() return single-entry dictionaries with most of the data stomped on (a quick demonstration follows this list);

  2. it doesn't actually match your description; it's a perfectly valid single json object, not a json object every few lines.
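
To expand on the first point: repeated names are syntactically legal JSON text, but Python's json parser keeps only the last value seen for each repeated key, so a line built entirely out of "key" collapses to almost nothing. A minimal demonstration (made up, not taken from your files):

import json

# Duplicate names are legal JSON syntax, but json.loads() silently keeps
# only the last value for each repeated key.
line = '{"key": ["val"], "key": {"key": "val"}, "key": "val"}'
print(json.loads(line))   # -> {'key': 'val'}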

So I made up the following test data instead:

{"a": 35, "c": 16, "b": 98,
"e": 47, "d": 98, "f": 82}
{"a": 41, "c": 18, "b": 32, "e": 76, "d": 66, "f": 92}
{"a": 43, "c": 79, "b": 62, "e": 55,
"d": 86, "f": 61}
{"a": 47, "c": 49, "b": 87,
"e": 85, "d": 14, "f": 46}
{"a": 60, "c": 17, "b": 36, "e": 55, "d": 25, "f": 84}
{"a": 61, "c": 38, "b": 93, "e": 26, "d": 12, "f": 82}

then I found the following

import json

def iload_json(buff, decoder=None, _w=json.decoder.WHITESPACE.match):
    # found at http://www.benweaver.com/blog/decode-multiple-json-objects-in-python.html
    """Generate a sequence of top-level JSON values declared in the
    buffer.

    >>> list(iload_json('[1, 2] "a" { "c": 3 }'))
    [[1, 2], 'a', {'c': 3}]
    """
    decoder = decoder or json._default_decoder
    idx = _w(buff, 0).end()
    end = len(buff)
    try:
        while idx != end:
            (val, idx) = decoder.raw_decode(buff, idx=idx)
            yield val
            idx = _w(buff, idx).end()
    except ValueError as exc:
        raise ValueError('%s (%r at position %d).' % (exc, buff[idx:], idx))

which can be used as

import glob
from itertools import chain

def gen_json_from_file(fname):
    with open(fname) as inf:
        try:
            for obj in iload_json(inf.read()):
                yield obj
        except ValueError as e:
            print("Error parsing file '{}': {}".format(fname, e))

def gen_json_from_files(filespec):
    return chain(*(gen_json_from_file(fname) for fname in glob.glob(filespec)))

for obj in gen_json_from_files("*.json"):
    try:
        print(obj["a"])
    except KeyError:
        pass

which (run against the above test data saved twice as "a.json" and "b.json") results in

35
41
43
47
60
61
35
41
43
47
60
61

as expected.

So, parsing this is not that difficult; in fact, given your samples, it is a little easier than what you describe.

If "each line is a JSON object" - all you'd have to do is to feed each line in to the json parser, and collect the resulting object in a list: 如果“每一行是一个JSON对象”,您要做的就是将每行输入json解析器,并将结果对象收集在一个列表中:

import json
import os

path = <path_to_thousands_of_json_files>
for filename in os.listdir(path):
    data = []
    with open(os.path.join(path, filename)) as jsonfile:
        for line in jsonfile:
            if not line.strip():
                continue  # skip empty lines and the trailing newline at end of file
            data.append(json.loads(line))
    # do your CSV output processing here.

But in the samples above, each line is not a complete JSON document; rather, the whole file is one valid JSON object, as is the norm, so just doing:

import json
import os

path = <path_to_thousands_of_json_files>
for filename in os.listdir(path):
    with open(os.path.join(path, filename)) as jsonfile:
        data = json.load(jsonfile)
    # do CSV output

should do the job for you.

Now, that covers the parsing; if your question is only about that, the above should suffice as an answer. I suppose that picking up the meaning of the data and selecting the fields and headers to output in each resulting CSV file will be the bigger problem, but perhaps you could get the parsing working first, and then post more questions with more specific examples of what you are trying to get.
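
For the CSV side, here is a minimal sketch of one possible approach, not a definitive recipe: it assumes you have already collected the parsed objects into a list or generator, flattens nested dicts by joining key names with a dot, and lets csv.DictWriter fill in missing columns with empty cells. The flatten and write_csv names are hypothetical helpers invented for illustration; which fields actually matter is something only you can decide from the real data.

import csv

def flatten(obj, prefix=""):
    # Flatten nested dicts: {"a": {"b": 1}} becomes {"a.b": 1}.
    # Lists are left as-is here; how to split them into columns is up to you.
    flat = {}
    for key, value in obj.items():
        name = "{}.{}".format(prefix, key) if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def write_csv(objects, outname):
    rows = [flatten(obj) for obj in objects]
    # Use the union of all column names, since each line may have a
    # different structure; missing values become empty cells.
    fieldnames = sorted(set().union(*(row.keys() for row in rows)))
    with open(outname, "w", newline="") as outf:
        writer = csv.DictWriter(outf, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)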

Note that for processing thousands of files, it is wise to use Python's iterator "pattern", so that you can keep the above parsing logic separate from the part where you process the data and create the output, and have only a single JSON file parsed in memory at a time:

import json
import os

def get_json_data(path_to_files):
    for filename in os.listdir(path_to_files):
        with open(os.path.join(path_to_files, filename)) as jsonfile:
            data = json.load(jsonfile)
        yield data

def main():
    for data in get_json_data(<path_to_files>):
        pass  # implement CSV logic here.
