
Using python ijson to read a large json file with multiple json objects

I'm trying to parse a large (~100MB) JSON file using the ijson package, which lets me work with the file efficiently. However, after writing some code like this:

import ijson

with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == "name":
            print(value)

I found that the code parses only the first line and not the rest of the lines in the file!

Here is what a portion of my JSON file looks like:

{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}

It seems that ijson parses only one JSON object.

Can someone please suggest how to work around this?

Unfortunately, the ijson library (v2.3 as of March 2018) does not handle parsing multiple JSON objects. It can only handle one top-level object, and if you attempt to parse a second one, you get an error: "ijson.common.JSONError: Additional data". Bug reports have been filed about this.

It's a big limitation. However, as long as each JSON object is followed by a line break (newline character), you can parse the objects independently, line by line, like this:

import io
import ijson

with open(filename, encoding="UTF-8") as json_file:
    cursor = 0  # character offset of the current line within the file
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        # "event" rather than "type", to avoid shadowing the built-in
        for prefix, event, value in json_parser:
            print("prefix=", prefix, "type=", event, "value=", value)
        cursor += len(line)

You are still streaming the file rather than loading it entirely into memory, so this works on large JSON files. It also uses the line streaming technique from "How to jump to a particular line in a huge text file?" and enumerate() from "Accessing the index in 'for' loops?".
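The line-by-line approach relies on every object ending with a newline. If that is not guaranteed, one possible fallback is the standard library's json.JSONDecoder.raw_decode, which reports where each decoded value ends. The following is only a sketch, not part of ijson: the function name iter_concatenated_json and the chunk_size parameter are made up for illustration, and it assumes every top-level value is an object or an array, so that an incomplete value always fails to decode.

```python
import io
import json


def iter_concatenated_json(stream, chunk_size=65536):
    """Yield top-level JSON values from a text stream, one at a time.

    Hypothetical helper: assumes each top-level value is an object or
    an array. Values may be separated by any whitespace or by nothing.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            # raw_decode does not skip leading whitespace itself
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # probably incomplete; read another chunk
            yield value
            buffer = buffer[end:]
        if not chunk:  # end of stream
            if buffer:
                raise ValueError("trailing data is not valid JSON")
            return


# Tiny chunk size to show that values split across chunks still work
sample = io.StringIO('{"a":1}{"b":2}\n{"c":[1,2]}')
values = list(iter_concatenated_json(sample, chunk_size=4))
print(values)
```

Only the undecoded tail of the input is kept in memory, so this also streams, although a single object that spans many chunks is re-parsed each time a new chunk arrives.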

Since the provided chunk looks more like a set of lines, each forming an independent JSON document, it should be parsed accordingly:

# Each JSON document is small, so there is no need for incremental parsing
import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        # data['name'], data['value'] and data['timestamp'] now
        # contain the corresponding values
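
In a real log file, a blank or truncated line would make json.loads raise partway through the loop. A slightly more defensive variant might wrap the loop in a generator, as sketched below. The helper name iter_json_lines is hypothetical, and the sample data is taken from the question.

```python
import io
import json


def iter_json_lines(lines):
    """Yield one parsed JSON value per non-blank line.

    Hypothetical helper: `lines` is any iterable of strings, such as an
    open text file. Blank lines are skipped; a malformed line raises
    ValueError naming the 1-based line number.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(
                f"line {line_number}: invalid JSON ({exc.msg})"
            ) from exc


# In-memory stand-in for the file from the question
sample = io.StringIO(
    '{"name":"engine_speed","value":772,"timestamp":1364323939.027000}\n'
    '\n'
    '{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}\n'
)
names = [record["name"] for record in iter_json_lines(sample)]
print(names)
```

With a real file, the call would look like `for record in iter_json_lines(open(filename, encoding="UTF-8")): ...`, and only one line is held in memory at a time.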
