Reading a large JSON file with multiple JSON objects using Python ijson

Question

I'm trying to parse a large (~100 MB) JSON file using the ijson package, which lets me work with the file in a memory-efficient way. However, after writing code like this:

import ijson

# filename is the path to the JSON file
with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == "name":
            print(value)

I found that the code parses only the first line and none of the remaining lines in the file.

Here is what a portion of my JSON file looks like:

{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}

I suspect ijson parses only one JSON object.

Can someone please suggest how to work around this?

Answer 1

Unfortunately, the ijson library (v2.3 as of March 2018) does not handle parsing multiple JSON objects. It can only handle one overall object, and if you attempt to parse a second object you will get an error: "ijson.common.JSONError: Additional data". See bug reports here:

It's a big limitation. However, as long as there is a line break (newline character) after each JSON object, you can parse each line independently, like this:

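import io
import ijson

with open(filename, encoding="UTF-8") as json_file:
    cursor = 0  # running character offset of the current line (text mode, so not bytes)
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        cursor += len(line)
        if not line.strip():
            continue  # a blank line is not a JSON document, so skip it
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        # "event" instead of "type" avoids shadowing the built-in type()
        for prefix, event, value in json_parser:
            print("prefix=", prefix, "event=", event, "value=", value)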

You are still streaming the file rather than loading it entirely into memory, so this works on large JSON files. It also uses the line-streaming technique from "How to jump to a particular line in a huge text file?" and enumerate() from "Accessing the index in 'for' loops?".
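If the objects are not reliably separated by line breaks, a standard-library alternative may help. The sketch below is not part of the original answer and does not use ijson. It assumes every top-level value is an object or array, and the helper name `iter_json_values` and its `chunk_size` parameter are invented for illustration. It reads the stream in chunks and uses `json.JSONDecoder.raw_decode` to peel off one complete value at a time.

```python
import io
import json


def iter_json_values(stream, chunk_size=65536):
    """Yield each top-level JSON value from a text stream of concatenated values.

    Assumption: top-level values are objects or arrays. A bare number split
    across two chunks could be misread as two separate numbers.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        # Decode as many complete values as the buffer currently holds.
        while True:
            buffer = buffer.lstrip()  # drop whitespace between values
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of input with an incomplete or invalid value
                break  # the value may be split across chunks, so read more
            yield value
            buffer = buffer[end:]
        if not chunk:
            break


# Two objects with no separator at all; a real run would pass an open file.
sample = io.StringIO(
    '{"name":"engine_speed","value":772}{"name":"vehicle_speed","value":0}'
)
for record in iter_json_values(sample, chunk_size=16):
    print(record["name"], record["value"])
```

Only the current buffer is held in memory, so the memory cost is bounded by the chunk size plus the largest single object. The trade-off is that a value spanning several chunks is re-scanned each time more data arrives.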

Answer 2

Since the provided chunk looks like a set of lines, each one an independent JSON document, it should be parsed accordingly:

# Each JSON document is small, so incremental parsing is unnecessary
import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        # data['name'], data['value'], data['timestamp'] now
        # contain the corresponding values
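The same loop can be wrapped in a generator so that callers can filter records lazily. This is a minimal sketch rather than the answerer's code: `iter_records` is a hypothetical helper name, the blank-line handling is an added assumption, and the sample rows are copied from the question.

```python
import io
import json


def iter_records(lines):
    """Yield one decoded object per non-blank line of line-delimited JSON."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line failed, since the decoder only knows the column
            raise ValueError(f"line {line_number}: invalid JSON: {exc}") from exc


# Sample rows from the question; a real run would pass an open file instead.
sample = io.StringIO(
    '{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}\n'
    '{"name":"engine_speed","value":772,"timestamp":1364323939.027000}\n'
    '{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}\n'
)

engine_speeds = [r["value"] for r in iter_records(sample) if r["name"] == "engine_speed"]
print(engine_speeds)
```

Because `iter_records` accepts any iterable of lines, passing the file object from `open(filename)` keeps memory use at one line at a time.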

Answer 1: 6, 2018-03-01 21:33:05

Answer 2: 5, accepted, 2016-05-13 03:08:19
