Parse large JSON file in Python
I'm trying to parse a really large JSON file in Python. The file has 6523440 lines, but it is broken into many separate JSON objects.
The structure looks like this:
[
{
"projects": [
...
]
}
]
[
{
"projects": [
...
]
}
]
....
....
....
and it goes on and on...
Every time I try to load it using json.load() I get an error:
ValueError: Extra data: line 2247 column 1 - line 6523440 column 1 (char 101207 - 295464118)
It fails on the line where the first object ends and the second one starts. Is there a way to load them separately, or anything similar?
You can try using a streaming JSON library like ijson:
Sometimes, when dealing with a particularly large JSON payload, it may be worth not constructing individual Python objects at all and instead reacting to individual events as they arrive, producing results immediately.
Try using json.JSONDecoder.raw_decode. It still requires you to have the entire document in memory, but it allows you to iteratively decode many objects from one string.
import re
import json

document = """
[
    1,
    2,
    3
]
{
    "a": 1,
    "b": 2,
    "c": 3
}
"""

# Matches the first non-whitespace character, i.e. the start of the next value.
not_whitespace = re.compile(r"\S")
decoder = json.JSONDecoder()

items = []
index = 0
while True:
    match = not_whitespace.search(document, index)
    if not match:
        break
    # raw_decode returns the decoded value and the index where decoding stopped.
    item, index = decoder.raw_decode(document, match.start())
    items.append(item)

print(items)  # [[1, 2, 3], {'a': 1, 'b': 2, 'c': 3}]
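The same loop can be applied to the structure from the question; here is a short sketch with hypothetical sample data standing in for the real file, collecting the "projects" entries from each top-level array:

```python
import json
import re

# Hypothetical stand-in for the question's file: several top-level
# JSON arrays concatenated back to back.
document = '[{"projects": ["a", "b"]}]\n[{"projects": ["c"]}]\n'

not_whitespace = re.compile(r"\S")
decoder = json.JSONDecoder()

projects = []
index = 0
while True:
    match = not_whitespace.search(document, index)
    if not match:
        break
    # Each chunk is one top-level array of objects.
    chunk, index = decoder.raw_decode(document, match.start())
    for obj in chunk:
        projects.extend(obj["projects"])

print(projects)  # -> ['a', 'b', 'c']
```

For a real 6-million-line file you would read the whole file into one string first (e.g. with open(...).read()), which is what "entire document in memory" means above.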