How to efficiently parse a large JSON file in Python?

Question

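I have a file containing an array of JSON objects. The file is over 1GB, so I can't load it into memory all at once. I need to parse each object individually. I tried using ijson, but that loads the whole array as a single object, which effectively does the same thing as a plain json.load().

Is there another way?

Edit: Never mind, just use ijson.items() with the prefix parameter set to "item".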

Answer 1

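You could parse the JSON file once to find the position of each level-1 separator, i.e. each comma that belongs to the top-level object, and then split the file into the sections indicated by these positions. For example: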

{"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}
        ^      ^            ^        ^             ^
        |      |            |        |             |
     level-2   |         quoted      |          level-2
               |                     |
            level-1               level-1

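Here we want to find the level-1 commas, which separate the objects contained in the top-level object. We can use a generator that scans the JSON stream and keeps track of descending into and stepping out of nested objects. When it encounters a level-1 comma that is not inside a quoted string, it yields the corresponding position: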

def find_sep_pos(stream, *, sep=','):
    level = 0
    quoted = False  # True while inside a JSON string
    backslash = False  # True right after a backslash inside a string
    for pos, char in enumerate(stream):
        if backslash:
            backslash = False  # this character is escaped, skip it
        elif quoted:
            # Brackets and separators inside strings must not be counted.
            if char == '\\':
                backslash = True
            elif char == '"':
                quoted = False
        elif char == '"':
            quoted = True
        elif char in '{[':
            level += 1
        elif char in ']}':
            level -= 1
        elif char == sep and level == 1:
            yield pos

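Applied to the example data above, this gives list(find_sep_pos(example)) == [15, 37].

Then we can split the file into the sections corresponding to the separator positions and load each section individually via json.loads: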

import itertools as it
import json

with open('example.json') as fh:
    # Iterating over `fh` yields lines, so we chain them in order to get characters.
    sep_pos = tuple(find_sep_pos(it.chain.from_iterable(fh)))
    fh.seek(0)  # reset to the beginning of the file
    stream = it.chain.from_iterable(fh)
    opening_bracket = next(stream)
    closing_bracket = dict(('{}', '[]'))[opening_bracket]
    offset = 1  # the bracket we just consumed adds an offset of 1
    for pos in sep_pos:
        json_str = (
            opening_bracket
            + ''.join(it.islice(stream, pos - offset))
            + closing_bracket
        )
        obj = json.loads(json_str)  # this is your object
        next(stream)  # step over the separator
        offset = pos + 1  # adjust where we are in the stream right now
        print(obj)
    # The last object still remains in the stream, so we load it here.
    obj = json.loads(opening_bracket + ''.join(stream))
    print(obj)
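The two passes above can also be folded into one for the case in the question, a top-level array of objects. The following is only a sketch of that variation, not part of the original answer: the name iter_array_items is made up here, and it buffers the characters of the current element instead of recording separator positions.

```python
import json


def iter_array_items(chars):
    """Yield each element of a top-level JSON array from an iterable of characters."""
    level = 0
    quoted = False      # True while inside a JSON string
    backslash = False   # True right after a backslash inside a string
    buf = []            # characters of the element currently being read
    for char in chars:
        if backslash:
            backslash = False
        elif quoted:
            if char == '\\':
                backslash = True
            elif char == '"':
                quoted = False
        elif char == '"':
            quoted = True
        elif char in '{[':
            level += 1
            if level == 1:
                continue  # skip the outer opening bracket
        elif char in ']}':
            level -= 1
            if level == 0:
                break  # outer closing bracket: the array is finished
        elif char == ',' and level == 1:
            # A level-1 comma ends the current element.
            yield json.loads(''.join(buf))
            buf.clear()
            continue
        if level >= 1:
            buf.append(char)
    text = ''.join(buf).strip()
    if text:
        yield json.loads(text)  # the last element has no trailing comma


# Iterating over a str yields characters; for a file object, the lines
# can be chained with itertools.chain.from_iterable as shown above.
sample = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "note": "x, y]"}]'
items = list(iter_array_items(sample))
```

Only one element is held in memory at a time, but the scan still runs character by character in pure Python, so it is likely to be slow on a file of 1GB or more.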

Answer 2

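Two options:

Parse it on the command line with a tool such as jq, then bring the result into Python for further processing.
Parse it with PySpark (the Databricks community edition gives you a free space).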

How to use JQ
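As a rough illustration of the first option: a command such as jq -c '.[]' big.json > items.jsonl (both file names are hypothetical) writes each array element on its own line, and Python can then read that output one line at a time. The sketch below covers only the Python side, and the helper name iter_json_lines is made up for this example. Be aware that this plain jq invocation may itself read the whole input into memory; jq has a --stream mode intended for very large inputs.

```python
import io
import json


def iter_json_lines(lines):
    """Yield one parsed object per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Stand-in for jq's output; in practice `lines` could be an open text
# file or the stdout pipe of a subprocess running jq.
jq_output = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
objects = list(iter_json_lines(jq_output))
```

Because each line is a complete JSON document, memory use on the Python side stays proportional to the largest single object.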

2 solutions

Solution 1
3 2020-03-16 10:35:10

Solution 2
0 2020-03-17 20:31:37