Very large JSON handling in Python

I have a very large JSON file (~30 GB, 65e6 lines) that I would like to process using some dataframe structure. The dataset of course does not fit into memory, so I ultimately want to use an out-of-core solution like dask or vaex. I am aware that to do this I would first have to convert the file into an already memory-mappable format such as HDF5 (if you have suggestions for the format, I'll happily take them; the dataset includes categorical features, among other things).

Two important facts about the dataset:

  1. The data is structured as a list, and each dict-style JSON object sits on a single line. This means I can very easily convert it to line-delimited JSON by stripping the square brackets and separating commas (see the sketch after this list), which is good.
  2. The JSON objects are deeply nested, and the set of keys varies between them. This means that if I use a chunked reader for line-delimited JSON (such as pandas.read_json() with lines=True and chunksize=<int>), the dataframes produced by flattening each chunk (pd.json_normalize) might not have the same columns, which is bad for streaming them into an HDF5 file.
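
A minimal sketch of the conversion from item 1, assuming the file layout described above; the file names are placeholders. It streams line by line, so it runs in constant memory:

```python
with open("data.json") as src, open("data.ndjson", "w") as dst:
    for line in src:
        line = line.strip()
        # Skip the lines carrying the opening "[" and closing "]" of the list
        if line in ("[", "]"):
            continue
        # Drop the trailing comma between array elements, keep the object
        dst.write(line.rstrip(",") + "\n")
```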

Before I spend an awful lot of time writing a script that extracts all possible keys, streams each column of a chunk one by one into the HDF5 file, and inserts NaNs wherever needed (roughly the two-pass approach sketched below): does anyone know a more elegant solution to this problem? Your help would be really appreciated.
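
For concreteness, here is a hedged sketch of that brute-force, two-pass approach, assuming the file has already been converted to line-delimited "data.ndjson" as in item 1; the file names, chunk size, and HDF5 key are placeholders:

```python
import json

import pandas as pd

CHUNK_SIZE = 100_000  # rows per chunk; tune to available memory


def iter_chunks(path, size=CHUNK_SIZE):
    """Yield lists of parsed JSON objects from a line-delimited file."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(json.loads(line))
            if len(chunk) == size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk


# Pass 1: collect the union of all flattened column names.
all_columns = set()
for records in iter_chunks("data.ndjson"):
    all_columns.update(pd.json_normalize(records).columns)
all_columns = sorted(all_columns)

# Pass 2: flatten each chunk, align it to the full column set
# (missing keys become NaN), and append it to a single HDF5 table.
for records in iter_chunks("data.ndjson"):
    flat = pd.json_normalize(records).reindex(columns=all_columns)
    # With PyTables, string columns may need min_itemsize on the first append
    flat.to_hdf("data.h5", key="data", mode="a", append=True, format="table")
```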

PS: Unfortunately I can't really share any data, but I hope the explanations above describe the structure well enough. If not, I will try to provide similar examples.

As a general rule, what you need is a stream/event-oriented JSON parser; see for example json-stream. Such a parser can handle input of any size with a fixed amount of memory: instead of loading the entire JSON document into memory, it hands you individual elements of the tree as it encounters them, and you write your processing as callbacks over those elements. If you need to do more complex or repeated processing of this data, it might make sense to store it in a database first.
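
A minimal sketch of this style of processing, shown here with ijson (a comparable event-oriented parser with the same fixed-memory model; json-stream offers an equivalent iteration style). The file name and the per-record handling are placeholders:

```python
import ijson

with open("data.json", "rb") as f:
    # "item" is ijson's prefix for the elements of a top-level array; each
    # record is built lazily while the file is read, so memory use stays
    # bounded regardless of file size.
    for record in ijson.items(f, "item"):
        print(record)  # replace with your own per-record processing
```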
