Very large JSON handling in Python
I have a very large JSON file (~30GB, 65e6 lines) that I would like to process using some dataframe structure. This dataset of course does not fit into my memory, so I ultimately want to use an out-of-core solution like dask or vaex. I am aware that in order to do this I would first have to convert it into a memory-mappable format like hdf5 (if you have suggestions for the format, I'll happily take them; the dataset includes categorical features among other things).
Two important facts about the dataset:
Before I spend an awful lot of time writing a script that extracts all possible keys, streams each column of a chunk one by one to the hdf5 file, and inserts NaNs wherever needed: does anyone know a more elegant solution to this problem? Your help would be really appreciated.
PS: Unfortunately I can't really share any data, but I hope that the explanations above describe the structure well enough. If not, I will try to provide similar examples.
As a general rule, what you need is a stream/event-oriented JSON parser. See for example json-stream. Such a parser can handle input of any size with a fixed amount of memory. Instead of loading the entire JSON into memory, the parser calls your functions in response to individual elements in the tree. You can write your processing in callback functions.
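To make the callback idea concrete, here is a minimal sketch using only the standard library. It is not json-stream and not a true event-level parser: it assumes the file is a sequence of top-level JSON objects separated by whitespace (for example one object per line), which the question does not confirm. The names `stream_records` and `on_record` are made up for illustration. Memory use is bounded by the chunk size plus the largest single record.

```python
import io
import json
from typing import Callable, IO


def stream_records(fp: IO[str], on_record: Callable[[dict], None],
                   chunk_size: int = 1 << 16) -> int:
    """Decode top-level JSON objects from fp one at a time.

    Assumption: the input is a series of JSON objects separated by
    whitespace, not one giant array. Calls on_record for each object
    and returns the number of objects seen.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    count = 0
    while True:
        chunk = fp.read(chunk_size)
        at_eof = chunk == ""
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos >= len(buffer):
                break
            try:
                record, end = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if at_eof:
                    raise  # leftover data that is not valid JSON
                break  # record is incomplete, read another chunk
            on_record(record)
            count += 1
            pos = end
        # Keep only the unparsed tail so the buffer stays small.
        buffer = buffer[pos:]
        if at_eof:
            return count


# Example callback: collect the union of keys across all records.
all_keys = set()
sample = io.StringIO('{"id": 1, "color": "red"}\n{"id": 2, "size": 3.5}\n')
stream_records(sample, lambda rec: all_keys.update(rec))
print(sorted(all_keys))
```

For the real file you would pass an open file handle instead of the `io.StringIO` sample. If the file turns out to be a single huge array or nested document, a real event-oriented parser is the better fit.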
If you need to do more complex or repeated processing of this data, it might make sense to store it in a database first.
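As a rough illustration of the database route, the sketch below loads records into SQLite with the standard `sqlite3` module. It assumes you already know the full list of column names (for instance from a first pass over the file) and that those names are trusted, since they are interpolated into the SQL. The function name `load_into_sqlite`, the table name `records`, and the batch size are all hypothetical choices. Keys missing from a record become NULL, and nested values are stored as JSON text.

```python
import json
import sqlite3
from typing import Iterable, Sequence


def _to_sql_value(value):
    # SQLite stores only scalars; serialise nested structures as JSON text.
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


def load_into_sqlite(conn: sqlite3.Connection, records: Iterable[dict],
                     columns: Sequence[str], table: str = "records",
                     batch_size: int = 10_000) -> int:
    """Insert records in batches; returns the number of rows written."""
    quoted = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    insert_sql = f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})'

    batch = []
    total = 0
    for record in records:
        # dict.get returns None for absent keys, which SQLite stores as NULL.
        batch.append(tuple(_to_sql_value(record.get(c)) for c in columns))
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total
```

Because `records` can be any iterable, it can be fed from a generator that reads the file incrementally, so the whole dataset never has to sit in memory at once.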