
Processing large JSON with multiple root elements and read into pandas dataframe

I want to (pre)process large JSON files (5-10GB each), which contain multiple root elements. These root elements follow each other without a separator, like this: {}{}....

So I first wrote the following simple code to get a valid JSON file:

import pandas as pd
from io import StringIO

with open(file) as f:
    file_data = f.read()

# Join the concatenated root objects into one valid JSON array
file_data = file_data.replace("}{", "},{")
file_data = "[" + file_data + "]"
df = pd.read_json(StringIO(file_data))  # pandas expects a path or buffer, not a raw string

Obviously this doesn't work with large files; even a 400MB file fails. (I have 16GB of memory.)

I've read that it's possible to work with chunks, but I can't fit this into any "chunk logic". Is there a way to "chunkenize" this?

I'd be glad for your help.

I am having a hard time visualizing the multiple root element idea, but you should write the file_data contents to disk and try reading it in separately. If you have the file open, it will consume RAM in addition to the RAM consumed by the file_data object (and possibly even the modified object, though that's a garbage-collector question; I think garbage collection happens after the function returns). Try calling f.close() explicitly instead of using with, and return the result from a separate function.
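One way to get actual chunk logic is to stream the file through json.JSONDecoder.raw_decode, which parses a single object off the front of a buffer and reports where it ended, so the concatenated {}{} objects can be consumed one at a time without the "},{" rewrite and without holding the whole file in memory. Below is a minimal sketch of that idea; the iter_json_objects helper name, the "big.json" filename, the 1MB read size, and the 50,000-row batch size are all illustrative choices, and it assumes the objects are flat enough for pd.json_normalize:

import json
import pandas as pd

def iter_json_objects(path, read_size=1024 * 1024):
    """Yield parsed objects from a file of concatenated JSON values ({}{}...)."""
    decoder = json.JSONDecoder()
    buf = ""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(read_size)
            if not chunk:
                break
            buf += chunk
            while True:
                buf = buf.lstrip()  # skip whitespace between objects
                try:
                    obj, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # object still incomplete -- read more data first
                yield obj
                buf = buf[end:]  # drop the consumed object from the buffer

# Process the stream in batches instead of building one giant DataFrame
batch = []
for obj in iter_json_objects("big.json"):
    batch.append(obj)
    if len(batch) >= 50_000:
        df = pd.json_normalize(batch)
        # ... filter/aggregate/write df here, then let it go ...
        batch = []
if batch:
    df = pd.json_normalize(batch)

With this scheme, peak memory is bounded by the largest single root element plus the read buffer and one batch, rather than by the file size.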
