
very large JSON handling in Python

I have a very large JSON file (~30 GB, ~65 million lines) that I would like to process using some dataframe structure. The dataset of course does not fit into memory, so I ultimately want to use an out-of-core solution like dask or vaex. I am aware that to do this I would first have to convert it into a memory-mappable format like HDF5 (if you have suggestions for the format, I'll happily take them; the dataset includes categorical features, among other things).

Two important facts about the dataset:

  1. The data is structured as a list, and each dict-style JSON object sits on its own line. This means I can very easily convert it to line-delimited JSON by removing the square brackets and trailing commas, which is good.
  2. The JSON objects are deeply nested, and keys are not consistently present across objects. This means that if I use a line-delimited JSON reader that reads chunks sequentially (like pandas.read_json() with lines=True and chunksize=int), the resulting dataframes after flattening (pd.json_normalize) might not share the same columns, which is bad for streaming them into an HDF5 file.
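The conversion described in point 1 can be sketched with the standard library alone. This is a hypothetical helper, assuming the layout described above (array brackets and trailing commas at line boundaries); each cleaned line is re-parsed as a sanity check:

```python
import json

def to_ndjson_lines(lines):
    """Turn a JSON array with one object per line into NDJSON lines.

    Assumes '[' / ']' sit on their own lines (or fused with the first/last
    object) and each object ends with a trailing comma, as described above.
    """
    for raw in lines:
        # Drop surrounding whitespace, array brackets, and trailing commas.
        line = raw.strip().strip("[],")
        if not line:
            continue
        json.loads(line)  # raises if the line is not one complete object
        yield line
```

Validating every line costs an extra parse per object; for a 30 GB file you might drop the `json.loads` check once you trust the layout.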

Before I spend an awful lot of time writing a script that extracts all possible keys and streams each column of a chunk one by one to the HDF5 file, inserting NaNs wherever needed: does anyone know a more elegant solution to this problem? Your help would be really appreciated.
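For context, the two-pass approach I have in mind could look roughly like this (a stdlib-only sketch with hypothetical helper names; pandas' json_normalize uses the same dotted-path flattening):

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} form (pandas-style paths)."""
    out = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=path + "."))
        else:
            out[path] = value
    return out

def collect_columns(records):
    """First pass: union of all flattened key paths across the records."""
    columns = set()
    for rec in records:
        columns.update(flatten(rec))
    return sorted(columns)

def normalize(rec, columns):
    """Second pass: a full row with None (-> NaN in pandas) for missing keys."""
    flat = flatten(rec)
    return {col: flat.get(col) for col in columns}
```

With the full column list in hand, every chunk can be reindexed to the same schema before being appended to the HDF5 file.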

PS: Unfortunately I can't really share any data, but I hope the explanation above describes the structure well enough. If not, I will try to provide similar examples.

As a general rule, what you need is a stream/event-oriented JSON parser; see for example json-stream. Such a parser can handle input of any size in a fixed amount of memory. Instead of loading the entire JSON document into memory, the parser calls your functions in response to individual elements in the tree, so you can write your processing in callback functions. If you need to do more complex or repeated processing of this data, it might make sense to store it in a database first.
