
very large JSON handling in Python

I have a very large JSON file (~30 GB, ~65 million lines) that I would like to process using some dataframe structure. The dataset of course does not fit into memory, so I ultimately want to use an out-of-core solution like dask or vaex. I am aware that to do this I would first have to convert it into a memory-mappable format like HDF5 (if you have suggestions for the format, I'll happily take them; the dataset includes categorical features, among other things).

Two important facts about the dataset:

  1. The data is structured as a list, and each dict-style JSON object sits on its own line. This means I can very easily convert it to line-delimited JSON by removing the square brackets and trailing commas, which is good.
  2. The JSON objects are deeply nested, and keys are not consistently present across objects. This means that if I use a line-delimited JSON reader that reads chunks sequentially (like pandas.read_json() with lines=True and chunksize=int), the resulting dataframes after flattening (pd.json_normalize) might not share the same columns, which is bad for streaming them into an HDF5 file.
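The conversion described in point 1 can be sketched with the standard library alone. This is a hypothetical helper, assuming the layout described above (array brackets and trailing commas at line boundaries); each cleaned line is re-parsed as a sanity check:

```python
import json

def to_ndjson_lines(lines):
    """Turn a JSON array with one object per line into NDJSON lines.

    Assumes '[' / ']' sit on their own lines (or fused with the first/last
    object) and each object ends with a trailing comma, as described above.
    """
    for raw in lines:
        # Drop surrounding whitespace, array brackets, and trailing commas.
        line = raw.strip().strip("[],")
        if not line:
            continue
        json.loads(line)  # raises if the line is not one complete object
        yield line
```

Validating every line costs an extra parse per object; for a 30 GB file you might drop the `json.loads` check once you trust the layout.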

Before I spend an awful lot of time writing a script that extracts all possible keys and streams each column of a chunk one by one to the HDF5 file, inserting NaNs wherever needed: does anyone know a more elegant solution to this problem? Your help would be really appreciated.
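For context, the two-pass approach I have in mind could look roughly like this (a stdlib-only sketch with hypothetical helper names; pandas' json_normalize uses the same dotted-path flattening):

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} form (pandas-style paths)."""
    out = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=path + "."))
        else:
            out[path] = value
    return out

def collect_columns(records):
    """First pass: union of all flattened key paths across the records."""
    columns = set()
    for rec in records:
        columns.update(flatten(rec))
    return sorted(columns)

def normalize(rec, columns):
    """Second pass: a full row with None (-> NaN in pandas) for missing keys."""
    flat = flatten(rec)
    return {col: flat.get(col) for col in columns}
```

With the full column list in hand, every chunk can be reindexed to the same schema before being appended to the HDF5 file.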

PS: Unfortunately I can't really share any data, but I hope the explanation above describes the structure well enough. If not, I will try to provide similar examples.

As a general rule, what you need is a stream/event-oriented JSON parser; see for example json-stream. Such a parser can handle input of any size in a fixed amount of memory. Instead of loading the entire JSON document into memory, the parser calls your functions in response to individual elements in the tree, so you can write your processing in callback functions. If you need to do more complex or repeated processing of this data, it might make sense to store it in a database first.
