
Decode a large amount of “JSON-like” data in Python quickly

Say there are many (about 300,000) JSON files that take a long time (about 30 minutes) to load into a list of Python objects. Profiling revealed that it is in fact not the file access but the decoding that takes most of the time. Is there a format I could convert these files to that can be loaded much faster into a Python list of objects?
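Simplified, the loading code looks roughly like this (a minimal sketch; the directory name and the structure of the decoded objects are placeholders, not the actual data):

```python
import json
from pathlib import Path

def load_all(directory):
    """Decode every .json file in `directory` into a list of Python objects."""
    records = []
    for path in Path(directory).glob("*.json"):
        with open(path, "r", encoding="utf-8") as f:
            records.append(json.load(f))
    return records

# data = load_all("data/")  # ~300,000 small files in this scenario
```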

My attempt: I converted the files to ProtoBuf (aka Google's Protocol Buffers), but even though I got really small files (reduced to ~20% of their original size), the time to load them did not improve that dramatically (it still took more than 20 minutes to load them all).

You might be looking in the wrong direction with the conversion, as it will probably not cut your loading times as much as you would like. If decoding takes a lot of time, it will probably take quite some time with other formats as well, assuming the JSON decoder is not badly written. The standard library functions have decent implementations, and JSON is not a lousy format for data storage speed-wise.

You could try running your program with PyPy instead of the default CPython implementation, which I will assume you are using. PyPy could decrease the execution time tremendously: it has a faster JSON module and uses a JIT compiler that might speed up the rest of your program considerably as well.

If you are using Python 3, you could also try using ProcessPoolExecutor to run the file loading and data deserialization/decoding concurrently. You will have to experiment with the degree of concurrency; a good starting point is the number of your CPU cores, which you can then halve or double. If your program spends a lot of time waiting for I/O, run a higher degree of concurrency; if there is less I/O, try reducing it. If you write each worker so that it loads the data into Python objects and simply returns them, you should be able to cut your loading times significantly; see the sketch below. Note that you must use a process-based approach: because of the GIL, threads will not speed up the CPU-bound decoding.
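A minimal sketch of that approach, assuming the files live in a single directory and each worker just decodes one file and returns the resulting object (the directory name, worker count, and chunk size are illustrative and need tuning for your data):

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_one(path):
    """Decode a single JSON file into a Python object (runs in a worker process)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def load_all(directory, workers=8):
    """Decode all .json files in `directory` in parallel, returning a list of objects."""
    paths = sorted(Path(directory).glob("*.json"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in the same order as the input paths;
        # chunksize reduces inter-process overhead for many small tasks.
        return list(pool.map(load_one, paths, chunksize=256))

if __name__ == "__main__":  # guard is required for process pools on Windows/macOS
    data = load_all("data/", workers=8)  # start `workers` near your CPU core count
```

Keep in mind that each decoded object is pickled back to the parent process, so if the individual objects are very large the inter-process transfer can become the new bottleneck.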

You could also use a faster JSON library, which could speed up your execution times two- or three-fold in the best case. In a real-world use case the speed-up will probably be smaller. Do note that these libraries are mostly C extensions built for CPython, so under PyPy you would need an alternative CFFI-based build if one exists, and PyPy has a good JSON module anyway.

Try ujson, it's quite a bit faster.
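ujson is designed as a drop-in replacement for the standard json module, so in many cases switching is a one-line change (a sketch; install it with pip install ujson first, and the file name here is just a placeholder):

```python
import ujson as json  # drop-in replacement for the standard library json module

with open("example.json", "r", encoding="utf-8") as f:
    obj = json.load(f)  # same call signature as the standard json.load
```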

"Decoding takes most of the time" can be seen as "building the Python objects takes all the time". Do you really need all these things as Python objects in RAM all the time? It must be quite a lot.

I'd consider using a proper database for, e.g., querying data of this size.

If you need mass processing of a different kind, e.g. stats or matrix processing, I'd take a look at pandas.
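For the pandas route, a hedged sketch: if each file holds one record (or a flat-ish dict), the decoded objects can be collected into a single DataFrame for stats-style processing (the directory name is a placeholder, and pd.json_normalize is just one way to flatten nested dicts):

```python
import json
from pathlib import Path
import pandas as pd

# Decode each file, then flatten the (possibly nested) dicts into one table.
records = []
for path in Path("data/").glob("*.json"):   # "data/" is a placeholder directory
    with open(path, "r", encoding="utf-8") as f:
        records.append(json.load(f))

df = pd.json_normalize(records)             # one row per file/record
# df.describe(), df.groupby(...), etc. for stats or matrix-style processing
```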
