
Decode a large amount of “JSON-like” data in Python quickly

Say there are many (about 300,000) JSON files that take a long time (about 30 minutes) to load into a list of Python objects. Profiling revealed that it is in fact not the file access but the decoding that takes most of the time. Is there a format I can convert these files to that can be loaded into a Python list of objects much faster?

My attempt: I converted the files to ProtoBuf (aka Google's Protocol Buffers), and even though I got really small files (reduced to ~20% of their original size), the loading time did not improve dramatically (still more than 20 minutes to load them all).

You might be looking in the wrong direction with the conversion, as it will probably not cut your loading times as much as you would like. If the decoding is taking a lot of time, it will probably take quite some time for other formats as well, assuming the JSON decoder is not badly written. I am assuming the standard library functions have decent implementations, and JSON is not a lousy format for data storage speed-wise.

You could try running your program with PyPy instead of the default CPython implementation, which I will assume you are using. PyPy could decrease the execution time tremendously. It has a faster JSON module and uses a JIT, which might speed up your program a lot.

If you are using Python 3, you could also try using ProcessPoolExecutor to run the file loading and data deserialization / decoding concurrently, as in the sketch below. You will have to experiment with the degree of concurrency, but a good starting point is the number of your CPU cores, which you can halve or double. If your program waits for I/O a lot, you should run a higher degree of concurrency; if the share of I/O is smaller, you can try to reduce the concurrency. If you write each worker so that it loads the data into Python objects and simply returns them, you should be able to cut your loading times significantly. Note that you must use a process-based approach; using threads will not help because of the GIL.
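A minimal sketch of that approach, assuming the files sit in a hypothetical data/ directory and each file decodes to a single object (the directory name, worker count and chunk size are illustrative, not taken from the question):

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_file(path):
    # Each worker process reads and decodes one file, then returns the object.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def load_all(directory, workers=8):
    paths = sorted(Path(directory).glob("*.json"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() keeps the input order; decoded objects are pickled back to the parent.
        return list(pool.map(load_file, paths, chunksize=256))

if __name__ == "__main__":
    objects = load_all("data")  # "data" is a placeholder for the directory of JSON files
    print(len(objects))
```

Keep in mind that the decoded objects are pickled when they travel back to the parent process, which eats into the gains, so the leaner the returned objects, the better.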

You could also use a faster JSON library, which could speed up your execution times two- or three-fold in the optimal case. In a real-world use case the speed-up will probably be smaller. Do note that these might not work with PyPy, since PyPy uses an alternative CFFI-based approach and is not fully compatible with extensions written for CPython; besides, PyPy already has a good JSON module.

Try ujson, it's quite a bit faster.
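ujson exposes the same load/loads calls as the standard json module, so switching is usually a one-line change; a minimal sketch (the file name is just an example):

```python
import ujson  # pip install ujson; mirrors json.load / json.loads

with open("example.json", "r", encoding="utf-8") as f:
    data = ujson.load(f)  # same call shape as json.load, typically faster
```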

"Decoding takes most of the time" can be read as "building the Python objects takes all the time". Do you really need all of these things in RAM as Python objects all the time? That must be quite a lot of data.

I'd consider using a proper database for, e.g., querying data of that size.
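As one illustration of that idea (a sketch, not a prescription from the answer), the standard-library sqlite3 module can hold the raw JSON so you only decode the records you actually query; the database file, table and column names here are made up:

```python
import json
import sqlite3
from pathlib import Path

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, payload TEXT)")

# Store each file's raw JSON once, instead of keeping 300,000 decoded objects in RAM.
for path in Path("data").glob("*.json"):   # "data" is a placeholder directory
    with open(path, "r", encoding="utf-8") as f:
        conn.execute("INSERT INTO records VALUES (?, ?)", (path.stem, f.read()))
conn.commit()

# Later, decode only the record you need.
row = conn.execute("SELECT payload FROM records WHERE name = ?", ("some_file",)).fetchone()
obj = json.loads(row[0]) if row else None
```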

If you need mass processing of a different kind, e.g. stats or matrix processing, I'd take a look at pandas.
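A minimal sketch of that route, assuming each file decodes to a reasonably flat dict (the data/ directory is again a placeholder):

```python
import json
from pathlib import Path

import pandas as pd

records = []
for path in Path("data").glob("*.json"):   # placeholder directory
    with open(path, "r", encoding="utf-8") as f:
        records.append(json.load(f))

# json_normalize flattens the dicts into one DataFrame for stats / matrix work.
df = pd.json_normalize(records)
print(df.describe())
```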


 