
Fastest Method To Read Thousands of JSON Files in Python

I have a number of JSON files I need to analyze. I am using iPython (Python 3.5.2 | IPython 5.0.0), reading each file into a dictionary and appending each dictionary to a list.

My main bottleneck is reading in the files. Some files are small and read quickly, but the larger files are slowing me down.

Here is some example code (sorry, I cannot provide the actual data files):

import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
        p.close()
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)

event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)

The support tickets are very small in size--the largest I've seen is 6KB. So, this code runs pretty fast:

In [3]: len(support_files)
Out[3]: 5278

In [5]: %timeit giant_list(support_files)
1 loop, best of 3: 557 ms per loop

But the larger files are definitely slowing me down... these event files can reach ~2.5MB each:

In [7]: len(event_files) # there will be a lot more of these soon :-/
Out[7]: 397

In [8]: %timeit giant_list(event_files)
1 loop, best of 3: 14.2 s per loop

I've researched how to speed up the process and came across this post; however, when using UltraJSON the timing was just slightly worse:

In [3]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop

SimpleJSON did not do much better:

In [4]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop
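
(Presumably the swap was a drop-in import change along these lines, since ujson and simplejson both mirror the standard library's json.load / json.loads interface; the snippet below is a sketch, not the exact code used:)

import ujson  # or: import simplejson as ujson

def read_json_files(path_to_file):
    # Identical to the reader above; only the parsing library changes
    with open(path_to_file) as p:
        return ujson.load(p)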

Any tips on how to optimize this code and read a lot of JSON files into Python more efficiently would be much appreciated.

Finally, this post is the closest I've found to my question, but it deals with one giant JSON file rather than many smaller ones.

Use a list comprehension to avoid resizing the list multiple times:

def giant_list(json_files):
    return [read_json_file(path) for path in json_files]

You are also closing the file object twice; do it only once (on exiting the with block, the file is closed automatically):

def read_json_file(path_to_file):
    with open(path_to_file) as p:
        return json.load(p)

At the end of the day, your problem is I/O bound, but these changes will help a little. Also, I have to ask: do you really need to have all these dictionaries in memory at the same time?
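
If not, a lazy, generator-based reader is one alternative--a minimal sketch, assuming the downstream analysis can consume one parsed file at a time (iter_json_files is just an illustrative name):

import glob
import json

def iter_json_files(pattern):
    # Yield one parsed JSON document at a time instead of building a giant list
    for path in glob.glob(pattern):
        with open(path) as fp:
            yield json.load(fp)

# Hypothetical usage: each event dict is processed and can then be discarded
for event_data in iter_json_files('/Users/path/to/google_analytics_data_*.json'):
    pass  # analyze event_data here

This keeps memory usage bounded by the largest single file rather than the sum of all of them.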
