
How to find unique values in a large JSON file?

I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). The content of each file is of the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores in each file.

While dealing with data_small, I did the following and was able to view its content in about 0.1 seconds.

import json

with open('data_small') as f:
    content = json.load(f)

print content # I'll be applying the logic to find the unique values later.

But while dealing with data_large, I did the same and my system hung and slowed down; I had to force-shut it down to bring it back to normal. It took around 2 minutes just to print its content.

with open('data_large') as f:
    content = json.load(f)

print content # I'll be applying the logic to find the unique values later.

How can I increase the efficiency of the program when dealing with large datasets?

Since your JSON file is not that large and you can afford to load it into RAM all at once, you can get all the unique values like this:

import json

with open('data_large') as f:
    content = json.load(f)

# do not print content since it prints it to stdout which will be pretty slow

# get the unique values
values = set()
for item in content:
    values.add(item['score'])

# the loop above uses less memory than the one-liner below,
# since the list comprehension first builds a list of all the values
# and only then deduplicates it into a set
values = set([i['score'] for i in content])

# it's faster to save the results to a file than to print them
with open('results.json', 'wb') as fid:
    # json can't serialize sets, hence the conversion to a list
    json.dump(list(values), fid)

If you need to process even bigger files, look for other libraries that can parse a JSON file iteratively.
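For example, a minimal sketch using the third-party ijson library mentioned below (an illustration, not part of either answer; it assumes ijson is installed and that the file is a single top-level JSON array like the sample above):

import ijson  # third-party streaming JSON parser

unique_scores = set()
with open('data_large', 'rb') as f:
    # ijson.items yields one element of the top-level array (the 'item' prefix)
    # at a time, so the whole file is never held in memory at once.
    for entry in ijson.items(f, 'item'):
        unique_scores.add(entry['score'])

print unique_scores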

If you want to iterate over the JSON file in smaller chunks to conserve RAM, I suggest the approach below, based on your comment that you did not want to use ijson for this. It only works because your sample input data is so simple: an array of dictionaries with one key and one value each. It would get complicated with more complex data, and at that point I would go with an actual JSON streaming library.

import json

bytes_to_read = 10000
unique_scores = set()

with open('tmp.txt') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Find the indices of the complete dictionaries in this chunk
        if '{' not in chunk:
            break
        opening = chunk.index('{')
        ending = chunk.rindex('}')

        # Parse the complete dicts as a JSON array and collect the scores.
        score_dicts = json.loads('[' + chunk[opening:ending+1] + ']')
        for s in score_dicts:
            unique_scores.add(s.values()[0])

        # Seek back to just past the last complete dict, so a dict split
        # across the chunk boundary is re-read in the next chunk.
        f.seek(-(len(chunk) - ending) + 1, 1)
        chunk = f.read(bytes_to_read)
print unique_scores
