How do I find the unique values in a large JSON file?

Question

I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). The content of each file has the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores in each file.

When working with data_small, I did the following and was able to view its contents in 0.1 secs.

import json

with open('data_small') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

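But when working with data_large, I did the following and my system hung and became so slow that I had to force it to shut down to get it back to normal speed. Printing the contents took around 2 mins.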

import json

with open('data_large') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

How can I make the program more efficient when dealing with large data sets?

Answer 1

Since your JSON file is not that large, you can load it into RAM in one go and collect all the unique values like this:

import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing that much to stdout is very slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the one-liner below, which first
# builds a list of every score and only then removes the duplicates:
# values = set([i['score'] for i in content])

# Saving the results to a file is faster than printing them.
# Text mode ('w') is used because json.dump writes str, not bytes.
with open('results.json', 'w') as fid:
    # json cannot serialize sets, hence the conversion to a list.
    json.dump(list(values), fid)

If you need to process even larger files, look for other libraries that can parse a JSON file iteratively.
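Short of switching libraries, one middle ground is the standard library's object_hook parameter, which lets you pick out each score while the file is being decoded and throw the dictionary away immediately. This is only a sketch: the function name unique_scores_via_hook is made up for illustration, and json.load still reads the whole file text into memory, so it should lower peak memory but it is not true streaming. I have not measured it on a 150 MB file.

```python
import json


def unique_scores_via_hook(path):
    """Return the set of distinct 'score' values in a JSON array file."""
    scores = set()

    def collect(obj):
        # The decoder calls this for every JSON object it finishes parsing.
        if 'score' in obj:
            scores.add(obj['score'])
        # Returning None discards the dict, so the decoded list only holds
        # None placeholders instead of one dict per record.
        return None

    with open(path) as f:
        json.load(f, object_hook=collect)
    return scores
```

Calling unique_scores_via_hook('data_large') would then give the same set as the loop above, without keeping every dictionary alive at once.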

Answer 2

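If you want to iterate over the JSON file in smaller chunks to save RAM, I suggest the approach below, since according to your comment you don't want to use ijson for this. It only works because your sample input is so simple: an array of dictionaries, each with one key and one value. With more complex data it would get complicated, and at that point I would use an actual JSON streaming library.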

import json

bytes_to_read = 10000
unique_scores = set()

with open('tmp.txt') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Stop once no complete dictionary is left in the chunk.
        if '{' not in chunk or '}' not in chunk:
            break
        opening = chunk.index('{')
        ending = chunk.rindex('}')

        # Parse the complete dictionaries and collect their scores.
        score_dicts = json.loads('[' + chunk[opening:ending + 1] + ']')
        for s in score_dicts:
            unique_scores.add(s['score'])

        # Keep the unparsed tail and append the next chunk to it.
        # (Relative seeks are not allowed on text files in Python 3.)
        chunk = chunk[ending + 1:] + f.read(bytes_to_read)

print(unique_scores)
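If the records ever contain nested objects, searching for '{' and '}' stops being reliable. A possible standard-library alternative is json.JSONDecoder.raw_decode, which parses one value from a given position and reports where it ended. The sketch below assumes the file is a single top-level array; the names iter_array_items and unique_scores_streaming and the chunk_size default are my own, not from the question, and the parser is lenient about stray commas. For anything serious I would still reach for a real streaming library.

```python
import json

DELIMITERS = ' \t\r\n,]'


def iter_array_items(f, chunk_size=65536):
    """Yield the elements of a top-level JSON array from text file f, one by one."""
    decoder = json.JSONDecoder()
    buf, pos = '', 0
    started = False  # True once the opening '[' has been consumed
    eof = False
    while True:
        # Skip whitespace and the commas that separate elements.
        while pos < len(buf) and buf[pos] in ' \t\r\n,':
            pos += 1
        if pos < len(buf):
            if not started:
                if buf[pos] != '[':
                    raise ValueError('expected a top-level JSON array')
                started = True
                pos += 1
                continue
            if buf[pos] == ']':
                return
            try:
                item, end = decoder.raw_decode(buf, pos)
            except ValueError:
                if eof:
                    raise  # malformed or truncated input
            else:
                # Only trust a value that is followed by a delimiter; a
                # number such as 123 may otherwise be cut off after "12".
                if eof or (end < len(buf) and buf[end] in DELIMITERS):
                    yield item
                    pos = end
                    continue
        elif eof:
            raise ValueError('unexpected end of input')
        # Need more data: drop what was consumed and read another chunk.
        chunk = f.read(chunk_size)
        eof = not chunk
        buf = buf[pos:] + chunk
        pos = 0


def unique_scores_streaming(path):
    """Collect the distinct 'score' values without loading the whole file."""
    with open(path) as f:
        return {item['score'] for item in iter_array_items(f)}
```

With this, unique_scores_streaming('data_large') only ever holds one chunk of text plus the set of scores in memory.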

Answer 1
3 2014-01-04 08:38:59

Answer 2
0 2014-01-04 08:53:19
