
How to find unique values in a large JSON file?

I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). The content of each file has the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores in each file.

With data_small, I did the following and was able to view its content within 0.1 seconds.

import json

with open('data_small') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

But with data_large, I did the following and my system hung and became so slow that I had to force a shutdown to bring it back to normal speed. It took around 2 minutes to print the content.

with open('data_large') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

How can I increase the efficiency of the program while dealing with large data-sets?

Since your JSON file is not that large and you can afford to load it into RAM all at once, you can get all the unique values like this:

import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing that much text to stdout is very slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the one-liner below, which
# first builds a list of all values and only then removes duplicates:
# values = set([i['score'] for i in content])

# It is faster to save the results to a file than to print them.
# Text mode ('w') is used because json.dump writes str, not bytes.
with open('results.json', 'w') as fid:
    # json cannot serialize sets, hence the conversion to a list.
    json.dump(list(values), fid)

If you need to process even bigger files, look for a library that can parse a JSON file iteratively (ijson is one example).

If you want to iterate over the JSON file in smaller chunks to save RAM, I suggest the approach below, based on your comment that you did not want to use ijson. It only works because your sample input is so simple: an array of dictionaries, each with one key and one value. With more complex data it would get complicated, and at that point I would use an actual JSON streaming library.

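import json

bytes_to_read = 10000
unique_scores = set()

# Binary mode is required: Python 3 does not allow the relative
# seek used below on files opened in text mode.
with open('tmp.txt', 'rb') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Stop once no further dictionaries remain in the chunk.
        if b'{' not in chunk:
            break
        # Find the span covering the complete dictionaries in the chunk.
        opening = chunk.index(b'{')
        ending = chunk.rindex(b'}')

        # Load JSON and collect scores.
        complete = chunk[opening:ending + 1].decode('utf-8')
        score_dicts = json.loads('[' + complete + ']')
        for s in score_dicts:
            # Each dictionary holds exactly one value.
            unique_scores.add(next(iter(s.values())))

        # Rewind to just after the last processed dict, then read on.
        f.seek(ending + 1 - len(chunk), 1)
        chunk = f.read(bytes_to_read)

print(unique_scores)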
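
The chunked reader above depends on every element being a flat dictionary with no braces inside strings. As a rough sketch of how the same idea could cope with nested elements using only the standard library, the code below feeds a growing buffer to json.JSONDecoder.raw_decode and yields one array element at a time. It assumes Python 3; the names iter_array_items and chunk_size are made up for illustration, and it is lenient about commas rather than a validating parser, so a real streaming library such as ijson is still the safer choice for anything complex.

```python
import io
import json

_SEPARATORS = ' \t\r\n,'    # whitespace and commas between elements
_TERMINATORS = ' \t\r\n,]'  # characters that may follow a complete element


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array read from text file fp.

    Hypothetical helper: only one chunk plus one partial element is buffered.
    """
    decoder = json.JSONDecoder()
    buf, pos = '', 0
    started = False  # True once the opening '[' has been consumed
    eof = False
    while True:
        # Skip separators in front of the next element.
        while pos < len(buf) and buf[pos] in _SEPARATORS:
            pos += 1
        if pos < len(buf):
            if not started:
                if buf[pos] != '[':
                    raise ValueError('expected a top-level JSON array')
                started = True
                pos += 1
                continue
            if buf[pos] == ']':
                return  # end of the array
            try:
                item, end = decoder.raw_decode(buf, pos)
            except ValueError:
                if eof:
                    raise  # the input really is malformed or truncated
            else:
                # Accept a value only if a terminator follows it; otherwise
                # "12" might just be the start of "123" in the next chunk.
                if eof or (end < len(buf) and buf[end] in _TERMINATORS):
                    yield item
                    pos = end
                    continue
        elif eof:
            raise ValueError('unexpected end of input')
        # More data is needed: drop the consumed part and read another chunk.
        more = fp.read(chunk_size)
        buf = buf[pos:] + more
        pos = 0
        eof = not more


# In-memory demonstration with a deliberately tiny chunk size.
sample = io.StringIO('[{"score": 68}, {"score": 78}, {"score": 68}]')
print(sorted({item['score'] for item in iter_array_items(sample, chunk_size=8)}))
```

For a real file, the same generator could be used as {item['score'] for item in iter_array_items(f)} inside a with open('data_large') as f: block, so that only the set of unique scores stays in memory.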


 