
Python - find the unique values from a large JSON file efficiently

I have a JSON file data_large of size 150.1 MB. The content inside the file is of the form [{"score": 68},{"score": 78}]. I need to find the unique scores across all the items.

This is what I'm doing:

import ijson  # the JSON file is large, so stream it with ijson

f = open('data_large')
content = ijson.items(f, 'item')  # returns a lazy generator, so this line is instant compared to json.load(f)
print set(i['score'] for i in content)  # this is the line that takes a long time

Can I make the print set(i['score'] for i in content) line more efficient? Currently it takes 201 seconds to execute.

This will give you the set of unique score values (only) as ints. You'll need 150 MB of free memory. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).

import re
import time

t = time.time()
obj = re.compile(r'{.*?: (\d*?)}')  # capture the digits inside each {"score": N} object
with open('datafile.txt', 'r') as f:
    data = f.read()  # read the whole file into one string
s = set(m.group(1) for m in obj.finditer(data))  # dedupe the scores as strings
s = set(map(int, s))  # convert only the unique values to ints
print time.time() - t
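Note the two-step design above: the matches are deduplicated as strings first, so int() only runs once per unique value instead of once per match.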

Using re.findall() also seems to be about three times faster than the json parser, but it consumes about 260 MB:

import re

obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))  # findall builds the full list of matches before deduping
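If holding the whole file in one Python string is a concern, one variation (not from the original answer) is to mmap the file and let re scan it in place, so the OS pages the data in on demand. A minimal sketch, assuming Python 2 as in the snippets above and the same datafile.txt name:

import mmap
import re

obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    # map the file read-only; in Python 2, re can scan a mmap buffer directly
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        s = set(int(m.group(1)) for m in obj.finditer(mm))
    finally:
        mm.close()
print s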

I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (consuming the generator from ijson.items), the whole file needs to be processed eventually.

The advantage of using ijson is that you only need to hold a small amount of data in memory at any given time. This probably doesn't matter much for a file with a hundred or so megabytes of data, but it would be a very big deal if your data file grew to gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to run on a high-performance server or workstation with lots of RAM available, there may not be any reason to hold back.

So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might test whether using json.load to read the whole file at the start and then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.

On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?

#!/usr/local/cpython-2.7/bin/python

from __future__ import print_function

import json  # plain json.load: read the whole file up front

with open('data_large', 'r') as file_:
    content = json.load(file_)
    print(set(element['score'] for element in content))
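For reference, a small script along these lines can generate a comparable test file. The 10,000,000-item count matches the benchmark above; the 0-100 score range and the streaming write are assumptions of this sketch, not part of the original answer:

#!/usr/local/cpython-2.7/bin/python

import random  # the score range below is an assumption for test data only

with open('data_large', 'w') as file_:
    file_.write('[')
    for i in xrange(10000000):
        if i:
            file_.write(',')
        file_.write('{"score": %d}' % random.randint(0, 100))
    file_.write(']')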

Try using a set:

set([x['score'] for x in scores])

For example:

>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])
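On Python 2.7 and later, a set comprehension gives the same result without building the intermediate list first:

>>> {x['score'] for x in scores}
set([65, 78])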
