Python: count occurrences in a list using dict comprehension/generator
I want to write some tests to analyse the efficiency of different operations in Python, namely a comparison of dictionary comprehensions and dict generators.
To test this out, I thought I would try a simple example: count the number of occurrences of each word in a list using dictionaries.
Now I know that you can do this using collections.Counter (as per an answer here: How can I count the occurrences of a list item in Python?), but my objective was to test performance and memory.
One "long-hand" way is to do it in a basic loop.
from pprint import pprint

# Read in some text to create example data
with open('text.txt') as f:
    words = f.read().split()

dict1 = {}
for w in words:
    if not dict1.get(w):
        dict1[w] = 1
    else:
        dict1[w] += 1
pprint(dict1)
The result:
{'a': 62,
'aback': 1,
'able': 1,
'abolished': 2,
'about': 6,
'accept': 1,
'accepted': 1,
'accord': 1,
'according': 1,
'across': 1,
...
Then I got a bit stuck trying to do the same in a dictionary comprehension:
dict2 = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
for w in words }
I got an error:
NameError: global name 'dict2' is not defined
I tried defining the dict up front:
dict2 = {}
dict2 = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
for w in words }
pprint(dict2)
But of course the counts are all set to 1:
{'a': 1,
'aback': 1,
'able': 1,
'abolished': 1,
'about': 1,
'accept': 1,
'accepted': 1,
'accord': 1,
'according': 1,
'across': 1,
...
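A minimal sketch of why this happens, assuming a small made-up word list: the whole comprehension is evaluated before the name dict2 is rebound, so every dict2.get(w) call still sees the original empty dict. The lookup helper and seen_sizes list are illustrative names added here only to make that visible.

```python
words = ['a', 'b', 'a', 'c', 'a']  # made-up sample data

dict2 = {}
seen_sizes = []  # records len(dict2) each time the comprehension reads it

def lookup(w):
    # dict2 is still the empty dict defined above; the comprehension's
    # result is bound to the name only after the whole expression finishes.
    seen_sizes.append(len(dict2))
    return dict2.get(w)

dict2 = {w: 1 if not lookup(w) else lookup(w) + 1 for w in words}

print(dict2)       # every count is 1
print(seen_sizes)  # all zeros: the dict being read never grew
```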
I had a similar problem when passing a generator expression to dict():
dict3 = dict( (w, 1 if not dict2.get(w) else dict2.get(w) + 1)
for w in words)
So my question is: how can I use a dictionary comprehension/generator most efficiently to count the number of occurrences in a list?
Update: @Rawing suggested an alternative approach, {word:words.count(word) for word in set(words)}, but that would circumvent the mechanism I am trying to test.
You cannot do this efficiently (at least in terms of memory) using a dict comprehension, because you would have to keep track of the current count in another dictionary, i.e. more memory consumption. Here's how you can do it using a dict comprehension (not recommended at all :-)):
>>> words = list('asdsadDASDFASCSAASAS')
>>> dct = {}
>>> {w: 1 if w not in dct and not dct.update({w: 1})
else dct[w] + 1
if not dct.update({w: dct[w] + 1}) else 1 for w in words}
>>> dct
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}
Another way is to sort the words list first, then group the words using itertools.groupby and count the length of each group. Here the dict comprehension can be converted to a generator if you want, but yes, this requires reading all words into memory first:
from itertools import groupby
words.sort()
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}
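As a self-contained illustration of the groupby approach, here is a small sketch on a made-up word list. It uses sorted() on a copy instead of words.sort() (my substitution, so the input list is left untouched) and also shows why the sort step cannot be skipped.

```python
from collections import Counter
from itertools import groupby

words = ['this', 'that', 'is', 'if', 'that', 'is', 'if', 'this', 'that']

# groupby only merges *adjacent* equal items, so sort a copy first.
counts = {k: sum(1 for _ in g) for k, g in groupby(sorted(words))}

# Without sorting, repeated words land in separate groups and later
# groups overwrite earlier ones, so the counts come out too low.
unsorted_counts = {k: sum(1 for _ in g) for k, g in groupby(words)}

print(counts)
print(counts == Counter(words))    # same totals as collections.Counter
print(unsorted_counts == counts)   # False for this input
```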
Note that the fastest one of the lot is collections.defaultdict:
from collections import defaultdict

d = defaultdict(int)
for w in words:
    d[w] += 1
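To make the mechanism explicit, here is a short sketch (using the same sample string as the comprehension example earlier in this answer) of what defaultdict(int) does on a missing key, next to the plain-dict spelling. The variable name plain is mine.

```python
from collections import defaultdict

words = list('asdsadDASDFASCSAASAS')

d = defaultdict(int)
for w in words:
    # A missing key triggers int(), which returns 0, so the first
    # occurrence of w becomes 0 + 1 without an explicit membership test.
    d[w] += 1

# The plain-dict equivalent spells out the default on every iteration.
plain = {}
for w in words:
    plain[w] = plain.get(w, 0) + 1

print(dict(d))
print(dict(d) == plain)
```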
Timing comparisons:
>>> from string import ascii_letters, digits
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
10 loops, best of 3: 131 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
10 loops, best of 3: 169 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 315 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = defaultdict(int)
... for w in words: d[w] += 1
...
10 loops, best of 3: 57.1 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**4
d = {}
for w in words: d[w] = d.get(w, 0) + 1
...
10 loops, best of 3: 108 ms per loop
#Increase input size
>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
1 loops, best of 3: 1.44 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
1 loops, best of 3: 1.7 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 3.19 s per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = defaultdict(int)
for w in words: d[w] += 1
...
1 loops, best of 3: 571 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = {}
for w in words: d[w] = d.get(w, 0) + 1
...
1 loops, best of 3: 1.1 s per loop
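The timings above rely on IPython's %timeit magic. For anyone who wants to re-run the comparison as a plain script, here is a rough sketch using the standard timeit module. The function names, the smaller default input size, and the choice to build the list outside the timed call are my own assumptions, so the numbers will not be directly comparable to those above and will vary by machine and Python version.

```python
import timeit
from collections import Counter, defaultdict
from itertools import groupby
from string import ascii_letters, digits

def make_words(repeat=10**3):
    # Smaller than the 10**4 used above so the sketch finishes quickly.
    return list(ascii_letters + digits) * repeat

def with_groupby(words):
    return {k: sum(1 for _ in g) for k, g in groupby(sorted(words))}

def with_counter(words):
    return Counter(words)

def with_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

def with_get(words):
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d

CANDIDATES = [with_groupby, with_counter, with_defaultdict, with_get]

def run_benchmark(repeat=10**3, number=3):
    words = make_words(repeat)
    results = {}
    for func in CANDIDATES:
        # Best of three repeats, each running the function `number` times.
        best = min(timeit.repeat(lambda: func(words), number=number, repeat=3))
        results[func.__name__] = best / number
    return results

if __name__ == '__main__':
    for name, seconds in run_benchmark().items():
        print('%-18s %.3f ms per loop' % (name, seconds * 1e3))
```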
You can do it this way:
>>> words=['this','that','is','if','that','is','if','this','that']
>>> {i:words.count(i) for i in words}
{'this': 2, 'is': 2, 'if': 2, 'that': 3}
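One caveat, sketched below: each words.count(i) call scans the whole list, and iterating over words repeats that scan for every element, duplicates included. Iterating over set(words), as suggested in the question's update, gives the same dict with one scan per distinct word. The variable names are mine.

```python
words = ['this', 'that', 'is', 'if', 'that', 'is', 'if', 'this', 'that']

# Iterating over words calls words.count() once per element (9 scans here).
per_element = {w: words.count(w) for w in words}

# Iterating over set(words) calls it once per distinct word (4 scans here).
per_distinct = {w: words.count(w) for w in set(words)}

print(per_element == per_distinct)
```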
This is a use case for which a comprehension is not well suited or efficient.
A comprehension is good when you can build the collection in one single operation. That is not really the case here, since:
IMHO, the most efficient way is the iterative one.
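One possible reading of "the iterative one", as a hedged sketch: a single pass that updates the counts as words arrive, so only the counts dict has to be held in memory rather than the full word list. The count_words function and the sample text are illustrative and not part of the original answer.

```python
import io

def count_words(lines):
    """Count whitespace-separated words from any iterable of lines."""
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# io.StringIO stands in for a real file such as the question's text.txt;
# an open file object can be passed in the same way.
sample = io.StringIO(u"a b a\nc a b\n")
print(count_words(sample))
```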