Python: count occurrences in a list using dict comprehension/generator

I want to write some tests to analyse the efficiency of different operations in Python, namely a comparison of dictionary comprehensions and dict generators (generator expressions passed to dict()).

To test this out, I thought I would try a simple example: count how often each word occurs in a list, using dictionaries.

Now I know that you can do this using collections.Counter (as per an answer here: How can I count the occurrences of a list item in Python?), but my objective was to test performance and memory.

One "long-hand" way is to do it in a basic loop.

from pprint import pprint

# Read in some text to create example data
with open('text.txt') as f:
    words = f.read().split()

dict1 = {}
for w in words:
    if not dict1.get(w):
        dict1[w] = 1
    else:
        dict1[w] += 1
pprint(dict1)

The result:

{'a': 62,
 'aback': 1,
 'able': 1,
 'abolished': 2,
 'about': 6,
 'accept': 1,
 'accepted': 1,
 'accord': 1,
 'according': 1,
 'across': 1,
 ...

Then I got a bit stuck trying to do the same in a dictionary comprehension:

dict2  = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
            for w in words }

I got an error:

NameError: global name 'dict2' is not defined

I tried defining the dict up front:

dict2 = {}
dict2  = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
            for w in words }
pprint(dict2)

But of course the counts are all set to 1:

{'a': 1,
 'aback': 1,
 'able': 1,
 'abolished': 1,
 'about': 1,
 'accept': 1,
 'accepted': 1,
 'accord': 1,
 'according': 1,
 'across': 1,
 ...
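
The behaviour above follows from evaluation order: the comprehension builds a new dict while the name dict2 still refers to the empty dict, and the name is rebound only once the comprehension has finished. A minimal sketch, using a made-up word list rather than the contents of text.txt:

```python
# Hypothetical sample data, not the contents of text.txt.
words = ['a', 'b', 'a', 'c', 'a']

dict2 = {}
empty = dict2  # keep a second reference to the original, empty dict

# Every dict2.get(w) below consults the empty dict, so each value is 1.
dict2 = {w: 1 if not dict2.get(w) else dict2.get(w) + 1 for w in words}

print(dict2)  # {'a': 1, 'b': 1, 'c': 1}
print(empty)  # {} (the dict the comprehension was reading from)
```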

I had a similar problem with a generator expression passed to dict():

dict3 = dict( (w, 1 if not dict2.get(w) else dict2.get(w) + 1)
                for w in words)

So my question is: how can I use a dictionary comprehension/generator most efficiently to count the number of occurrences in a list?

Update: @Rawing suggested an alternative approach, {word: words.count(word) for word in set(words)}, but that would circumvent the mechanism I am trying to test.

You cannot do this efficiently (at least in terms of memory) using a dict comprehension, because you would have to keep track of the current count in another dictionary, i.e. more memory consumption. Here's how you can do it using a dict comprehension (not recommended at all :-)):

>>> words = list('asdsadDASDFASCSAASAS')
>>> dct = {}
>>> {w: 1 if w not in dct and not dct.update({w: 1})
                  else dct[w] + 1
                  if not dct.update({w: dct[w] + 1}) else 1 for w in words}
>>> dct
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}
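
One detail worth noting when reading the output above: the counts end up in the side dictionary dct. The dict built by the comprehension itself holds different values for repeated keys, because the update call in the condition runs before dct[w] + 1 is evaluated. A small check, using a made-up input:

```python
words = list('abab')  # illustrative input, not from the original answer
dct = {}
result = {w: 1 if w not in dct and not dct.update({w: 1})
          else dct[w] + 1
          if not dct.update({w: dct[w] + 1}) else 1 for w in words}

print(dct)     # {'a': 2, 'b': 2}  (the real counts)
print(result)  # {'a': 3, 'b': 3}  (one too high for repeated words)
```

So the comprehension is used only for its side effects here, which is part of why it is not recommended.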

Another way is to sort the words list first, group the words using itertools.groupby, and then count the length of each group. The dict comprehension here can be converted to a generator if you want, but this still requires reading all the words into memory first:

from itertools import groupby
words.sort()
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}
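
As a sketch of the generator form mentioned above, the same grouping can be fed to dict() as a generator expression. Using sorted() instead of words.sort() is a choice made here so that the original list is left untouched; it still materialises all the words.

```python
from itertools import groupby

words = list('asdsadDASDFASCSAASAS')  # sample data from the earlier example

# Generator of (word, count) pairs; groupby needs sorted input so that
# equal words sit next to each other.
pairs = ((k, sum(1 for _ in g)) for k, g in groupby(sorted(words)))
dct = dict(pairs)
print(dct)
```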

Note that the fastest one of the lot is collections.defaultdict:

from collections import defaultdict

d = defaultdict(int)
for w in words:
    d[w] += 1

Timing comparisons:

>>> from string import ascii_letters, digits
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
10 loops, best of 3: 131 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
10 loops, best of 3: 169 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 315 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = defaultdict(int)
... for w in words: d[w] += 1
... 
10 loops, best of 3: 57.1 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**4
d = {}
for w in words: d[w] = d.get(w, 0) + 1
... 
10 loops, best of 3: 108 ms per loop

#Increase input size 

>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
1 loops, best of 3: 1.44 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
1 loops, best of 3: 1.7 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 3.19 s per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = defaultdict(int)
for w in words: d[w] += 1
... 
1 loops, best of 3: 571 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = {}
for w in words: d[w] = d.get(w, 0) + 1
... 
1 loops, best of 3: 1.1 s per loop
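
The %timeit lines above need IPython. A rough equivalent with the standard-library timeit module might look like the sketch below. The function names and the smaller input size are choices made here, the list is built once outside the timed code (unlike the runs above), and the absolute numbers will differ by machine and Python version.

```python
import timeit
from collections import Counter, defaultdict
from itertools import groupby
from string import ascii_letters, digits

words = list(ascii_letters + digits) * 10**3  # smaller than the runs above

def with_groupby():
    return {k: sum(1 for _ in g) for k, g in groupby(sorted(words))}

def with_counter():
    return dict(Counter(words))

def with_defaultdict():
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return dict(d)

def with_get():
    d = {}
    for w in words:
        d[w] = d.get(w, 0) + 1
    return d

candidates = [with_groupby, with_counter, with_defaultdict, with_get]

# All approaches should agree before their timings are compared.
expected = with_get()
assert all(f() == expected for f in candidates)

for f in candidates:
    # Best of 3 repeats, each repeat calling the function 10 times.
    best = min(timeit.repeat(f, number=10, repeat=3))
    print(f"{f.__name__}: {best / 10 * 1e3:.2f} ms per call")
```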

You can do it this way:

>>> words=['this','that','is','if','that','is','if','this','that']
>>> {i:words.count(i) for i in words}
{'this': 2, 'is': 2, 'if': 2, 'that': 3}
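
Note that words.count(i) rescans the whole list on every call, so the comprehension above performs as many scans as there are elements. Iterating over set(words), as in the approach suggested by @Rawing in the question, gives the same result with one scan per distinct word. A small sketch:

```python
words = ['this', 'that', 'is', 'if', 'that', 'is', 'if', 'this', 'that']

per_element = {w: words.count(w) for w in words}        # 9 scans of the list
per_distinct = {w: words.count(w) for w in set(words)}  # 4 scans of the list

print(per_element == per_distinct)  # True
```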

This is a use case that comprehensions are not well suited to, or efficient for.

A comprehension is good when you can build the collection in one single operation. That is not really the case here, since:

  • either you take the words as they come and change values in the dict accordingly
  • or you first compute the key set (Rawing's solution), but then you traverse the list once to get the key set, and once more per key

IMHO, the most efficient way is the iterative one.
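
To illustrate the iterative approach with memory in mind, here is a sketch that counts words while reading line by line, so the full word list never has to exist at once. The helper name count_words and the in-memory sample text are hypothetical; an open file object could be passed the same way.

```python
import io

def count_words(lines):
    """Count whitespace-separated words from an iterable of lines."""
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# io.StringIO stands in for an open text file such as text.txt.
sample = io.StringIO("a b a\nc a b\n")
print(count_words(sample))  # {'a': 3, 'b': 2, 'c': 1}
```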
