
Python counting occurrence in a list: how to make it faster?

I have a list of strings containing about 6 million items, and I am trying to count the occurrences of each unique value.

Here is my code:

lines = [...]    # placeholder: about 6 million strings
unique_val = list(set(lines))    # contains around 500k items

mydict = {}
for val in unique_val:
    mydict[val] = lines.count(val)

The code above runs very slowly because the list I am counting is huge.

Is there a way to make it faster?

Many thanks

If you don't want to use the collections module:

counts = dict()
for line in lines:
    counts[line] = counts.get(line,0) + 1

Or, if you just don't want to use Counter:

from collections import defaultdict
counts = defaultdict(int)
for line in lines:
    counts[line] += 1
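
For comparison, the Counter version that both snippets above work around could look like the sketch below. The sample list is made up for illustration; Counter and most_common come from the standard collections module.

```python
from collections import Counter

# Hypothetical sample data standing in for the 6 million strings.
lines = ["apple", "pear", "apple", "fig", "pear", "apple"]

# Counter is a dict subclass; it tallies every item in a single pass.
counts = Counter(lines)

# The two most frequent values with their counts, highest first.
top_two = counts.most_common(2)
```

Because Counter is a dict subclass, counts can be used anywhere the plain dict from the snippets above would be.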

How about this:

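from collections import Counter, defaultdict

# lines: the list of strings from the question

d = defaultdict(int)
for line in lines:
    # Counter(line) iterates over the characters of each string, so this
    # tallies characters across all lines, not whole-line occurrences.
    for char, count in Counter(line).items():
        d[char] += count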

Numpy Solution

I think numpy will give you the fastest answer, using unique:

import numpy as np

result = dict(zip(*np.unique(lines, return_counts=True)))

Numpy is pretty heavily optimized under the hood. Per the linked docs, the key is the return_counts flag:

return_counts : bool, optional

If True, also return the number of times each unique value comes up in ar.


Timing

I timed your original approach, the Counter approach

result = Counter(lines)

and the numpy approach on a data set generated by

N = 1000000
lines = [chr(i%100) for i in range(N) ]

Obviously, that test isn't great coverage, but it's a start.

Your approach ran in 0.584 s, DeepSpace's Counter in 0.162 s (3.5x speedup), and numpy in 0.0861 s (7x speedup). Again, this may depend on a lot of factors, including the type of data you have. The conclusion may be that either numpy or a Counter will provide a speedup, with Counter not requiring an external library.
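
A comparison like this can be reproduced with the standard timeit module. The sketch below is not the harness used for the numbers above; the function names are made up, N is reduced so the slow version finishes quickly, and only the two standard-library approaches are timed.

```python
import timeit
from collections import Counter

N = 100_000  # assumption: smaller than the 1,000,000 used above
lines = [chr(i % 100) for i in range(N)]

def count_with_list_count(items):
    # Original approach: one full scan of the list per unique value.
    return {val: items.count(val) for val in set(items)}

def count_with_counter(items):
    # Counter approach: a single pass over the list.
    return dict(Counter(items))

# timeit.repeat returns one elapsed time per repetition; keep the best.
timings = {}
for func in (count_with_list_count, count_with_counter):
    runs = timeit.repeat(lambda: func(lines), number=1, repeat=3)
    timings[func.__name__] = min(runs)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

The absolute numbers will vary by machine and data, so only the ratio between the two is worth reading.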

Calling list.count is very expensive: each call scans the entire list, so calling it once per unique value costs O(n * k) for n items and k unique values. Dictionary access (O(1) amortized time) and the in operator on a dict, however, are relatively cheap. The following snippet makes a single pass over the list, for O(n) overall.

def stats(lines):
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram
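
A quick sanity check on a small made-up list: the single-pass histogram should agree with what list.count reports for each unique value. The function is repeated so the example runs on its own; the sample data and the names fast and slow are illustrative only.

```python
def stats(lines):
    # Same single-pass histogram as above, repeated so this runs on its own.
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram

sample = ["x", "y", "x", "z", "x", "y"]  # hypothetical data

fast = stats(sample)
# The question's approach: one list.count call per unique value.
slow = {val: sample.count(val) for val in set(sample)}

print(fast == slow)
```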

