
Python counting occurrence in a list: how to make it faster?

I have a list of strings containing about 6 million items, and I am trying to count the occurrences of each unique value.

Here is my code:

lines = [6 million strings]
unique_val = list(set(lines))    # contains around 500k items

mydict = {}
for val in unique_val:
    mydict[val] = lines.count(val)

I've found that the above code runs very slowly, since the list I am counting is huge.

Is there a way to make it faster?

Many thanks

If you don't want to use the collections module:

counts = dict()
for line in lines:
    counts[line] = counts.get(line,0) + 1

Or, if you just don't want to use Counter:

from collections import defaultdict
counts = defaultdict(int)
for line in lines:
    counts[line] += 1
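For comparison, the collections.Counter approach that the two snippets above are offered as alternatives to can be sketched roughly as follows. The sample `lines` list is an invented stand-in for the 6 million strings in the question.

```python
from collections import Counter

# Hypothetical sample data standing in for the real list of strings.
lines = ["apple", "pear", "apple", "fig", "pear", "apple"]

# Counter makes a single pass over the list and tallies each distinct value.
counts = Counter(lines)

# Counter is a dict subclass, so it converts cleanly to a plain dict.
mydict = dict(counts)

# most_common(n) returns the n most frequent values with their counts.
top_two = counts.most_common(2)
```

Because the tally is built in one pass, the cost grows with the length of the list rather than with the length multiplied by the number of unique values.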

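How about this? Note that Counter(line) counts the elements inside each line, so this tallies words only if each line is already a list of words; for a plain string it tallies characters.

from collections import Counter, defaultdict

lines = [6 million strings]

d = defaultdict(int)
for line in lines:
    # Count the elements of this line, then merge them into the running totals.
    for word, count in Counter(line).items():
        d[word] += count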

Numpy Solution

I think numpy will give you the fastest answer, using unique:

import numpy as np

result = dict(zip(*np.unique(lines, return_counts=True)))

Numpy is pretty heavily optimized under the hood. Per the docs for np.unique, the key is the return_counts flag:

return_counts : bool, optional

If True, also return the number of times each unique value comes up in ar.
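To make the flag concrete, here is a small sketch of what np.unique returns with return_counts=True. The sample list is invented for illustration, and the conversion to plain Python types is an optional extra step, not part of the original one-liner.

```python
import numpy as np

# Hypothetical sample data; the real input would be the list of 6 million strings.
lines = ["b", "a", "b", "c", "a", "b"]

# With return_counts=True, np.unique returns two aligned arrays:
# the sorted unique values and how many times each one occurs.
values, counts = np.unique(lines, return_counts=True)

# Optional: convert numpy scalars to plain str/int for a regular dict.
result = {str(v): int(c) for v, c in zip(values, counts)}
```

Note that np.unique first converts the list to an array, which for strings means a fixed-width array sized to the longest string, so memory use may matter on very large or very uneven inputs.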


Timing

I timed your original approach, the counter approach

result = Counter(lines)

and the numpy approach on a test list generated by

N = 1000000
lines = [chr(i%100) for i in range(N) ]

Obviously, that test isn't great coverage, but it's a start.

Your approach ran in 0.584 s, DeepSpace's Counter in 0.162 s (about a 3.6x speedup), and numpy in 0.0861 s (about a 6.8x speedup). Again, this may depend on many factors, including the type of data you have. The conclusion is that either numpy or a Counter will likely provide a speedup, with Counter not requiring an external library.
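A comparison like this could be reproduced with a small timeit harness along the following lines. This is only a sketch: the function names, the smaller N, and the repeat count are my own choices rather than the setup used for the numbers above, and numpy is left out so the block needs only the standard library.

```python
import timeit
from collections import Counter

def count_with_list_count(lines):
    # Approach from the question: one full scan of the list per unique value.
    return {val: lines.count(val) for val in set(lines)}

def count_with_dict_get(lines):
    # Single pass, plain dict.
    counts = {}
    for line in lines:
        counts[line] = counts.get(line, 0) + 1
    return counts

def count_with_counter(lines):
    # Single pass, collections.Counter.
    return dict(Counter(lines))

# Same generator as the test above, with a smaller N so it runs quickly.
N = 20_000
lines = [chr(i % 100) for i in range(N)]

approaches = {
    "list.count": count_with_list_count,
    "dict.get": count_with_dict_get,
    "Counter": count_with_counter,
}

results = {}
timings = {}
for name, func in approaches.items():
    results[name] = func(lines)
    # Average wall-clock seconds over 3 runs.
    timings[name] = timeit.timeit(lambda: func(lines), number=3) / 3

for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.4f} s")
```

With only 100 unique values the list.count approach is not penalized much; with 500k unique values, as in the question, the gap should be far larger.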

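Calling list.count is very expensive: each call scans the whole list, so calling it once per unique value costs O(n * k) for n items and k unique values. Dictionary access (O(1) amortized time) and the in operator on a dict, however, are relatively cheap. The following snippet makes a single pass over the list, for O(n) time overall.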

def stats(lines):
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram
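As a quick sanity check, the function above can be run on a small list and compared against the list.count approach from the question. The function is repeated here so the block runs on its own; the sample data is invented.

```python
def stats(lines):
    # Single pass: one dict lookup and one update per item.
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram

# Hypothetical sample data.
lines = ["x", "y", "x", "z", "x", "y"]

fast = stats(lines)

# Approach from the question: rescans the whole list for every unique value.
slow = {val: lines.count(val) for val in set(lines)}

print(fast == slow)
```

Both produce the same mapping; only the amount of work differs.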
