Python counting occurrence in a list: how to make it faster?

Question

I have a list of strings containing about 6 million items, and I am trying to count the occurrences of each unique value.

Here is my code:

lines = [...]    # placeholder: about 6 million strings
unique_val = list(set(lines))    # contains around 500k items

mydict = {}
for val in unique_val:
    mydict[val] = lines.count(val)

I've found that the above code is very slow, given that the list I am counting is huge.

Is there a way to make it faster?

Many thanks

Answer 1

If you don't want to use the collections module:

counts = dict()
for line in lines:
    # get() returns 0 the first time a value is seen
    counts[line] = counts.get(line, 0) + 1

Or, if you just don't want to use Counter:

from collections import defaultdict
counts = defaultdict(int)    # missing keys start at 0
for line in lines:
    counts[line] += 1
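
For comparison, the Counter route that both snippets avoid is a one-liner. A minimal sketch, assuming a small made-up list in place of the asker's 6 million strings:

```python
from collections import Counter

# Hypothetical sample data standing in for the real list of strings.
lines = ["apple", "pear", "apple", "fig", "pear", "apple"]

# Counter makes a single pass over the list and tallies each value.
counts = Counter(lines)

# Counter is a dict subclass, so ordinary dict access works.
print(counts["apple"])        # 3

# most_common(n) returns the n most frequent (value, count) pairs.
print(counts.most_common(2))  # [('apple', 3), ('pear', 2)]
```

All three variants do the same amount of work per item, so the choice between them is mostly a matter of taste.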

Answer 2

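How about this? Note that Counter(line) iterates over the characters of each string, so this tallies characters across all lines rather than whole-line occurrences.

from collections import Counter, defaultdict

lines = [...]    # placeholder: about 6 million strings

d = defaultdict(int)
for line in lines:
    # Counter(line) counts the characters inside this one string
    for word, count in Counter(line).items():
        d[word] += count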
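
To see what this nested loop actually counts, it may help to run it next to a plain Counter over the list. A small sketch with made-up data (the variable names are illustrative):

```python
from collections import Counter, defaultdict

lines = ["ab", "ab", "ba"]  # hypothetical sample data

# Per-line Counter merged into one dict: this tallies characters.
char_counts = defaultdict(int)
for line in lines:
    for ch, n in Counter(line).items():
        char_counts[ch] += n

# Counter over the list itself: this tallies whole strings.
line_counts = Counter(lines)

print(dict(char_counts))  # {'a': 3, 'b': 3}
print(dict(line_counts))  # {'ab': 2, 'ba': 1}
```

If each element were a list of words instead of a string, the nested loop would produce per-word totals, which may be what this answer had in mind.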

Answer 3

Numpy Solution

I think numpy will give you the fastest answer, using np.unique:

import numpy as np
result = dict(zip(*np.unique(lines, return_counts=True)))

Numpy is heavily optimized under the hood. Per the documentation, the key is the return_counts flag:

return_counts : bool, optional

If True, also return the number of times each unique value comes up in ar.
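
Unpacked into steps, the one-liner might look like the sketch below. The sample list is made up, and the conversion to plain Python types is an optional addition, not part of the original answer:

```python
import numpy as np

lines = ["b", "a", "b", "c", "b"]  # hypothetical sample data

# np.unique returns the sorted unique values and, with return_counts=True,
# a parallel array holding how often each one occurs.
values, counts = np.unique(lines, return_counts=True)

# Convert NumPy scalars to plain str/int so the dict behaves like the
# ones built in the other answers.
result = {str(v): int(c) for v, c in zip(values, counts)}
print(result)  # {'a': 1, 'b': 3, 'c': 1}
```

Without the conversion, the keys and values are NumPy scalar types, which usually behave like str and int but can surprise code that checks types or serializes to JSON.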

Timing

I timed your original approach, the Counter approach

result = Counter(lines)

and the numpy approach on a list generated by

N = 1000000
lines = [chr(i%100) for i in range(N) ]

Obviously, that test isn't great coverage, but it's a start.

Your approach ran in 0.584 s, DeepSpace's Counter in 0.162 s (about a 3.5x speedup), and numpy in 0.0861 s (about a 7x speedup). Again, this may depend on many factors, including the type of data you have. The conclusion is that either numpy or a Counter will provide a speedup, with Counter not requiring an external library.
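
A rough harness along these lines could be used to repeat the comparison on your own machine. This is only a sketch: the function names are illustrative, N is reduced from the 1,000,000 used above so the slow version finishes quickly, and the timings it prints will differ from the figures quoted here.

```python
import timeit
from collections import Counter

N = 100_000  # smaller than the answer's 1,000,000 (assumption, to keep it quick)
lines = [chr(i % 100) for i in range(N)]

def count_with_list_count(lines):
    # Original approach: one full scan of the list per unique value.
    return {val: lines.count(val) for val in set(lines)}

def count_with_counter(lines):
    # Single pass over the list.
    return Counter(lines)

for fn in (count_with_list_count, count_with_counter):
    # Best of three single runs, to reduce noise from other processes.
    best = min(timeit.repeat(lambda: fn(lines), number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```

With only 100 distinct values the gap stays modest; it should widen as the number of unique values grows, since the original approach rescans the list once per unique value.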

Answer 4

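Calling list.count is very expensive: each call scans the entire list, so calling it once per unique value costs O(n * u) for n items and u unique values. Dictionary access (O(1) amortized time) and the in operator on a dict, however, are relatively cheap. The following snippet makes a single pass over the list, for O(n) total time.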

def stats(lines):
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram
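
To put rough numbers on that difference, here is a back-of-the-envelope calculation using the sizes from the question (about 6 million items, about 500k unique values). It counts element visits only and ignores constant factors, so treat it as an order-of-magnitude estimate:

```python
n_items = 6_000_000   # list length, from the question
n_unique = 500_000    # unique values, from the question

# list.count scans the whole list, and it is called once per unique value.
count_based_visits = n_items * n_unique

# The dictionary approach touches each item exactly once.
single_pass_visits = n_items

print(f"{count_based_visits:.1e}")               # 3.0e+12
print(count_based_visits // single_pass_visits)  # 500000
```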

4 answers

solution1
1 2016-06-13 14:13:06

solution2
1 2016-06-13 14:13:54

solution3
1 2016-06-13 14:14:56

solution4
1 2016-06-13 14:29:12
