I have a list of strings that contains about 6 million items, and I am trying to count the occurrences of each unique value.
Here is my code:
lines = [...]  # about 6 million strings
unique_val = list(set(lines))  # contains around 500k items
mydict = {}
for val in unique_val:
    mydict[val] = lines.count(val)
I've found that the above code runs very slowly because the list I am counting is huge.
Is there a way to make it faster?
Many thanks
If you don't want to use the collections module:
counts = dict()
for line in lines:
    # get() returns 0 the first time a value is seen
    counts[line] = counts.get(line, 0) + 1
Or, if you just don't want to use Counter:
from collections import defaultdict
counts = defaultdict(int)  # missing keys start at 0
for line in lines:
    counts[line] += 1
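For comparison, the collections.Counter version that the two snippets above are alternatives to might look like the following. This is only a minimal sketch: the short sample list is made up for illustration and stands in for the real 6 million strings.

```python
from collections import Counter

# Hypothetical sample data standing in for the real list of strings.
lines = ["apple", "pear", "apple", "fig", "pear", "apple"]

# Counter makes a single pass over the list and tallies each distinct value.
counts = Counter(lines)

# A Counter is a dict subclass, so ordinary dict lookups work.
print(counts["apple"])

# most_common(n) returns the n most frequent values with their counts.
print(counts.most_common(2))
```

Since Counter is a dict subclass, the result can be used wherever the plain dict from the other snippets would be used.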
How about this:
from collections import Counter, defaultdict
lines = [...]  # about 6 million strings
d = defaultdict(int)
for line in lines:
    # Counter(line) tallies the elements of each line
    # (its characters, if line is a plain string)
    for word, count in Counter(line).items():
        d[word] += count
Numpy Solution
I think numpy will give you the fastest answer, using unique:
import numpy as np
result = dict(zip(*np.unique(lines, return_counts=True)))
Numpy is pretty heavily optimized under the hood. Per the linked docs, the key is the return_counts flag:
return_counts : bool, optional
    If True, also return the number of times each unique value comes up in ar.
Timing
I timed your original approach, the Counter approach
result = Counter(lines)
and the numpy approach on a list generated by
N = 1000000
lines = [chr(i % 100) for i in range(N)]
Obviously, that test isn't great coverage, but it's a start.
Your approach ran in 0.584s, DeepSpace's Counter in 0.162s (3.5x speedup), and numpy in 0.0861s (7x speedup). Again, this may depend on a lot of factors, including the type of data you have. The conclusion may be that either numpy or a Counter will provide a speedup, with a Counter not requiring an external library.
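If you want to repeat this kind of comparison on your own data, a timing harness along these lines could be a starting point. It is a sketch, not the exact script behind the numbers above: the function names are made up, N is reduced so the slow version finishes quickly, and numpy is left out so that it runs with the standard library alone.

```python
import timeit
from collections import Counter

# Same synthetic data pattern as in the timing above, but with a smaller,
# arbitrarily chosen N so the slow version finishes quickly.
N = 100_000
lines = [chr(i % 100) for i in range(N)]

def count_with_list_count(lines):
    # Original approach: one full scan of the list per unique value.
    return {val: lines.count(val) for val in set(lines)}

def count_with_counter(lines):
    # Single pass over the list.
    return dict(Counter(lines))

# Both approaches must agree before their timings are worth comparing.
assert count_with_list_count(lines) == count_with_counter(lines)

for func in (count_with_list_count, count_with_counter):
    # Take the best of 3 runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: func(lines), number=1, repeat=3))
    print(f"{func.__name__}: {best:.4f}s")
```

Absolute times will differ from machine to machine, and this data has only 100 unique values. With around 500k unique values, as in the question, the gap between the two approaches should be far larger.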
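Calling list.count is very expensive: each call scans the whole list, and the original code makes one call per unique value. Dictionary access (O(1) amortized time) and the in operator on a dict, however, are relatively cheap. The following snippet makes a single pass over the list, which gives much better time complexity.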
def stats(lines):
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram
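As a quick sanity check, the function can be compared against the original list.count loop on a small input. The sketch below repeats stats so that it runs on its own; the sample list and the sorting step are only illustrative.

```python
def stats(lines):
    # Same single-pass counting function as above, repeated so this runs alone.
    histogram = {}
    for s in lines:
        if s in histogram:
            histogram[s] += 1
        else:
            histogram[s] = 1
    return histogram

# Hypothetical sample data for illustration.
lines = ["b", "a", "b", "c", "b", "a"]
histogram = stats(lines)

# Cross-check against the original list.count approach from the question.
expected = {val: lines.count(val) for val in set(lines)}
assert histogram == expected

# Sort the (value, count) pairs by count, highest first.
top = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```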