I have a large text file with millions of lines, and I want to count the number of occurrences of each line (how many times each line appears in the file).
My current solution is below. It works, but it is very slow.
Is there a better way to do this? Thank you!
from collections import Counter
crimefile = open("C:\\temp\\large text_file.txt", 'r', encoding = 'utf-8')
yourResult = [line.strip().split('\n') for line in crimefile.readlines()]
yourResult = sum(yourResult, [])
result = dict((i, yourResult.count(i)) for i in yourResult)
output = sorted((value,key) for (key,value) in result.items())
print (Counter(yourResult))
Use a defaultdict and loop over the lines, instead of reading everything into memory.
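from collections import defaultdict

# Missing keys default to 0, so each line can be incremented directly.
counter = defaultdict(int)
with open("C:\\temp\\large text_file.txt", "r", encoding="utf-8") as file:
    for line in file:
        counter[line.strip()] += 1
counter = dict(counter)
print(counter)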
Tested with timeit on 10k lines of text, this is roughly 40x faster on my machine.
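A comparison like the one described could be set up roughly as follows. This is only a sketch: the synthetic file, the helper names (make_sample, count_quadratic, count_streaming) and the sizes are assumptions made for illustration, not the data or code behind the timing quoted above, and the measured ratio will vary with the machine and the input.

```python
import os
import tempfile
from collections import defaultdict
from timeit import timeit


def count_quadratic(path):
    # Approach similar to the question: list.count() rescans the whole
    # list once per line, so the work grows quadratically.
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return {line: lines.count(line) for line in lines}


def count_streaming(path):
    # One pass over the file, one dictionary update per line.
    counter = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            counter[line.strip()] += 1
    return dict(counter)


def make_sample(n_lines=2000, n_distinct=50):
    # Hypothetical synthetic input: n_lines lines drawn from n_distinct values.
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for i in range(n_lines):
            f.write(f"line {i % n_distinct}\n")
    return path


if __name__ == "__main__":
    path = make_sample()
    try:
        # Both approaches must agree before their timings are compared.
        assert count_quadratic(path) == count_streaming(path)
        print("quadratic:", timeit(lambda: count_quadratic(path), number=3))
        print("streaming:", timeit(lambda: count_streaming(path), number=3))
    finally:
        os.remove(path)
```

The gap should widen as the file grows, since the first function does work proportional to the square of the number of lines.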
We can do this with a single for loop. We don't need to strip the newline character, because every line will have one (except possibly the last line, if the file does not end with a newline).
Solution
counter = {}
with open('filename/path', 'r', encoding='utf-8') as file:
    for line in file:
        if line not in counter:
            counter[line] = 1
        else:
            counter[line] += 1
print(counter)
Time Complexity
O(n)
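One caveat about leaving the newline in place: if the file does not end with a newline, its last line is a different string from an otherwise identical earlier line. The sketch below illustrates this with an in-memory file; the count_lines helper and the sample text are made up for the illustration.

```python
import io


def count_lines(lines, strip_newline=False):
    # Same single-pass counting as above, with optional newline removal.
    counter = {}
    for line in lines:
        key = line.rstrip("\n") if strip_newline else line
        counter[key] = counter.get(key, 0) + 1
    return counter


sample = io.StringIO("a\nb\na")  # the last line has no trailing newline
raw = count_lines(sample)
sample.seek(0)
stripped = count_lines(sample, strip_newline=True)

print(raw)       # {'a\n': 1, 'b\n': 1, 'a': 1}
print(stripped)  # {'a': 2, 'b': 1}
```

Whether this matters depends on the data; stripping only the trailing newline of each line is one way to avoid the split.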
You do not qualify what you mean by "very slow".
I have a text file comprised of 2.5 million different lines which can be processed in ~1.3s as follows:
from timeit import timeit

FILENAME = '/Volumes/G-Drive/foo.txt'

def get_counts():
    d = {}
    line_count = 0
    with open(FILENAME) as f:
        for line in map(str.strip, f):
            d[line] = d.get(line, 0) + 1
            line_count += 1
    key_count = len(d)
    print(f'{line_count=}, {key_count=}')
    return d

print(timeit(get_counts, number=1))
Output:
line_count=2500000, key_count=2500000
1.2901516660003836
Notes:
You could use Counter or defaultdict from the collections module but they are both slower than the strategy shown in this answer.
As I understand the required functionality, you probably don't need to strip the lines. If you omit that, you could see a further improvement of ~12%.
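For reference, the Counter alternative mentioned in the notes might look like the sketch below. It is not the approach timed in this answer; the count_with_counter helper and the demo file contents are assumptions for illustration. Its main convenience is most_common(), which returns the lines sorted by frequency, something the code in the question builds by hand.

```python
import os
import tempfile
from collections import Counter


def count_with_counter(path, strip=True):
    # Counter consumes the file line by line; stripping is optional.
    with open(path, encoding="utf-8") as f:
        lines = map(str.strip, f) if strip else f
        return Counter(lines)


# Demo on a small temporary file (hypothetical data).
fd, demo_path = tempfile.mkstemp(suffix=".txt", text=True)
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("theft\nfraud\ntheft\narson\ntheft\nfraud\n")

counts = count_with_counter(demo_path)
os.remove(demo_path)

print(counts.most_common(2))  # [('theft', 3), ('fraud', 2)]
```

Whether the convenience is worth the extra time is best decided by timing both versions on the real file.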