
Python: count the number of occurrences of each line in a huge txt file

I have a big txt file with millions of lines, and I want to count the number of occurrences of each line (how many times each line appears in the file).


The solution I am currently using is below. It works, but it is very slow.

What is a better way to do this? Thank you!

from collections import Counter

crimefile = open("C:\\temp\\large text_file.txt", 'r', encoding = 'utf-8')
yourResult = [line.strip().split('\n') for line in crimefile.readlines()]

yourResult = sum(yourResult, [])

result = dict((i, yourResult.count(i)) for i in yourResult)
output = sorted((value,key) for (key,value) in result.items())

print (Counter(yourResult))

Use a defaultdict and loop over the lines of the file, instead of reading everything into memory.

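from collections import defaultdict

# Missing keys start at 0 (int() returns 0), so each line can be incremented directly.
counter = defaultdict(int)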
with open("C:\\temp\\large text_file.txt", "r", encoding="utf-8") as file:
    for line in file:
        counter[line.strip()] += 1
counter = dict(counter)
print(counter)

Tested with timeit and 10k lines of text, roughly 40x faster on my machine.
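A comparison along these lines can be sketched as follows. This is only a rough illustration, not the exact benchmark behind the figure above: the synthetic `lines` data and the names `count_quadratic` and `count_single_pass` are assumptions made for the example, and the measured ratio will depend on the machine and on the data.

```python
from collections import defaultdict
from timeit import timeit
import random

# Hypothetical synthetic data: 2,000 lines drawn from 200 distinct values.
random.seed(0)
lines = [f"line {random.randrange(200)}\n" for _ in range(2000)]

def count_quadratic(lines):
    # Mirrors the approach in the question: list.count() rescans
    # the whole list once for every element.
    stripped = [line.strip() for line in lines]
    return {item: stripped.count(item) for item in stripped}

def count_single_pass(lines):
    # One pass over the data; each dictionary update is O(1) on average.
    counter = defaultdict(int)
    for line in lines:
        counter[line.strip()] += 1
    return dict(counter)

# Both approaches must produce the same counts.
assert count_quadratic(lines) == count_single_pass(lines)

slow = timeit(lambda: count_quadratic(lines), number=1)
fast = timeit(lambda: count_single_pass(lines), number=1)
print(f"quadratic: {slow:.4f}s, single pass: {fast:.4f}s")
```

The gap should widen as the number of lines grows, because the first function does work proportional to the square of the line count.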

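We can use a single for loop to do this. We don't have to strip the newline character because every line will have it, with one possible exception: if the file does not end with a newline, the last line will lack it and will be counted under a separate key.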

Solution

counter = {}
with open('filename/path', 'r', encoding='utf-8') as file:
    for line in file:
        if line not in counter:
            counter[line] = 1
        else:
            counter[line] += 1
print(counter)
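One thing to watch when the newline is not stripped: a final line without a trailing newline is stored under a different key than earlier identical lines. The sketch below illustrates this with an in-memory file; `count_lines` and its `strip_newline` parameter are hypothetical names introduced for this example.

```python
import io

def count_lines(file, strip_newline=False):
    # Count occurrences of each line read from an open text file object.
    counter = {}
    for line in file:
        if strip_newline:
            line = line.rstrip("\n")
        counter[line] = counter.get(line, 0) + 1
    return counter

# The last "apple" has no trailing newline.
text = "apple\nbanana\napple"

raw = count_lines(io.StringIO(text))
clean = count_lines(io.StringIO(text), strip_newline=True)

print(raw)    # {'apple\n': 1, 'banana\n': 1, 'apple': 1}
print(clean)  # {'apple': 2, 'banana': 1}
```

If the file is known to end with a newline, both variants give equivalent counts and skipping the strip saves a little work per line.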

Time Complexity

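O(n), where n is the number of lines, since each dictionary lookup and update takes O(1) time on average.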

You do not qualify what you mean by "very slow".

I have a text file comprised of 2.5 million different lines which can be processed in ~1.3s as follows:

from timeit import timeit

FILENAME = '/Volumes/G-Drive/foo.txt'

def get_counts():
    d = {}
    line_count = 0

    with open(FILENAME) as f:
        for line in map(str.strip, f):
            d[line] = d.get(line, 0) + 1
            line_count += 1
    key_count = len(d)
    print(f'{line_count=}, {key_count=}')
    return d

print(timeit(get_counts, number=1))

Output:

line_count=2500000, key_count=2500000
1.2901516660003836

Notes:

You could use Counter or defaultdict from the collections module, but they are both slower than the strategy shown in this answer.

As I understand the required functionality, you probably don't need to strip the lines. If you omit that, you could see a further improvement of ~12%.
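For reference, the Counter variant mentioned in the notes might look like the sketch below. It has not been timed against the function above, so treat it as a convenience option, not a speed claim; `count_with_counter` and the temporary demo file are assumptions made for illustration. The main attraction is `most_common`, which returns the entries sorted by count.

```python
from collections import Counter
import os
import tempfile

def count_with_counter(path):
    # Feed the file's lines to Counter one at a time;
    # the file is never loaded into memory as a whole.
    with open(path, "r", encoding="utf-8") as f:
        return Counter(line.rstrip("\n") for line in f)

# Small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\nb\na\nc\na\nb\n")
    demo_path = tmp.name

counts = count_with_counter(demo_path)
os.remove(demo_path)

# The two most frequent lines, highest count first.
print(counts.most_common(2))  # [('a', 3), ('b', 2)]
```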

