Python: counting the number of occurrences of each line in a huge text file
I have a big text file with millions of lines, and I want to count the number of occurrences of each line (how many times each line appears in the file).
My current solution is below. It works, but it is very slow.
What is a better way to do this? Thank you!
from collections import Counter
crimefile = open("C:\\temp\\large text_file.txt", 'r', encoding = 'utf-8')
yourResult = [line.strip().split('\n') for line in crimefile.readlines()]
yourResult = sum(yourResult, [])
result = dict((i, yourResult.count(i)) for i in yourResult)
output = sorted((value,key) for (key,value) in result.items())
print (Counter(yourResult))
Use a defaultdict and loop over the lines instead of reading everything into memory.
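from collections import defaultdict

# Missing keys start at 0, so each line can be incremented directly.
counter = defaultdict(int)
with open("C:\\temp\\large text_file.txt", "r", encoding="utf-8") as file:
    # Iterating over the file object reads one line at a time.
    for line in file:
        counter[line.strip()] += 1

counter = dict(counter)
print(counter)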
Tested with timeit on 10k lines of text, this is roughly 40x faster on my machine.
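A comparison like this can be reproduced with a small script. The sketch below is not the original benchmark: it builds an in-memory list of synthetic lines (assumed data, since the original file was not given, and only 2,000 lines to keep the run short) and times the list.count approach from the question against a single pass with defaultdict. The function names are made up for illustration, and the measured speedup will depend on the data and the machine.

```python
from collections import defaultdict
from timeit import timeit

# Synthetic stand-in for the file: 2,000 lines, 200 distinct values (assumed data).
lines = [f"line {i % 200}" for i in range(2_000)]

def count_with_list_count(items):
    # Quadratic: list.count rescans the whole list for every element.
    return {item: items.count(item) for item in items}

def count_with_defaultdict(items):
    # Linear: one dictionary update per element.
    counter = defaultdict(int)
    for item in items:
        counter[item] += 1
    return dict(counter)

slow = timeit(lambda: count_with_list_count(lines), number=1)
fast = timeit(lambda: count_with_defaultdict(lines), number=1)
print(f"list.count: {slow:.4f}s, defaultdict: {fast:.4f}s")
```

The gap grows with the number of lines, because the first function does work proportional to the square of the input size while the second does work proportional to the input size.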
We can do this with a single for loop. We don't have to strip the newline character, because every line will have it (the one exception is the last line, if the file does not end with a newline).
Solution
counter = {}
with open('filename/path', 'r', encoding='utf-8') as file:
    for line in file:
        if line not in counter:
            counter[line] = 1
        else:
            counter[line] += 1
print(counter)
Time Complexity
O(n)
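One case worth checking when the newline is left in place is a file whose last line has no trailing newline. The sketch below illustrates it with io.StringIO standing in for a real file; the helper count_lines and its strip_newline parameter are hypothetical names used only for this example.

```python
import io

def count_lines(file, strip_newline):
    # Count each line, optionally removing the trailing newline first.
    counter = {}
    for line in file:
        if strip_newline:
            line = line.rstrip("\n")
        counter[line] = counter.get(line, 0) + 1
    return counter

# The last line has no trailing newline.
text = "a\nb\na"

raw = count_lines(io.StringIO(text), strip_newline=False)
stripped = count_lines(io.StringIO(text), strip_newline=True)
print(raw)       # {'a\n': 1, 'b\n': 1, 'a': 1}
print(stripped)  # {'a': 2, 'b': 1}
```

Without stripping, the final "a" is counted separately from the earlier "a\n". If the file always ends with a newline, both variants give the same counts apart from the newline in the keys.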
You do not specify what you mean by "very slow".
I have a text file made up of 2.5 million distinct lines, which can be processed in ~1.3s as follows:
from timeit import timeit

FILENAME = '/Volumes/G-Drive/foo.txt'

def get_counts():
    d = {}
    line_count = 0
    with open(FILENAME) as f:
        # Strip each line lazily while iterating over the file.
        for line in map(str.strip, f):
            d[line] = d.get(line, 0) + 1
            line_count += 1
    key_count = len(d)
    print(f'{line_count=}, {key_count=}')
    return d

print(timeit(get_counts, number=1))
Output:
line_count=2500000, key_count=2500000
1.2901516660003836
Notes:
You could use Counter or defaultdict from the collections module, but in my tests they are both slower than the strategy shown in this answer.
As I understand the required functionality, you probably don't need to strip the lines. If you omit that step, you could see a further improvement of ~12%.
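For reference, the Counter variant mentioned in the notes could look like the following sketch. It is not part of the timed code above, and the function name count_with_counter and the sample data are made up for illustration. Counter accepts any iterable, so the lines of a file can be fed to it directly, and most_common returns the counts sorted from most to least frequent.

```python
import io
from collections import Counter

def count_with_counter(file):
    # Counter consumes the iterable line by line; rstrip drops the newline
    # so a final line without one is counted with its duplicates.
    return Counter(line.rstrip("\n") for line in file)

# io.StringIO stands in for open(FILENAME) so the sketch is self-contained.
sample = io.StringIO("x\ny\nx\nz\nx\ny")
counts = count_with_counter(sample)
print(counts.most_common(2))  # [('x', 3), ('y', 2)]
```

This trades some speed for convenience: the counting itself is a one-liner and the sorted view comes for free.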