Reading lines from a file using Python
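I have a file with almost 100,000 lines. I want to clean it (lowercase, remove stop words, etc.), but this takes a long time.
For a sample of 10,000 lines, the script takes 15 minutes. For the whole file I expected 150 minutes, but it actually takes 5 hours.
At the start, the file is read like this: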
fileinput = open('tweets.txt', 'r')
lines = fileinput.read().lower()  # lowercases everything, but loads the whole file into memory

for line in fileinput:
    lines = line.lower()
Question: is there a way to read the first 10,000 lines, clean them, then read the next block of lines, and so on?
I strongly recommend working line by line rather than reading the whole file at once (in other words, don't use .read()).
with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)
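If you do want to work in blocks of 10,000 lines, as the question asks, one option is `itertools.islice`, which pulls a fixed number of lines from the open file without loading the rest. This is only a minimal sketch: `read_in_chunks`, `lower_file_in_chunks` and the default chunk size are illustrative names and values, not part of the original code.

```python
from itertools import islice


def read_in_chunks(handle, chunk_size=10000):
    """Yield lists of at most chunk_size lines from an open file (hypothetical helper)."""
    while True:
        # islice consumes up to chunk_size lines and leaves the rest unread
        chunk = list(islice(handle, chunk_size))
        if not chunk:
            return
        yield chunk


def lower_file_in_chunks(src_path, dst_path, chunk_size=10000):
    """Lowercase src_path into dst_path one chunk at a time; return the chunk count."""
    chunks = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as sink:
        for chunk in read_in_chunks(src, chunk_size):
            sink.writelines(line.lower() for line in chunk)
            chunks += 1
    return chunks
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than on the size of the file.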
Try processing the file one line at a time:
lowered = []
with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # ... or just dump them to stdout right away
        print(line)

for line in lowered:
    # print or write to a file, or whatever you require
    pass
This way you reduce the memory overhead, which for large files might otherwise lead to swapping and kill performance.
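The same streaming pattern extends to the rest of the cleaning mentioned in the question. The sketch below assumes plain whitespace tokenization and a small made-up stop-word list; `STOP_WORDS`, `clean_line` and `clean_file` are hypothetical names. Keeping the stop words in a `set` makes each membership test constant time on average.

```python
# Hypothetical stop-word list; substitute your own.
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'to', 'of'}


def clean_line(line, stop_words=STOP_WORDS):
    """Lowercase a line and drop stop words (splits on whitespace only)."""
    words = line.lower().split()
    return ' '.join(word for word in words if word not in stop_words)


def clean_file(src_path, dst_path):
    """Stream src_path into dst_path, cleaning one line at a time."""
    with open(src_path, 'r') as src, open(dst_path, 'w') as sink:
        for line in src:
            sink.write(clean_line(line) + '\n')
```

For example, `clean_line('The cat is ON the mat\n')` gives `'cat on mat'`.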
Here are some benchmarks on a file with about 1 million lines:
# (1) real 0.223 user 0.195 sys 0.026 pcpu 98.71
with open('medium.txt') as handle:
    for line in handle:
        pass

# (2) real 0.295 user 0.262 sys 0.025 pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
    print(i)  # 1031124

# (3) real 21.561 user 5.072 sys 3.530 pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())

# (4) real 1.702 user 1.605 sys 0.089 pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

# (5) real 2.307 user 1.983 sys 0.159 pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())
with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)
You can also work with two files at once, reading from one and writing to the other:
# (6) real 1.944 user 1.666 sys 0.115 pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())
The results as a table (real time in seconds):
# (1) noop                    0.223
# (2) w/ enumerate            0.295
# (4) list buffer             1.702
# (6) on-the-fly              1.944
# (5) r -> list buffer -> w   2.307
# (3) stdout print           21.561
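To get comparable numbers on your own file, you could time each variant from inside Python. The answer does not say how its figures were measured, so this is only a sketch; `time_call` and `count_lines` are hypothetical helpers, and `time.perf_counter` reports wall-clock time only, not the user/sys split shown above.

```python
import time


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func (hypothetical helper)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def count_lines(path):
    """No-op style benchmark: iterate over the file and count its lines."""
    count = 0
    with open(path) as handle:
        for count, _ in enumerate(handle, start=1):
            pass
    return count
```

A call such as `n, seconds = time_call(count_lines, 'medium.txt')` would then give the line count and the elapsed time for one run; repeating the call a few times helps smooth out disk-cache effects.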
Change your script as follows:
with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        # do what you need to do with each line
        line = line.lower()
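So basically, don't read the entire file into memory with read(); just iterate over the lines of the open file. When you read a huge file into memory, your process may grow to the point where the system has to swap parts of it out, and that makes it very slow.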