Python - Improve performance on reading flat file line-by-line
I have a large .txt file which I want to read one line at a time (rather than reading it all into memory, to avoid out-of-memory issues), and then extract all unique characters present in the file.
I have the code below, which works well for small files, but when I run it on a large file (the kind of file I typically need to process) it runs extremely slowly, e.g. around 1 hour for a 10 GB file.
Can someone please suggest how I can improve the performance, for example by rearranging the operations being performed, avoiding duplicate runs, or using faster functions?
Thanks
def flatten(t):
    '''Flatten list of lists'''
    return [item for sublist in t for item in sublist]

input_file = r'C:\large_text_file.txt'
output_file = r'C:\char_set.txt'

#Parameters
case_sensitive = False
remove_crlf = True

#Extract all unique characters from file
charset = []
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            charset.append(list(line.rstrip())) #remove CRLF
        else:
            charset.append(list(line))
        charset = flatten(charset) #flatten the list of lists
        if not(case_sensitive):
            charset = (map(lambda x: x.upper(), charset)) #convert to upper case
        charset = list(dict.fromkeys(charset)) #remove duplicates
        charset.sort(key=None, reverse=False) #sort character set in ascending order
infile.close() #close the input file

#Output the character set
with open(output_file, 'w') as f:
    for char in charset:
        f.write(char)
You can simplify that considerably and make it linear:
charset = set() # use a real set!
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            line = line.strip()
        if not case_sensitive:
            line = line.upper()
        charset.update(line)

with open(output_file, 'w') as f:
    for char in sorted(charset):
        f.write(char)
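If per-line iteration is still a bottleneck (for example, when the file has very long lines), one possible variation is to read fixed-size chunks instead of lines. The sketch below is an assumption-laden illustration, not part of the original answer: the function name `unique_chars`, the `chunk_size` and `encoding` parameters, and the temporary demo file are all hypothetical. Note that it only discards `'\r'` and `'\n'`, whereas `line.strip()` above also removes other leading and trailing whitespace on each line.

```python
import os
import tempfile
from functools import partial


def unique_chars(path, case_sensitive=False, remove_crlf=True,
                 chunk_size=1 << 20, encoding=None):
    """Return the sorted unique characters of a text file, read in chunks.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    charset = set()
    # newline='' turns off newline translation, so '\r' and '\n' arrive as-is
    with open(path, 'r', encoding=encoding, newline='') as infile:
        # Call infile.read(chunk_size) repeatedly until it returns '' (EOF)
        for chunk in iter(partial(infile.read, chunk_size), ''):
            if not case_sensitive:
                chunk = chunk.upper()
            charset.update(chunk)
    if remove_crlf:
        # Drop line-ending characters once, at the end
        charset -= {'\r', '\n'}
    return sorted(charset)


# Small demonstration on a temporary file (not the 10 GB input)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 newline='') as tmp:
    tmp.write('Hello\r\nworld\r\n')
demo_path = tmp.name
result = unique_chars(demo_path)
os.remove(demo_path)
print(''.join(result))
```

Only one chunk is held in memory at a time, so memory use stays bounded by `chunk_size` plus the size of the character set.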
What made it quadratic were these lines:
charset = flatten(charset) #flatten the list of lists
charset = map(lambda x: x.upper(), charset)
charset = list(dict.fromkeys(charset))
where you keep performing operations on an ever-growing list instead of just the current line.
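As a quick sanity check, the two loop structures can be compared on a small in-memory sample. This is a minimal sketch under stated assumptions: the function names `charset_original` and `charset_with_set` and the `SAMPLE` text are made up for illustration, and both versions use `rstrip()` so that their results are directly comparable.

```python
import io

# Hypothetical sample text standing in for the large input file
SAMPLE = "Hello world\nhello again\nPython\n"


def charset_original(stream, case_sensitive=False):
    """Per-line list processing, mirroring the structure of the question's loop."""
    charset = []
    for line in stream:
        charset.append(list(line.rstrip()))
        # Each of the next steps walks the whole accumulated list again
        charset = [item for sublist in charset for item in sublist]
        if not case_sensitive:
            charset = map(str.upper, charset)
        charset = list(dict.fromkeys(charset))
        charset.sort()
    return charset


def charset_with_set(stream, case_sensitive=False):
    """Set-based version: only the current line is processed per iteration."""
    charset = set()
    for line in stream:
        line = line.rstrip()
        if not case_sensitive:
            line = line.upper()
        charset.update(line)
    return sorted(charset)


slow = charset_original(io.StringIO(SAMPLE))
fast = charset_with_set(io.StringIO(SAMPLE))
print(slow == fast)
```

Both functions return the same sorted list of characters; the difference is that the set-based loop touches only the current line on each iteration and sorts once at the end.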
Some documentation: