Python - Improve performance on reading flat file line-by-line
I have a very large .txt file and I want to read it one line at a time (rather than reading it all into memory, to avoid running out of memory), then extract all the unique characters present in the file. I have the code below, which works fine on small files, but when I run it on a large file (the kind I typically need to process), it runs extremely slowly, e.g. about 1 hour for a 10 GB file. Can someone suggest how I can improve the performance, for example by rearranging the operations being performed, avoiding repeated work, or using faster functions?

Thanks
def flatten(t):
    '''Flatten list of lists'''
    return [item for sublist in t for item in sublist]

input_file = r'C:\large_text_file.txt'
output_file = r'C:\char_set.txt'

# Parameters
case_sensitive = False
remove_crlf = True

# Extract all unique characters from file
charset = []
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            charset.append(list(line.rstrip()))  # remove CRLF
        else:
            charset.append(list(line))
        charset = flatten(charset)  # flatten the list of lists
        if not case_sensitive:
            charset = list(map(lambda x: x.upper(), charset))  # convert to upper case
        charset = list(dict.fromkeys(charset))  # remove duplicates

charset.sort(key=None, reverse=False)  # sort character set in ascending order
infile.close()  # close the input file (redundant: the with-block already closed it)

# Output the character set
with open(output_file, 'w') as f:
    for char in charset:
        f.write(char)
You can simplify this considerably and make it linear:
charset = set()  # use a real set!

with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            line = line.strip()
        if not case_sensitive:
            line = line.upper()
        charset.update(line)

with open(output_file, 'w') as f:
    for char in sorted(charset):
        f.write(char)
What makes your version quadratic is all of these lines:
charset = flatten(charset) #flatten the list of lists
charset = map(lambda x: x.upper(), charset)
charset = list(dict.fromkeys(charset))
You keep running these operations on the ever-growing full list instead of on just the current line.
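To make the linear approach easy to try on in-memory data, here is a minimal sketch of it wrapped in a function; the name `extract_charset` and its parameters are illustrative, not from the original post:

```python
def extract_charset(lines, case_sensitive=False, remove_crlf=True):
    '''Collect the unique characters from an iterable of lines, in sorted order.'''
    charset = set()
    for line in lines:
        if remove_crlf:
            line = line.strip()   # drop CR/LF (and surrounding whitespace)
        if not case_sensitive:
            line = line.upper()   # fold case before collecting
        charset.update(line)      # set.update adds each character at most once
    return ''.join(sorted(charset))

print(extract_charset(["abcA\n", "bcd\n"]))  # → ABCD
```

Because `set.update` does a constant-time membership check per character, each character of the file is touched only once, so the total work is proportional to the file size rather than to the square of the characters seen so far.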