How can I compare two files in Python more efficiently?

Question

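I have to compare two large files, but I am running into performance problems.

Consider two files, X and Y.

X has 42,000 records, one word per line.

Y has 881,000 records. Each line has three words, i.e. three columns.

I want to compare each word in X against the first word of each line in Y.

If Y_first_column_word matches an X_word, the word from Y's second column (Y_second_column_word) should be written to the output file.

Here is the code: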

to_file = open( output_file, 'w' )                # opening the file to write
f1      = open( input_file1, "rU" ).readlines()   # reading 1st file  42000 records
f2      = open( input_file2, "rU" ).readlines()   # reading 2nd file 881000 records

for i, w1 in enumerate( f1 ):
    for j, line in enumerate( f2 ):
        w2 = line.split(',')                      # splitting words from  2nd file
        if w1.strip() == w2[0].strip():           # removing trails
            if w2[1].strip() == '':               # when it is blank, get 1st column word 
                w2[1] = w2[0]
            print>>to_file, w2[1]

to_file.close()                                   # closing the file

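I have tested it with test data and it does what I need. But when I run it on the real data, it becomes unresponsive. My last attempt took 18 hours.

Is there any way to improve this code so that it runs more efficiently?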

Answer 1

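Your current approach is O(N**2): for every word in X it rescans and re-splits every line of Y, which is roughly 42,000 × 881,000 ≈ 3.7 × 10^10 comparisons. If you use a dictionary to store the contents of the second file, you can do it in linear time, because each lookup takes constant time on average.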

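with open(input_file1) as f1, open(input_file2) as f2, open(output_file, 'w') as to_file:
    # Map Y's first column to its second column.
    # Assumes every line of Y has at least two comma-separated fields.
    words_dict = {}
    for line in f2:
        k, v = [col.strip() for col in line.split(',')[:2]]
        words_dict[k] = v or k                # blank 2nd column: fall back to the 1st
    for word in f1:
        word = word.strip()
        if word in words_dict:                # O(1) average lookup instead of a scan of Y
            to_file.write(words_dict[word] + '\n')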
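
A variation on the same idea, offered as a minimal sketch rather than as the answer's own method: since X is the smaller file, its words can be loaded into a set and Y streamed once. This keeps memory proportional to X and writes one output line for every matching row of Y, even when a first-column word appears more than once (a dictionary keyed on the first column would keep only the last such row). The function name `match_columns` and its parameters are hypothetical, and the code assumes comma-separated columns, as in the question's code.

```python
def match_columns(x_path, y_path, out_path):
    """Write Y's second column for every Y row whose first column is a word in X.

    Returns the number of lines written. Hypothetical helper for illustration.
    """
    # Load the smaller file (X) into a set for O(1) average membership tests.
    with open(x_path) as fx:
        words = {line.strip() for line in fx if line.strip()}

    written = 0
    # Stream the larger file (Y) once instead of holding it all in memory.
    with open(y_path) as fy, open(out_path, "w") as out:
        for line in fy:
            cols = [col.strip() for col in line.split(",")]
            # Skip rows with fewer than two columns or with no match in X.
            if len(cols) < 2 or cols[0] not in words:
                continue
            # A blank second column falls back to the first column,
            # mirroring the rule in the question's code.
            out.write((cols[1] or cols[0]) + "\n")
            written += 1
    return written
```

Note that the output follows the row order of Y rather than the word order of X; if the order of X matters, the dictionary version above is the closer match.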
