How can I compare two files in Python more efficiently?

Question

I have to compare two large files, but I'm running into performance problems.

So, let's consider two files, X and Y.

X has 42000 records, one word per line.

Y has 881000 records, three words per line, i.e. three comma-separated columns.

I want to compare the words of the X file with the first word of each line in the Y file.

If an X word equals the first-column word of a Y line, I write the second-column word of that Y line to an output file. If the second column is blank, I write the first-column word instead.

See the code:

to_file = open( output_file, 'w' )                # opening the file to write
f1      = open( input_file1, "rU" ).readlines()   # reading 1st file  42000 records
f2      = open( input_file2, "rU" ).readlines()   # reading 2nd file 881000 records

for i, w1 in enumerate( f1 ):
    for j, line in enumerate( f2 ):
        w2 = line.split(',')                      # splitting words from  2nd file
        if w1.strip() == w2[0].strip():           # removing trails
            if w2[1].strip() == '':               # when it is blank, get 1st column word 
                w2[1] = w2[0]
            print>>to_file, w2[1]

to_file.close()                                   # closing the file

I've tested it with test data, and it does what I want. But when I run it with the real data it becomes unresponsive. My last attempt ran for 18 hours.

Is there any way to improve this code so that it runs more efficiently?

Answer 1

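Your current approach is O(N*M): every one of the 42000 words in the first file is compared against all 881000 lines of the second file, which is about 37 billion comparisons. If you store the content of the second file in a dictionary, each lookup takes constant time on average and the whole job becomes linear in the total number of lines.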

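with open(input_file1, "r") as f1, open(input_file2, "r") as f2, open(output_file, "w") as to_file:
    # first-column word -> second-column word (a later duplicate key overwrites an earlier one)
    words_dict = {k.strip(): v.strip() for k, v, _ in (line.split(',') for line in f2)}
    for word in f1:
        word = word.strip()
        if word in words_dict:
            # when the second column is blank, write the first-column word instead
            to_file.write((words_dict[word] or word) + '\n')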
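
The dictionary idea above keeps only one value per key, while the code in the question writes a line for every matching row in Y. A minimal sketch of a variant that keeps all matches could look like the following. The function name `write_matches` and its parameters are hypothetical, the code targets Python 3, and skipping rows that have no second column is an assumption (the question's code would raise an error on such rows).

```python
from collections import defaultdict


def write_matches(x_path, y_path, out_path):
    """Write the second-column word of every Y row whose first column appears in X.

    Hypothetical helper: returns the number of lines written.
    """
    # Build the lookup once: first-column word -> list of second-column words,
    # kept in the order the rows appear in Y.
    second_by_first = defaultdict(list)
    with open(y_path) as y_file:
        for line in y_file:
            columns = line.split(',')
            if len(columns) < 2:
                continue  # assumption: ignore rows without a second column
            first = columns[0].strip()
            second = columns[1].strip()
            # Same fallback as the question: a blank second column
            # is replaced by the first-column word.
            second_by_first[first].append(second or first)

    written = 0
    with open(x_path) as x_file, open(out_path, 'w') as out_file:
        for line in x_file:
            word = line.strip()
            # .get() avoids adding empty entries to the defaultdict on a miss.
            for value in second_by_first.get(word, ()):
                out_file.write(value + '\n')
                written += 1
    return written
```

Each file is read once, so the work grows with 42000 + 881000 lines rather than with their product. The output follows the order of X, as in the question, but values are written with surrounding whitespace stripped. The trade-off is memory: the first two columns of Y are held in the dictionary. If that is too much, an alternative is to load the X words into a set and stream Y line by line, at the cost of the output following the order of Y instead.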

Answer 1
4 | accepted | 2014-10-29 11:43:58
