How can I compare two files in Python more efficiently?

I have to compare two large files, but I'm running into performance problems.

Consider two files, X and Y.

X has 42,000 records, one word per line.

Y has 881,000 records, three words per line, i.e. three comma-separated columns.

I want to compare the words of the X file with the first word of each line of the Y file.

If I find the X_word in the Y_first_column_word, then I write the word from the second column of the Y file (Y_second_column_word) to an output file. When the second column is blank, I write the first-column word instead.

See the code:

to_file = open( output_file, 'w' )                # opening the file to write
f1      = open( input_file1, "rU" ).readlines()   # reading 1st file  42000 records
f2      = open( input_file2, "rU" ).readlines()   # reading 2nd file 881000 records

for i, w1 in enumerate( f1 ):
    for j, line in enumerate( f2 ):
        w2 = line.split(',')                      # splitting words from  2nd file
        if w1.strip() == w2[0].strip():           # removing trails
            if w2[1].strip() == '':               # when it is blank, get 1st column word 
                w2[1] = w2[0]
            print>>to_file, w2[1]

to_file.close()                                   # closing the file

I've run it against test data, and it does what I want. But when I run it on the real data, it becomes unresponsive. My last attempt ran for 18 hours.

Is there any way to improve this code so that it runs more efficiently?

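Your current approach is quadratic, O(N*M): for each of the 42,000 words in the first file it rescans all 881,000 lines of the second, which is roughly 3.7e10 comparisons. If you use a dictionary to store the content of the second file, keyed on the first column, each lookup takes constant time on average and you can do this in linear time.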

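with open(input_file1, "rU") as f1, open(input_file2, "rU") as f2, \
        open(output_file, "w") as to_file:
    # Map each first-column word of the second file to its second-column word.
    # A blank second column falls back to the first-column word, as in the question.
    words_dict = {k.strip(): v.strip() or k.strip()
                  for k, v, _ in (line.split(',') for line in f2)}
    for word in f1:
        word = word.strip()
        if word in words_dict:
            to_file.write(words_dict[word] + '\n')    # write the matching word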
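
One caveat with a plain dictionary is that it keeps only the last line for each first-column word, whereas the nested loops in the question write one output line for every matching line. If the second file can contain repeated first-column words, a sketch along the following lines may be closer to the original behaviour. It assumes Python 3 (where the default text mode already handles universal newlines, so "rU" is not needed) and comma-separated columns. The function name `match_second_column` and its parameter names are made up for illustration, and skipping lines that have no comma is an assumption, not something the question specifies.

```python
from collections import defaultdict


def match_second_column(words_path, table_path, output_path):
    """Write the second-column word of every table line whose first column
    matches a word in words_path. Returns the number of lines written."""
    # Index the large file once: first-column word -> list of output words.
    index = defaultdict(list)
    with open(table_path) as table:
        for line in table:
            columns = [c.strip() for c in line.split(',')]
            if len(columns) < 2:
                continue  # assumption: skip blank lines and lines without a comma
            first, second = columns[0], columns[1]
            # Same fallback as the question: a blank second column yields the first.
            index[first].append(second or first)

    written = 0
    with open(words_path) as words, open(output_path, 'w') as out:
        for word in words:
            # .get() avoids adding empty entries to the defaultdict on a miss.
            for value in index.get(word.strip(), ()):
                out.write(value + '\n')
                written += 1
    return written
```

Each file is read once, so the work grows with the total number of lines rather than with their product. The output keeps the order of the nested-loop version: by word in the first file, then by line order in the second.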

