How can I compare two files in Python in a more efficient way?
I have to compare two large files, but I'm running into performance problems.
Consider two files, X and Y.
X has 42000 records, one word per line.
Y has 881000 records, three words per line (i.e. three columns).
I want to compare each word in X with the first word of each line in Y.
If X_word matches Y_first_column_word, I write the word from the second column of Y (Y_second_column_word) to an output file.
See the code:
to_file = open( output_file, 'w' ) # opening the file to write
f1 = open( input_file1, "rU" ).readlines() # reading 1st file 42000 records
f2 = open( input_file2, "rU" ).readlines() # reading 2nd file 881000 records
for i, w1 in enumerate( f1 ):
    for j, line in enumerate( f2 ):
        w2 = line.split(',') # splitting words from 2nd file
        if w1.strip() == w2[0].strip(): # removing trails
            if w2[1].strip() == '': # when it is blank, get 1st column word
                w2[1] = w2[0]
            print>>to_file, w2[1]
to_file.close() # closing the file
I've tested it with test data, and it does what I want. But when I run it with the real data, it becomes unresponsive. My last attempt ran for 18 hours.
Is there any way to improve this code so that it runs more efficiently?
Your current approach is O(N**2): every word in X is compared against every line in Y, which is 42000 * 881000, or about 37 billion comparisons. If you use a dictionary to store the contents of the second file, you can do this in linear time.
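with open(input_file1, "rU") as f1, open(input_file2, "rU") as f2:
    # one pass over Y: map first-column word -> second-column word
    words_dict = {k: v for k, v, _ in (line.split(',') for line in f2)}
    for word in f1:
        word = word.rstrip()
        if word in words_dict:
            # to_file is the output file opened as in the question
            to_file.write(words_dict[word] + '\n')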
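For completeness, here is a minimal Python 3 sketch of the dictionary idea. It adds the blank-second-column fallback from the question, plus a variant that builds a set from X and streams Y once. The function names (build_lookup, match_with_dict, match_with_set, write_matches) are hypothetical, and the sketch assumes comma-separated lines as in the question's split(','). Lines with fewer than two columns are skipped, which is an assumption and not something the original code does.

```python
def build_lookup(y_lines):
    """Map first-column word -> second-column word (first column if blank)."""
    lookup = {}
    for line in y_lines:
        cols = [c.strip() for c in line.split(",")]
        if len(cols) < 2 or not cols[0]:
            continue  # assumption: skip malformed or empty lines
        # a later row with the same first column overwrites an earlier one
        lookup[cols[0]] = cols[1] or cols[0]
    return lookup


def match_with_dict(x_lines, y_lines):
    """Dictionary approach: one pass over Y, then one pass over X."""
    lookup = build_lookup(y_lines)
    for word in x_lines:
        word = word.strip()
        if word in lookup:
            yield lookup[word]


def match_with_set(x_lines, y_lines):
    """Variant: set of X words, stream Y once; keeps duplicate Y rows."""
    wanted = {w.strip() for w in x_lines} - {""}
    for line in y_lines:
        cols = [c.strip() for c in line.split(",")]
        if len(cols) >= 2 and cols[0] in wanted:
            yield cols[1] or cols[0]


def write_matches(x_path, y_path, out_path, matcher=match_with_dict):
    """Write one matched second-column word per line to out_path."""
    with open(x_path) as fx, open(y_path) as fy, open(out_path, "w") as out:
        for value in matcher(fx, fy):
            out.write(value + "\n")
```

The two matchers are not interchangeable. If a first-column word appears on several lines of Y, match_with_dict keeps only the last one, while match_with_set emits every matching line, as the nested loop in the question does. The output order differs as well: the dictionary version follows the order of X, the set version follows the order of Y. A call such as write_matches(input_file1, input_file2, output_file, matcher=match_with_set) selects the streaming variant.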