Make a diff of 2 text files quickly using Python
I have 2 big text files (17 MB right now, but they could reach gigabytes), so I don't want to load them into RAM because their size could exceed my RAM capacity.
The code I have written so far is this:
import os
import time


def stopIfFileExist(filename):
    # Refuse to overwrite or append to an existing diff file.
    if os.path.isfile(filename):
        raise Exception("%s already exists" % filename)


def compareDump(before_filename, after_filename, diff_filename):
    """
    Compare 2 dumps generated via makeDump(output_filename) and generate
    a file containing the differences
    -before_filename : (string) filename of the first dump
    -after_filename : (string) filename of the second dump
    -diff_filename : (string) filename of the diff
    """
    stopIfFileExist(diff_filename)

    # Count the lines once so that progress can be reported as a percentage.
    with open(after_filename, "r") as afterFile:
        num_lines = sum(1 for line in afterFile)
    one_percent = num_lines / float(100)

    diff = []

    start = time.time()
    with open(after_filename, "r") as afterFile:
        counter = 0
        for a_line in afterFile:
            print("completion : %.9f percents" % (counter / float(one_percent)))
            counter = counter + 1
            # Assume the line is new, then drop it if it is found in the first dump.
            diff.append(a_line)
            # The first dump is rescanned from the start for every line.
            with open(before_filename, "r") as beforeFile:
                for b_line in beforeFile:
                    if a_line.rstrip() == b_line.rstrip():
                        diff.pop()
                        break
    end = time.time()
    print("task completed in %s seconds" % (end - start))

    with open(diff_filename, "a") as diffFile:
        for line in diff:
            diffFile.write(line)
What I'd like to do is remove from beforeFile a line that was successfully compared (i.e. when the test if a_line.rstrip() == b_line.rstrip(): is triggered).
However, since I am currently reading the file, I don't see how to do it.
Any ideas?
Thanks.
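One possible way to avoid rescanning beforeFile for every line, without deleting anything from it, is to read it once and remember a short digest of each line. The following is only a minimal sketch under stated assumptions, not the poster's method: it targets Python 3.6+, the names line_digest and compare_dump_fast are hypothetical, and it assumes a set of 16-byte digests of the first dump fits in memory even when the full file would not. Memory still grows with the number of distinct lines, and a hash collision could in theory hide a differing line.

```python
import hashlib


def line_digest(line):
    # Normalise the same way the question does (rstrip), then reduce the
    # line to a fixed 16-byte digest so long lines do not cost extra memory.
    data = line.rstrip().encode("utf-8")
    return hashlib.blake2b(data, digest_size=16).digest()


def compare_dump_fast(before_filename, after_filename, diff_filename):
    """Write the lines of after_filename that never occur in before_filename.

    Hypothetical helper: returns the number of lines written.
    """
    # Pass 1: stream the first dump once and keep only the digests.
    with open(before_filename, "r") as before_file:
        seen = {line_digest(line) for line in before_file}

    # Pass 2: stream the second dump and write unseen lines immediately,
    # so the diff itself is never accumulated in a list.
    written = 0
    with open(after_filename, "r") as after_file, \
            open(diff_filename, "w") as diff_file:
        for line in after_file:
            if line_digest(line) not in seen:
                diff_file.write(line)
                written += 1
    return written
```

Each file is read exactly once, so the work is roughly proportional to the total number of lines rather than to the product of the two line counts.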
I was able to diff two 20 MB files in a little over 3 minutes using the following test code.
Every 10,000 lines I write a random number, which you can see diffed in the results.
import random
import difflib
import os
import time

start = time.time()
NUM_LINES = int(10000000 / 4)
t1 = 'test1'
t2 = 'test2'

if os.path.exists(t1):
    os.remove(t1)

if os.path.exists(t2):
    os.remove(t2)

with open(t1, 'w+') as f1:
    for number in range(1, NUM_LINES):
        if number % 10000 == 0:
            r = random.randint(1, number)
        else:
            r = 1
        f1.write(str(number * r) + '\n')
    else:
        # Runs once the loop finishes: rewind so the file can be read back.
        f1.seek(0)

    with open(t2, 'w+') as f2:
        for number in range(1, NUM_LINES):
            if number % 10000 == 0:
                r = random.randint(1, number)
            else:
                r = 1
            f2.write(str(number * r) + '\n')
        else:
            f2.seek(0)

        # Both files must still be open here; the names are reused for the line lists.
        t1 = f1.readlines()
        t2 = f2.readlines()

        for l in difflib.unified_diff(t1, t2, lineterm=''):
            print(l.strip())

print('Execution took: {:.2f} seconds'.format(time.time() - start))
I pasted the output on GitHub, as it is obscenely long.
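Since that output is not reproduced here, the following small sketch shows the kind of result difflib.unified_diff yields. The two four-line lists are made-up stand-ins for the generated files, and the fromfile and tofile labels are only illustrative. Note that, like the test above, this passes whole lists of lines to difflib, so both inputs sit in memory.

```python
import difflib

# Made-up stand-ins for the two generated files: only the third line differs.
before = ["1\n", "2\n", "3\n", "4\n"]
after = ["1\n", "2\n", "30\n", "4\n"]

# unified_diff returns a generator; list() collects the header, hunk and body lines.
diff_lines = list(difflib.unified_diff(
    before, after, fromfile="test1", tofile="test2", lineterm=""))

for line in diff_lines:
    # Strip only the trailing newline so the leading space that marks
    # unchanged context lines is preserved.
    print(line.rstrip("\n"))
```

Removed lines are prefixed with "-", added lines with "+", and unchanged context lines with a single space.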