
diff two big files in Python

I have two big text files, each close to 2GB. I need something like diff f1.txt f2.txt. Is there any way to do this task fast in Python? The standard difflib is too slow. I assume there is a faster way, because difflib is implemented entirely in Python.
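Since the question itself mentions diff f1.txt f2.txt, one option (a sketch, not from the original answer) is to shell out to the system diff tool from Python. diff streams the files instead of loading them into memory and is typically much faster than pure-Python difflib on multi-GB inputs. This assumes a Unix-like system with diff on the PATH; fast_diff is a hypothetical helper name.

```python
import subprocess

def fast_diff(path1, path2):
    # diff exits with 0 (files identical), 1 (files differ), >1 (error)
    result = subprocess.run(
        ["diff", path1, path2],
        capture_output=True,
        text=True,
    )
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

For very large outputs you would stream result line by line (e.g. with subprocess.Popen) rather than collect stdout into one string, but the exit-code handling is the same.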

How about using difflib in a way that your script can handle big files? Don't load the whole files into memory; iterate through the lines of both files and diff them in chunks, e.g. 100 lines at a time.

import difflib

d = difflib.Differ()

with open('bigfile1') as f1, open('bigfile2') as f2:
    b1 = []
    b2 = []
    # note: zip() stops at the end of the shorter file
    for n, (line1, line2) in enumerate(zip(f1, f2), 1):
        b1.append(line1)
        b2.append(line2)
        if n % 100 == 0:
            # compare() expects sequences of lines, not joined strings
            print(''.join(d.compare(b1, b2)), end='')
            b1 = []
            b2 = []
    # flush whatever is left in the buffers
    print(''.join(d.compare(b1, b2)), end='')
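One pitfall worth noting: Differ.compare expects two sequences of lines, so passing "".join(b1) would make it diff character by character instead of line by line. A minimal self-contained example of the intended call (the sample lines here are illustrative, not from the original files):

```python
import difflib

d = difflib.Differ()
result = list(d.compare(["one\n", "two\n"], ["one\n", "three\n"]))
# each output line is prefixed with "  " (common to both),
# "- " (only in the first sequence), or "+ " (only in the second)
```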
