
Make a diff of 2 text files quickly using Python

I have 2 big text files (17 MB right now, but they could grow to gigabytes), so I don't want to load them into RAM: their size could exceed my memory capacity.

The code I have written so far is this:

import os
import time


def stopIfFileExist(filename):
    # Refuse to overwrite an existing diff file.
    if os.path.isfile(filename):
        raise Exception("%s already exists" % filename)


def compareDump(before_filename, after_filename, diff_filename):
    """
    Compare 2 dumps generated via makeDump(output_filename) and generate
    a file containing the differences
        -before_filename : (string) filename of the first dump
        -after_filename : (string) filename of the second dump
        -diff_filename : (string) filename of the diff
    """

    stopIfFileExist(diff_filename)

    # Count the lines once so progress can be reported as a percentage.
    with open(after_filename, "r") as afterFile:
        num_lines = sum(1 for line in afterFile)
    one_percent = num_lines / float(100)

    diff = []

    start = time.time()

    with open(after_filename, "r") as afterFile:
        counter = 0
        for a_line in afterFile:
            print("completion : %.9f percents" % (counter / float(one_percent)))
            counter = counter + 1
            # Assume the line is new, then drop it if "before" contains it.
            diff.append(a_line)
            # The whole "before" file is re-read for every "after" line.
            with open(before_filename, "r") as beforeFile:
                for b_line in beforeFile:
                    if a_line.rstrip() == b_line.rstrip():
                        diff.pop()
                        break

    end = time.time()
    print("task completed in %s seconds" % (end - start))

    with open(diff_filename, "a") as diffFile:
        for line in diff:
            diffFile.write(line)

What I'd like to do is remove from beforeFile each line that was successfully compared (that is, when the condition if a_line.rstrip() == b_line.rstrip(): is true), so that it is not scanned again.

However, since I am currently reading the file, I don't see how to do it.

Any ideas?

Thanks.
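One way to avoid re-reading beforeFile for every line, without editing the file while it is open, is to read it once and remember a small digest of each line. The sketch below is only an illustration under assumptions: the names line_digest and compare_dump_hashed are hypothetical, the files are assumed to be UTF-8 text, and memory still grows with the number of distinct lines in the first file (16 bytes of digest per line plus set overhead), so it may not fit for extremely large inputs. Like the code above, it reports the lines of the second file that do not appear anywhere in the first.

```python
import hashlib


def line_digest(line):
    """Return a 16-byte digest of a line, ignoring trailing whitespace."""
    data = line.rstrip().encode("utf-8")
    return hashlib.blake2b(data, digest_size=16).digest()


def compare_dump_hashed(before_filename, after_filename, diff_filename):
    """Write the lines of after_filename that never occur in before_filename.

    Hypothetical helper: returns the number of lines written.
    """
    # Single pass over the "before" file: keep one digest per distinct line.
    with open(before_filename, "r", encoding="utf-8") as before_file:
        seen = {line_digest(line) for line in before_file}

    # Single pass over the "after" file: a set lookup replaces the inner loop.
    written = 0
    with open(after_filename, "r", encoding="utf-8") as after_file, \
            open(diff_filename, "w", encoding="utf-8") as diff_file:
        for line in after_file:
            if line_digest(line) not in seen:
                diff_file.write(line)
                written += 1
    return written
```

Each file is read exactly once, so the running time is roughly proportional to the combined size of the two files instead of their product.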

I was able to diff two 20-megabyte files in a little over 3 minutes using the following test code.

Every 10,000th line gets a random value, which you can see diff'd in the results.

import difflib
import random
import time

start = time.time()

NUM_LINES = 10000000 // 4
t1 = 'test1'
t2 = 'test2'


def write_test_file(path):
    # Write the numbers 1..NUM_LINES-1, one per line. Every 10,000th
    # number is multiplied by a random factor so the two files differ there.
    # Mode 'w' truncates any file left over from a previous run.
    with open(path, 'w') as f:
        for number in range(1, NUM_LINES):
            if number % 10000 == 0:
                r = random.randint(1, number)
            else:
                r = 1
            f.write(str(number * r) + '\n')


write_test_file(t1)
write_test_file(t2)

# readlines() loads both files fully into memory before diffing.
with open(t1) as f1, open(t2) as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

for l in difflib.unified_diff(lines1, lines2, lineterm=''):
    print(l.strip())

print('Execution took: {:.2f} seconds'.format(time.time() - start))

I pasted the output on GitHub, as it is obscenely long.
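For files shaped like the two test files here (same number of lines, differing only at matching positions), a plain line-by-line comparison can run with both files streamed instead of held in memory. This is a sketch under that assumption, not a replacement for difflib: positional_diff is a hypothetical name, and the function does not realign after an inserted or deleted line, so every following line would be reported as different.

```python
from itertools import zip_longest


def positional_diff(path_a, path_b):
    """Yield (line_number, line_a, line_b) wherever the two files differ.

    Lines are compared position by position. If one file is shorter,
    its missing lines are reported as None.
    """
    with open(path_a, "r") as file_a, open(path_b, "r") as file_b:
        # zip_longest pulls one line from each file at a time, so only
        # the current pair of lines is held in memory.
        pairs = zip_longest(file_a, file_b)
        for line_number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                yield line_number, line_a, line_b
```

Calling list(positional_diff('test1', 'test2')) on the files generated above would collect the differing positions; iterating over the generator directly avoids building even that list.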
