Quickly making a diff of 2 text files using Python

Question

I have 2 big text files (17 MB right now, but they could grow to GBs). I don't want to load them into RAM, because their size could exceed my RAM capacity.

The code I have written so far is this:

import os
import time

def stopIfFileExist(filename):
    # Refuse to overwrite an existing diff file.
    if os.path.isfile(filename):
        raise Exception("%s already exists" % filename)

def compareDump(before_filename, after_filename, diff_filename):
    """
    Compare 2 dumps generated via makeDump(output_filename) and generate
    a file containing the differences
        -before_filename : (string) filename of the first dump
        -after_filename : (string) filename of the second dump
        -diff_filename : (string) filename of the diff
    """

    stopIfFileExist(diff_filename)

    # Count the lines once so that progress can be reported.
    with open(after_filename, "r") as afterFile:
        num_lines = sum(1 for line in afterFile)
    one_percent = num_lines / float(100)

    diff = []

    start = time.time()

    with open(after_filename, "r") as afterFile:
        counter = 0
        for a_line in afterFile:
            print("completion : %.9f percents" % (counter / float(one_percent)))
            counter = counter + 1
            # Assume the line is new, then drop it again if it is found.
            diff.append(a_line)
            # The whole before file is rescanned for every after line.
            with open(before_filename, "r") as beforeFile:
                for b_line in beforeFile:
                    if a_line.rstrip() == b_line.rstrip():
                        diff.pop()
                        break

    end = time.time()
    print("task completed in %s seconds" % (end - start))

    with open(diff_filename, "w") as diffFile:
        for line in diff:
            diffFile.write(line)

What I'd like to do is remove from beforeFile each line that was successfully compared (i.e., when the condition if a_line.rstrip() == b_line.rstrip(): is true).

However, since I am reading the file at that point, I don't see how to do it.

Any ideas?

Thanks.
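
For illustration, one way to get the effect of "removing" matched lines without rewriting beforeFile is to count a fixed-size digest of every line of the before file once, then stream the after file against those counts. This is only a sketch under assumptions, not something taken from the code above: the names compare_dump_consuming and line_key are hypothetical, SHA-256 is an arbitrary choice of digest, and the counts still have to fit in memory (one 32-byte digest per distinct line), so it lowers memory use rather than bounding it.

```python
import hashlib
from collections import Counter


def line_key(line):
    # Hypothetical helper: fixed-size digest of a line (bytes) with its
    # trailing whitespace removed, mirroring the rstrip() comparison.
    return hashlib.sha256(line.rstrip()).digest()


def compare_dump_consuming(before_filename, after_filename, diff_filename):
    # Pass 1: count how often each (hashed) line occurs in the before file.
    remaining = Counter()
    with open(before_filename, "rb") as before_file:
        for b_line in before_file:
            remaining[line_key(b_line)] += 1

    # Pass 2: stream the after file. A match "uses up" one before line,
    # which is the in-memory equivalent of deleting it from the file.
    with open(after_filename, "rb") as after_file, \
            open(diff_filename, "wb") as diff_file:
        for a_line in after_file:
            key = line_key(a_line)
            if remaining[key] > 0:
                remaining[key] -= 1
            else:
                diff_file.write(a_line)
```

Each file is read once, instead of the before file being rescanned for every line of the after file. If duplicates should not be consumed, which is how the nested-loop version behaves, a plain set of digests could be used in place of the Counter.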

Answer 1

I was able to diff two 20-megabyte files in a little over 3 minutes using the following test code.

Every 10,000th line is multiplied by a random number, so those lines show up as differences in the results.

import difflib
import random
import time

start = time.time()

NUM_LINES = int(10000000 / 4)
t1 = 'test1'
t2 = 'test2'


def write_test_file(path):
    # Write the numbers 1..NUM_LINES-1, one per line. Every 10,000th line
    # is multiplied by a random factor so that the two files differ there.
    # Mode 'w' truncates any existing file, so no explicit removal is needed.
    with open(path, 'w') as f:
        for number in range(1, NUM_LINES):
            if number % 10000 == 0:
                r = random.randint(1, number)
            else:
                r = 1
            f.write(str(number * r) + '\n')


write_test_file(t1)
write_test_file(t2)

# Note: readlines() loads both files fully into memory.
with open(t1) as f1, open(t2) as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

for l in difflib.unified_diff(lines1, lines2, lineterm=''):
    print(l.strip())

print('Execution took: {:.2f} seconds'.format(time.time() - start))

I pasted the output on GitHub, as it is obscenely long.
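
Since the full output is too long to show here, a tiny stand-in may help to see what kind of lines difflib.unified_diff produces. The two three-line lists below are made-up sample data, not taken from the test files; the call itself uses the same lineterm='' argument and strip() as the code above.

```python
import difflib

# Made-up sample data standing in for the contents of the two test files.
before = ['1\n', '2\n', '3\n']
after = ['1\n', '20\n', '3\n']

# Same call shape as above: removed lines are prefixed with '-',
# added lines with '+', and unchanged context lines with a space
# (which strip() then removes).
diff_lines = [l.strip() for l in difflib.unified_diff(before, after, lineterm='')]

for l in diff_lines:
    print(l)
```

Note that unified_diff takes sequences of lines, so both inputs are held in memory in full, as with readlines() in the test code.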

Answer 1: -1, 2016-06-07 14:12:57
