
Quickly find differences between two large text files

I have two 3 GB text files, each with around 80 million lines. They share 99.9% of their lines (file A has 60,000 unique lines, file B has 80,000 unique lines).

How can I quickly find the unique lines in these two files? Is there any ready-to-use command-line tool for this? I'm using Python, but I suspect it will be hard to find an efficient Pythonic way to load and compare the files.

Any suggestions are appreciated.

If order matters, try the comm utility. If order doesn't matter: sort file1 file2 | uniq -u
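For illustration, a minimal sketch of both commands (the tiny sample files below are stand-ins for the real 3 GB inputs):

```shell
# Tiny stand-ins for the two large files
printf 'a\nb\nc\n' > fileA
printf 'a\nb\nd\n' > fileB

# Lines appearing in only one file (order does not matter).
# Note: a line duplicated *within* one file is also dropped by uniq -u.
sort fileA fileB | uniq -u

# comm requires sorted inputs; with -3 it suppresses common lines,
# leaving lines unique to fileA (column 1) and to fileB (column 2).
sort fileA > fileA.sorted
sort fileB > fileB.sorted
comm -3 fileA.sorted fileB.sorted
```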

I think this is the fastest method (whether it's in Python or another language shouldn't matter too much, IMO).

Notes:

1. I only store each line's hash, to save space (and time, if paging might occur).

2. Because of the above, I only print out line numbers; if you need the actual lines, you'd just need to read the files again.

3. I assume that the hash function produces no collisions. This is nearly, but not perfectly, certain.

4. I import hashlib because the built-in hash() function is too short to avoid collisions.

import sys
import hashlib

# one dict per file, mapping each line's hash to "filename: line number"
lines = []
for i in range(2):
    lines.append({})
    # open the files named on the command line
    with open(sys.argv[1 + i], 'r') as f:
        # counting lines starting with 1
        for counter, line in enumerate(f, start=1):
            # assuming the default encoding is sufficient to handle the input file
            hashcode = hashlib.sha512(line.encode()).hexdigest()
            lines[i][hashcode] = sys.argv[1 + i] + ': ' + str(counter)

# hashes present in only one of the two files
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
print('\n'.join(result))

With 60,000 or 80,000 unique lines you could just create a dictionary for each unique line, mapping it to a number: mydict["hello world"] => 1, etc. If your average line is around 40-80 characters, this will be in the neighborhood of 10 MB of memory.

Then read each file, converting it to an array of numbers via the dictionary. Those will fit easily in memory (2 files of 8 bytes * 3GB / 60k lines is less than 1 MB of memory). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.
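A minimal sketch of that idea (illustrative only: small in-memory lists stand in for the two files' lines, and difflib's SequenceMatcher is one possible way to diff the resulting number sequences):

```python
import difflib

# Stand-ins for the lines of the two files
fileA = ['x', 'common', 'y', 'common']
fileB = ['x', 'common', 'z', 'common']

# Map each distinct line to a small integer
ids = {}

def to_ids(lines):
    # setdefault assigns the next free integer on first sight of a line
    return [ids.setdefault(line, len(ids)) for line in lines]

numsA, numsB = to_ids(fileA), to_ids(fileB)

# Diff the integer sequences; the inverted dict recovers the line text
inv = {n: line for line, n in ids.items()}
sm = difflib.SequenceMatcher(a=numsA, b=numsB, autojunk=False)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag != 'equal':
        print(tag, [inv[n] for n in numsA[i1:i2]],
                   [inv[n] for n in numsB[j1:j2]])
```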

EDIT:

In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.

#!/usr/bin/python

class Reader:

    def __init__(self, file):
        self.count = 0
        self.dict = {}
        self.file = file

    def readline(self):
        line = self.file.readline()
        if not line:
            return None
        if line in self.dict:
            return self.dict[line]
        else:
            # first time we see this line: assign it the next number
            self.count = self.count + 1
            self.dict[line] = self.count
            return self.count

if __name__ == '__main__':
    print("Type Ctrl-D to quit.")
    import sys
    r = Reader(sys.stdin)
    result = 'ignore'
    while result:
        result = r.readline()
        print(result)

If I understand correctly, you want the lines of these files without duplicates. This does the job:

uniqA = set(open('fileA', 'r'))
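Extending that one-liner into a full comparison might look like the sketch below (the tiny files written here are hypothetical stand-ins; note this holds every distinct line of both files in memory, so with two 3 GB inputs it needs a machine with enough RAM for both line sets):

```python
# Tiny stand-ins for the two large files (hypothetical content)
with open('fileA', 'w') as f:
    f.write('shared 1\nshared 2\nonly in A\n')
with open('fileB', 'w') as f:
    f.write('shared 1\nshared 2\nonly in B\n')

# Each set holds the distinct lines of one file
with open('fileA') as fa, open('fileB') as fb:
    uniqA = set(fa)
    uniqB = set(fb)

only_in_A = uniqA - uniqB  # present in fileA, absent from fileB
only_in_B = uniqB - uniqA  # present in fileB, absent from fileA
```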

Python has difflib, which claims to be quite competitive with other diff utilities; see: http://docs.python.org/library/difflib.html
