
Faster way to remove duplicates from a very large text file in Python?

I have a very large text file with duplicate entries which I want to eliminate. I do not care about the order of the entries because the file will later be sorted.

Here is what I have so far:

unique_lines = set()
outfile = open("UniqueMasterList.txt", "w", encoding = "latin-1")

with open("MasterList.txt", "r", encoding = "latin-1") as infile:
    for line in infile:
        if line not in unique_lines:
            outfile.write(line)
            unique_lines.add(line)

outfile.close()

It has been running for 30 minutes and has not finished. I need it to be faster. What is a faster approach in Python?

Look for the corresponding system command. On Linux/UNIX, you would use

uniq MasterList.txt > UniqueMasterList.txt

The OS generally knows the best way to do these things.


post-comment edit

@Mark Ransom reminded me that uniq depends on matching lines being contiguous in the file. The simplest way to achieve this is to sort the file:

sort MasterList.txt | uniq > UniqueMasterList.txt
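
If you want to drive this from a Python script rather than the shell, one sketch (assuming a Unix-like system with sort on the PATH) is to call the same tool via subprocess; sort -u folds the uniq step into the sort:

import subprocess

# A sketch: let the system sort do the work. -u removes duplicate lines
# (equivalent to sort | uniq) and -o writes the result file.
# Assumes a Unix-like system with sort available on the PATH.
subprocess.run(
    ["sort", "-u", "-o", "UniqueMasterList.txt", "MasterList.txt"],
    check=True,
)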

To use the same technique as uniq, in Python:

import itertools

with open("MasterList.txt", "r", encoding="latin-1") as infile:
    sorted_file = sorted(infile)

with open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    for line, _ in itertools.groupby(sorted_file):
        outfile.write(line)

This presumes that the entire file will fit into memory, twice. Or that the file is already sorted and you can skip that step.
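
If the file is already sorted, a minimal sketch that avoids holding everything in memory is to feed the file object straight to groupby, which only compares consecutive lines:

import itertools

# A sketch, assuming MasterList.txt is already sorted: groupby works on
# the file iterator directly, so only one run of equal lines is examined
# at a time and the whole file never has to be loaded.
with open("MasterList.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    for line, _ in itertools.groupby(infile):
        outfile.write(line)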

The simple approach I would suggest is hashing with a hash table. Hash each line with an efficient hash function, insert it into the hash table, and write each distinct line out once, much like solving word/letter counting problems with a hash table. Lookups cost only O(1), and memory usage can be kept bounded, depending on the size of the hash table used. The code below applies this idea by splitting the input into bucket files according to each line's hash, then deduplicating each bucket on its own, so only one bucket has to fit in memory at a time.

SPLIT_COUNT = 30


def write_data(t_file, value):
    t_file.write(value)


def calculate_hash(filename, handle_file):
    # Distribute lines across the split files by hash value, so that
    # identical lines always land in the same split file.
    with open(filename, 'r') as f:
        for line in f:
            write_data(handle_file[hash(line) % SPLIT_COUNT], line)


def generate_file(dir):
    # Create SPLIT_COUNT split files in dir (the directory must already
    # exist) and return their paths and open write handles.
    handle_file, files = [], []
    for i in range(SPLIT_COUNT):
        path = dir + "split_" + str(i)
        files.append(path)
        handle_file.append(open(path, 'w'))
    return files, handle_file


def close_file(handle_file):
    for f in handle_file:
        f.close()


def data_uniq(files, new_file):
    # Deduplicate each split file independently. Duplicate lines share a
    # hash, so they are guaranteed to sit in the same split file; only one
    # split's worth of lines is held in memory at a time.
    n_file = open(new_file, 'w')
    for filename in files:
        dataset = {}
        with open(filename, 'r') as f:
            for line in f:
                dataset[line] = 1
        for key in dataset.keys():
            n_file.write(key)
    n_file.close()


if __name__ == "__main__":
    filename = './clean.txt'
    generate_dir = './tmp/'
    new_file = './out.txt'
    files, handle_file = generate_file(generate_dir)
    calculate_hash(filename, handle_file)
    close_file(handle_file)
    data_uniq(files, new_file)
