I have a very large text file with duplicate entries which I want to eliminate. I do not care about the order of the entries because the file will later be sorted.
Here is what I have so far:
unique_lines = set()
outfile = open("UniqueMasterList.txt", "w", encoding="latin-1")
with open("MasterList.txt", "r", encoding="latin-1") as infile:
    for line in infile:
        if line not in unique_lines:
            outfile.write(line)
            unique_lines.add(line)
outfile.close()
It has been running for 30 minutes and has not finished. I need it to be faster. What is a faster approach in Python?
Look for the corresponding system command. In Linux/UNIX, you would use
uniq MasterList.txt > UniqueMasterList.txt
The OS generally knows the best way to do these things.
post-comment edit
@Mark Ransom reminded me that uniq depends on matching lines being contiguous in the file. The simplest way to achieve this is to sort the file:
sort MasterList.txt | uniq > UniqueMasterList.txt
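If you would rather drive the same pipeline from Python instead of a shell, here is a minimal sketch using the standard subprocess module (it assumes the external sort and uniq commands are on the PATH and reuses the filenames from the question):

import subprocess

# Sketch: run "sort MasterList.txt | uniq" and send the result to
# UniqueMasterList.txt. Assumes sort and uniq are installed.
with open("UniqueMasterList.txt", "w") as outfile:
    sort_proc = subprocess.Popen(["sort", "MasterList.txt"], stdout=subprocess.PIPE)
    subprocess.run(["uniq"], stdin=sort_proc.stdout, stdout=outfile, check=True)
    sort_proc.stdout.close()
    sort_proc.wait()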
To use the same technique as uniq, in Python:
import itertools

with open("MasterList.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    # sort every line, then let groupby collapse each run of duplicates
    sorted_file = sorted(infile.readlines())
    for line, _ in itertools.groupby(sorted_file):
        outfile.write(line)
This presumes that the entire file will fit into memory, twice, or that the file is already sorted so that step can be skipped.
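If the file really is already sorted, the same groupby idea works as a single streaming pass, so only the current line needs to be in memory. A sketch under that assumption:

import itertools

# Sketch, assuming MasterList.txt is already sorted: groupby collapses
# each run of identical adjacent lines into one, using constant memory.
with open("MasterList.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    for line, _ in itertools.groupby(infile):
        outfile.write(line)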
The simple approach I would suggest is to use a hash table. Hash each line with an efficient hash function, insert it into the table, and write out the entries whose count is 1, much like using a hash table to solve a word/letter counting problem. A lookup costs only O(1), and memory use can be kept to a roughly constant amount, depending on how many hash buckets you use. The code below splits the input into bucket files by hash value (so identical lines always land in the same bucket) and then deduplicates each bucket on its own.
import os

SPLIT_COUNT = 30

def write_data(t_file, value):
    t_file.write(value)

def calculate_hash(filename, handle_file):
    # Route every line to one of the split files based on its hash;
    # identical lines always end up in the same split file.
    with open(filename, 'r') as f:
        for line in f:
            write_data(handle_file[hash(line) % SPLIT_COUNT], line)

def generate_file(dir):
    # Create SPLIT_COUNT bucket files under dir and keep their handles open.
    os.makedirs(dir, exist_ok=True)
    handle_file, files = [], []
    for i in range(SPLIT_COUNT):
        path = dir + "split_" + str(i)
        files.append(path)
        f = open(path, 'w')
        handle_file.append(f)
    return files, handle_file

def close_file(handle_file):
    for i in range(len(handle_file)):
        handle_file[i].close()

def data_uniq(files, new_file):
    # Deduplicate one bucket at a time; only a single bucket's lines
    # have to fit in memory.
    dataset = dict()
    n_file = open(new_file, 'w')
    for filename in files:
        f = open(filename, 'r')
        for line in f:
            dataset[line] = 1
        f.close()
        for key in dataset.keys():
            n_file.write(key)
        dataset = {}
    n_file.close()

if __name__ == "__main__":
    filename = './clean.txt'
    generate_dir = './tmp/'
    new_file = './out.txt'

    files, handle_file = generate_file(generate_dir)
    calculate_hash(filename, handle_file)
    close_file(handle_file)
    data_uniq(files, new_file)