
Removing duplicates from a text file using Python

I have a text file; let's say it contains these 10 lines:

Bye
Hi
2
3
4
5
Hi
Bye
7
Hi

Every time "Hi" or "Bye" appears, I want it removed, except for the first occurrence. My current code is below (yes, filename actually points to a file, I just didn't include that part here):

text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line
text_file.close()

It does detect the duplicates, but it takes a very long time given how many lines there are, and I'm not sure how to delete them and save the file as well.

Using a set & some basic filtering logic:

with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # first occurrence: keep it
            deduped.append(line)
            seen.add(line)

# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])
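If the file is too big to hold the whole deduped list in memory, the same idea can stream first occurrences to a temporary file and swap it in afterwards, so only the seen set stays in memory. A minimal sketch using the standard tempfile and os modules (test.txt is the same example file as above):

import os
import tempfile

src = 'test.txt'  # same example file as above

seen = set()
# stream through the file, writing first occurrences to a temp file
# in the same directory, then swap it in place of the original
with open(src) as f, tempfile.NamedTemporaryFile(
        'w', dir=os.path.dirname(src) or '.', delete=False) as tmp:
    for line in f:
        key = line.rstrip('\n')
        if key not in seen:
            seen.add(key)
            tmp.write(line)
os.replace(tmp.name, src)

os.replace swaps the files in a single step on most platforms, so an interrupted run leaves the original file intact.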

You could use dict.fromkeys to remove duplicates and preserve order efficiently:

with open(filename, "r") as f:
    # dict keys are unique and (in Python 3.7+) keep insertion order
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)  # iterating a dict yields its keys

Idea from Raymond Hettinger
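As a quick check of what dict.fromkeys does with the sample lines (note f.readlines() keeps the newlines, which is why nothing needs to be re-added when writing):

>>> lines = ['Bye\n', 'Hi\n', '2\n', 'Hi\n', 'Bye\n']
>>> list(dict.fromkeys(lines))
['Bye\n', 'Hi\n', '2\n']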
