
How to remove duplicate lines from a text document while keeping the original order

I have a text document with words, one per line:

text1
text2
text3
text2
text4
text4
text2
text3

Now I want to remove all duplicates, keeping only the unique lines in their original order:

text1
text2
text3
text4

I have tried several solutions, but none of them works correctly for me.

this one keeps only unique lines,

# raw strings (r'...') keep '\f' from being read as a form-feed escape
with open(r'C:\folder\filedoc.txt', 'r') as lines:
    lines_set = {line.strip() for line in lines}
with open(r'C:\folder\filedoc.txt', 'w') as out:
    for line in lines_set:
        out.write(line + '\n')

but not the order:

1. text2
2. text5
3. text3
4. text4
5. text1

This one keeps the order, but also keeps the repeated words:

with open(r'C:\folder\filedoc.txt', 'r') as lines:
    lines_set = []
    for line in lines:
        if line.strip() not in lines_set:
            lines_set.append(line.strip())

This one works well, but only with text from input:

with open(r'C:\my_path\doc.txt', 'r') as lines:
    lines_set = []
    for line in lines:
        if line.strip() not in lines_set:
            lines_set.append(line.strip())

I don't want to use input; I need to deduplicate the existing list itself. On each cycle I add a new word to the text file, and when a certain condition is met (not on every cycle) I want to remove all duplicated words at once. I need a continually growing list, one word per line, that keeps its original order after the duplicates are removed.

This code works correctly for me, exactly as I need, but going this way with def and functions gives wrong results with the returned list in many other situations:

def loadlines1(f):
    with open(f, 'r') as lines:
        lines_set = []
        for line in lines:
            if line.strip() not in lines_set:
                lines_set.append(line.strip())
    return lines_set

def loadlines2(f):
    with open(f, 'r') as lines:
        lines_set = []
        for line in lines:
            lines_set.append(line.strip())
    return lines_set

def removeDuplicates(l):
    out = list(set(l))
    for i in enumerate(out):
        out[i[0]] = l.index(i[1])
    out.sort()
    for i in enumerate(out):
        out[i[0]] = l[i[1]]
    return out

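def savelines(f, l):
    # a context manager makes sure the file is flushed and closed
    with open(f, 'w') as out:
        out.write('\n'.join(l))

lines = loadlines2(r'C:\folder\filedoc.txt')  # raw string: '\f' would otherwise be a form feed
stripped_lines = removeDuplicates(lines)
savelines('doc.txt', stripped_lines)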

It would be good if I could avoid having to handle any return value.

Now I have found this one, but I am not sure how to apply it:

lines_seen = set() 
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen: 
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

And maybe this one too:

with open(r'C:\folder\filedoc.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)

with open(r'C:\folder\filedoc.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            a.add(line)

So can you help me figure out this problem, please?

The best solution, as I imagine it (if it is possible; I don't know exactly how to do it), would go like this: read all lines in my document and find all repeated words (rather than comparing only against a new one, as in the variant with input), then remove the extra copies and keep only the unique words, then write the resulting list over the previous document. So maybe something like this at the end of each cycle, if the condition was met during that cycle. But I'm not sure; maybe there is a better and easier way.
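
The steps described above (read every line, keep only the first occurrence of each word, write the result over the same file) could be sketched roughly as follows. The function name `dedupe_file_in_place` and the demo file name are made up for illustration. The sketch assumes the file fits in memory, that lines are compared after stripping surrounding whitespace, and that blank lines can be dropped.

```python
import os
import tempfile


def dedupe_file_in_place(path):
    """Rewrite `path`, keeping only the first occurrence of each line."""
    seen = set()    # lines already kept (fast membership test)
    unique = []     # kept lines, in their original order
    with open(path, 'r') as src:
        for raw in src:
            line = raw.strip()
            # assumption: blank lines are skipped, repeats are dropped
            if line and line not in seen:
                seen.add(line)
                unique.append(line)
    with open(path, 'w') as dst:
        for line in unique:
            dst.write(line + '\n')
    return unique


# Demo on a throwaway file (hypothetical name) holding the sample data from the question.
demo_path = os.path.join(tempfile.mkdtemp(), 'filedoc_demo.txt')
with open(demo_path, 'w') as f:
    f.write('text1\ntext2\ntext3\ntext2\ntext4\ntext4\ntext2\ntext3\n')

result = dedupe_file_in_place(demo_path)
print(result)
```

The return value is only there for convenience and can be ignored; the call could simply be made at the end of any cycle in which the condition was met.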

You can get a list in original order with all duplicates removed by doing something like this:

from collections import OrderedDict
no_duplicates = list(OrderedDict.fromkeys(f.readlines()))

And then all you have to do is write it back to the file.

This should work:

from collections import OrderedDict

with open('file.txt', 'r') as f:
    items = list(OrderedDict.fromkeys(f.readlines()))

with open('file.txt', 'w') as f:
    for item in items:
        f.write(item)
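
One caveat with `readlines()`: each line keeps its trailing newline, so if the last line of the file has no newline, `'text2'` and `'text2\n'` count as different entries. A possible variant that avoids this is sketched below. It relies on plain `dict` keeping insertion order (guaranteed since Python 3.7), and the helper name `unique_lines` and the demo file name are hypothetical.

```python
import os
import tempfile


def unique_lines(path):
    """Return the lines of `path` in first-seen order, without duplicates."""
    with open(path, 'r') as f:
        # splitlines() drops the line endings, so a final line
        # without '\n' compares equal to its earlier copies
        lines = f.read().splitlines()
    # dict keys keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(lines))


# Demo on a throwaway file (hypothetical name); note the missing trailing newline.
demo_path = os.path.join(tempfile.mkdtemp(), 'file_demo.txt')
with open(demo_path, 'w') as f:
    f.write('text1\ntext2\ntext1\ntext2')

items = unique_lines(demo_path)
with open(demo_path, 'w') as f:
    f.write('\n'.join(items) + '\n')

print(items)
```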

