
How can I write the lines from the first text file that are not present in the second text file?

I would like to compare two text files. The first text file has lines that aren't in the second text file. I would like to copy these lines and write them to a new txt file. I would like a Python script for this as I do this a lot and do not want to go online constantly to find these new lines. I do not need to acknowledge if there is something in file2 that is not in file1.

I have written some code that works inconsistently, and I am unsure what I am doing wrong.

newLines = open("file1.txt", "r")
originalLines = open("file2.txt", "r")
output = open("output.txt", "w")

lines1 = newLines.readlines()
lines2 = originalLines.readlines()
newLines.close()
originalLines.close()

duplicate = False
for line in lines1:
    if line.isspace():
        continue
    for line2 in lines2:
        if line == line2:
            duplicate = True
            break

    if duplicate == False:
        output.write(line)
    else:
        duplicate = False

output.close()

For file1.txt:

Man
Dog
Axe
Cat
Potato
Farmer

file2.txt:

Man
Dog
Axe
Cat

The output.txt should be:

Potato
Farmer

but it is instead this:

Cat
Potato
Farmer

Any help would be much appreciated!

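Based on the behavior, file2.txt doesn't end with a newline, so the contents of lines2 are ['Man\n', 'Dog\n', 'Axe\n', 'Cat']. Note the lack of a newline after 'Cat': the matching line from file1.txt is read as 'Cat\n', so the two strings never compare equal and 'Cat' is written to the output.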
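A quick way to check this diagnosis is to reproduce it with a throwaway file. This is only a sketch: the helper name read_lines_raw and the temporary file are made up for illustration and stand in for the question's file2.txt.

```python
import os
import tempfile


def read_lines_raw(text):
    """Write text to a temporary file and return what readlines() gives back.

    Hypothetical helper, used only to imitate reading file2.txt.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        with open(path) as f:
            return f.readlines()
    finally:
        os.remove(path)


# No newline after the last line, as assumed for the question's file2.txt
lines2 = read_lines_raw("Man\nDog\nAxe\nCat")
print(lines2)                 # ['Man\n', 'Dog\n', 'Axe\n', 'Cat']
print("Cat\n" == lines2[-1])  # False: 'Cat\n' from file1.txt never matches 'Cat'
```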

I'd suggest normalizing your lines so they don't have newlines, replacing:

lines1 = newLines.readlines()
lines2 = originalLines.readlines()

with:

lines1 = [line.rstrip('\n') for line in newLines]
# Set comprehension makes lookup cheaper and dedupes
lines2 = {line.rstrip('\n') for line in originalLines}

and changing:

output.write(line)

to:

print(line, file=output)

which will add the newline for you. Really, the best solution is to avoid the inner loop entirely, changing all of this:

for line2 in lines2:
    if line == line2:
        duplicate = True
        break

if duplicate == False:
    output.write(line)
else:
    duplicate = False

to just:

if line not in lines2:
    print(line, file=output)

which, if you use a set for lines2 as I suggest, makes the cost of the test drop from linear in the number of lines in file2.txt to roughly constant no matter the size of file2.txt (as long as the set of unique lines can fit in memory at all).
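To see the set-based test in isolation, here is a minimal sketch that works on plain lists of already-stripped lines. The function name lines_not_in is an assumption for illustration, not part of the original code.

```python
def lines_not_in(new_lines, original_lines):
    """Return the entries of new_lines that never occur in original_lines.

    The order of new_lines is preserved. Hypothetical helper for illustration.
    """
    # Build the set once; each "in" test afterwards is roughly constant time
    seen = set(original_lines)
    return [line for line in new_lines if line not in seen]


lines1 = ["Man", "Dog", "Axe", "Cat", "Potato", "Farmer"]
lines2 = ["Man", "Dog", "Axe", "Cat"]
print(lines_not_in(lines1, lines2))  # ['Potato', 'Farmer']
```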

Even better, use with statements for your open files, and stream file1.txt rather than holding it in memory at all, and you end up with:

with open("file2.txt") as origlines:
    lines2 = {line.rstrip('\n') for line in origlines}

with open("file1.txt") as newlines, open("output.txt", "w") as output:
    for line in newlines:
        line = line.rstrip('\n')
        if line.strip() and line not in lines2:
            print(line, file=output)
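Since you run this comparison often, it may help to wrap the same approach in a function. The following is a sketch under stated assumptions: the name write_new_lines, its parameters, and the returned count are mine, and the demo uses temporary files rather than your real ones. It skips blank or whitespace-only lines with line.strip(), because an empty string (what a blank line becomes after stripping the newline) is not reported as whitespace by isspace().

```python
import os
import tempfile


def write_new_lines(new_path, original_path, output_path):
    """Write lines of new_path that are absent from original_path to output_path.

    Returns the number of lines written. Hypothetical helper for illustration.
    """
    with open(original_path) as original:
        known = {line.rstrip("\n") for line in original}

    written = 0
    with open(new_path) as new, open(output_path, "w") as output:
        for line in new:
            line = line.rstrip("\n")
            # Skip empty and whitespace-only lines, then test membership
            if line.strip() and line not in known:
                print(line, file=output)
                written += 1
    return written


# Demo with the question's data; file2.txt deliberately lacks a final newline
with tempfile.TemporaryDirectory() as tmp:
    file1 = os.path.join(tmp, "file1.txt")
    file2 = os.path.join(tmp, "file2.txt")
    out = os.path.join(tmp, "output.txt")
    with open(file1, "w") as f:
        f.write("Man\nDog\nAxe\nCat\n\nPotato\nFarmer\n")
    with open(file2, "w") as f:
        f.write("Man\nDog\nAxe\nCat")
    count = write_new_lines(file1, file2, out)
    with open(out) as f:
        result = f.read()

print(count)         # 2
print(repr(result))  # 'Potato\nFarmer\n'
```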

You can use numpy for a shorter solution. Note that np.setdiff1d returns the sorted unique values, so the original line order and any repeated lines are not preserved. It uses these numpy functions: np.loadtxt (docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html), np.setdiff1d (docs: https://docs.scipy.org/doc/numpy-1.14.5/reference/generated/numpy.setdiff1d.html) and np.savetxt (docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html).

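import numpy as np

# np.loadtxt splits on whitespace by default, so this assumes one word per line
lines1 = np.loadtxt('file1.txt', dtype=str)
lines2 = np.loadtxt('file2.txt', dtype=str)

# Values in lines1 that are not in lines2 (sorted, duplicates removed)
arr = np.setdiff1d(lines1, lines2)
np.savetxt('output.txt', arr, fmt='%s')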
