I would like to compare two text files. The first text file has lines that aren't in the second text file. I would like to copy these lines and write them to a new txt file. I would like a Python script for this as I do this a lot and do not want to go online constantly to find these new lines. I do not need to acknowledge if there is something in file2 that is not in file1.
I have written some code that seems to work inconsistently. I am unsure what I am doing wrong.
newLines = open("file1.txt", "r")
originalLines = open("file2.txt", "r")
output = open("output.txt", "w")
lines1 = newLines.readlines()
lines2 = originalLines.readlines()
newLines.close()
originalLines.close()
duplicate = False
for line in lines1:
    if line.isspace():
        continue
    for line2 in lines2:
        if line == line2:
            duplicate = True
            break
    if duplicate == False:
        output.write(line)
    else:
        duplicate = False
output.close()
For file1.txt:
Man
Dog
Axe
Cat
Potato
Farmer
file2.txt:
Man
Dog
Axe
Cat
The output.txt should be:
Potato
Farmer
but it is instead this:
Cat
Potato
Farmer
Any help would be much appreciated!
Based on the behavior, file2.txt doesn't end with a newline, so lines2 is ['Man\n', 'Dog\n', 'Axe\n', 'Cat']. Note the lack of a newline for 'Cat': the 'Cat\n' read from file1.txt never compares equal to it, so it is written to the output.
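To see the mismatch directly, here is a small sketch that uses io.StringIO objects in place of the real files (an assumption made only to keep the example self-contained). It shows that readlines() keeps each line's trailing newline, and that stripping the newline on both sides makes the comparison work.

```python
import io

# In-memory stand-ins for the two files; "file2" has no trailing newline.
file1 = io.StringIO("Man\nDog\nAxe\nCat\nPotato\nFarmer\n")
file2 = io.StringIO("Man\nDog\nAxe\nCat")

lines1 = file1.readlines()
lines2 = file2.readlines()

print(lines2)            # ['Man\n', 'Dog\n', 'Axe\n', 'Cat']
print('Cat\n' == 'Cat')  # False: this is why 'Cat' was reported as new

# Strip the newline on both sides before comparing.
stripped2 = {line.rstrip('\n') for line in lines2}
missing = [line.rstrip('\n') for line in lines1
           if line.rstrip('\n') not in stripped2]
print(missing)           # ['Potato', 'Farmer']
```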
I'd suggest normalizing your lines so they don't have newlines, replacing:
lines1 = newLines.readlines()
lines2 = originalLines.readlines()
with:
lines1 = [line.rstrip('\n') for line in newLines]
# Set comprehension makes lookup cheaper and dedupes
lines2 = {line.rstrip('\n') for line in originalLines}
and changing:
output.write(line)
to:
print(line, file=output)
which will add the newline for you. Really, the best solution is to avoid the inner loop entirely, changing all of this:
for line2 in lines2:
    if line == line2:
        duplicate = True
        break
if duplicate == False:
    output.write(line)
else:
    duplicate = False
to just:
if line not in lines2:
    print(line, file=output)
which, if you use a set for lines2 as I suggest, makes the cost of the test drop from linear in the number of lines in file2.txt to roughly constant no matter the size of file2.txt (as long as the set of unique lines can fit in memory at all).
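If you want to check the difference on your own machine, a rough sketch along these lines compares membership tests on a list and on a set. The data and the names as_list, as_set and target are made up for illustration, and the timings will vary from run to run.

```python
import timeit

# Hypothetical stand-in data: 20,000 distinct lines.
as_list = [f"line {i}" for i in range(20_000)]
as_set = set(as_list)
target = "line 19999"  # worst case for the list: the last element

# "in" on a list scans element by element; "in" on a set does a hash lookup.
list_time = timeit.timeit(lambda: target in as_list, number=200)
set_time = timeit.timeit(lambda: target in as_set, number=200)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```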
Even better, use with statements for your open files, and stream file1.txt rather than holding it in memory at all, and you end up with:
with open("file2.txt") as origlines:
    lines2 = {line.rstrip('\n') for line in origlines}
with open("file1.txt") as newlines, open("output.txt", "w") as output:
    for line in newlines:
        line = line.rstrip('\n')
        # line.strip() is falsy for empty and whitespace-only lines
        # (''.isspace() is False, so isspace() alone would let empty lines through)
        if line.strip() and line not in lines2:
            print(line, file=output)
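Since you do this a lot, it may be convenient to wrap the same logic in a function. This is only a sketch: the name write_new_lines and its parameters are my own invention, and it skips empty or whitespace-only lines in the first file.

```python
def write_new_lines(new_path, original_path, output_path):
    """Write the lines of new_path that are absent from original_path.

    Returns the number of lines written to output_path.
    """
    # Load the reference file into a set for cheap membership tests.
    with open(original_path) as original:
        seen = {line.rstrip('\n') for line in original}

    written = 0
    # Stream the new file so it never has to be held in memory.
    with open(new_path) as new, open(output_path, "w") as output:
        for line in new:
            line = line.rstrip('\n')
            # Skip empty or whitespace-only lines and lines already seen.
            if line.strip() and line not in seen:
                print(line, file=output)
                written += 1
    return written
```

With the example files from the question, write_new_lines("file1.txt", "file2.txt", "output.txt") should write Potato and Farmer and return 2.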
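You can also use numpy for a shorter solution. Note that np.setdiff1d returns the sorted unique values, so the original line order is not preserved, and np.loadtxt splits on whitespace, so this assumes each line is a single token with no spaces. It uses these numpy functions:
np.loadtxt Docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
np.setdiff1d Docs: https://docs.scipy.org/doc/numpy-1.14.5/reference/generated/numpy.setdiff1d.html
np.savetxt Docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html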
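import numpy as np
# Values in file1.txt that are not in file2.txt (returned sorted and unique)
arr = np.setdiff1d(np.loadtxt('file1.txt', dtype=str), np.loadtxt('file2.txt', dtype=str))
# Write the result, one value per line
np.savetxt('output.txt', arr, fmt='%s')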