How can I write the lines from the first text file that are not present in the second text file?
I would like to compare two text files. The first text file has lines that aren't in the second text file. I would like to copy these lines and write them to a new txt file. I would like a Python script for this, as I do this a lot and do not want to go online constantly to find these new lines. I do not need to know whether there is something in file2 that is not in file1.
I have written some code that seems to work inconsistently. I am unsure what I am doing wrong.
newLines = open("file1.txt", "r")
originalLines = open("file2.txt", "r")
output = open("output.txt", "w")
lines1 = newLines.readlines()
lines2 = originalLines.readlines()
newLines.close()
originalLines.close()
duplicate = False
for line in lines1:
    if line.isspace():
        continue
    for line2 in lines2:
        if line == line2:
            duplicate = True
            break
    if duplicate == False:
        output.write(line)
    else:
        duplicate = False
output.close()
For file1.txt:
Man
Dog
Axe
Cat
Potato
Farmer
file2.txt:
Man
Dog
Axe
Cat
The output.txt should be:
Potato
Farmer
but it is instead this:
Cat
Potato
Farmer
Any help would be much appreciated!
Based on the behavior, file2.txt doesn't end with a newline, so the contents of lines2 are ['Man\n', 'Dog\n', 'Axe\n', 'Cat']. Note the lack of a newline for 'Cat': the 'Cat\n' read from file1.txt never compares equal to it, so it is written out as if it were new.
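To see the mismatch directly, here is a small sketch that mimics the two files with in-memory strings. io.StringIO stands in for the real files, the contents are the sample data from the question, and the missing trailing newline in file2 is the assumption being illustrated:

```python
import io

# Stand-ins for the two files; file2's last line has no trailing newline (assumed).
file1 = io.StringIO("Man\nDog\nAxe\nCat\nPotato\nFarmer\n")
file2 = io.StringIO("Man\nDog\nAxe\nCat")

lines1 = file1.readlines()
lines2 = file2.readlines()

print(lines2)             # the last element is 'Cat', without '\n'
print("Cat\n" in lines2)  # False, because 'Cat\n' != 'Cat'

# Stripping the newline on both sides makes the comparison line up.
stripped2 = {line.rstrip("\n") for line in lines2}
missing = [line.rstrip("\n") for line in lines1
           if line.rstrip("\n") not in stripped2]
print(missing)
```

With the newlines stripped, only the two lines that are really new remain.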
I'd suggest normalizing your lines so they don't have newlines, replacing:
lines1 = newLines.readlines()
lines2 = originalLines.readlines()
with:
lines1 = [line.rstrip('\n') for line in newLines]
# Set comprehension makes lookup cheaper and dedupes
lines2 = {line.rstrip('\n') for line in originalLines}
and changing:
output.write(line)
to:
print(line, file=output)
which will add the newline for you.
Really, the best solution is to avoid the inner loop entirely, changing all of this:
for line2 in lines2:
    if line == line2:
        duplicate = True
        break
if duplicate == False:
    output.write(line)
else:
    duplicate = False
to just:
if line not in lines2:
    print(line, file=output)
which, if you use a set for lines2 as I suggest, makes the cost of the test drop from linear in the number of lines in file2.txt to roughly constant no matter the size of file2.txt (as long as the set of unique lines can fit in memory at all).
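As a rough illustration of that cost difference, the sketch below times the same membership test against a list and a set. The data is made up for the demo (20,000 generated strings and a probe that is absent, which is the worst case for the list), and the timings will vary by machine:

```python
import timeit

# Hypothetical data: 20,000 distinct "lines" and a probe that is not among them.
known_list = [f"line {i}" for i in range(20_000)]
known_set = set(known_list)
probe = "not present"

# Both containers give the same answer...
assert (probe in known_list) == (probe in known_set)

# ...but the list compares against every element, while the set does a hash lookup.
list_time = timeit.timeit(lambda: probe in known_list, number=200)
set_time = timeit.timeit(lambda: probe in known_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The gap grows with the size of file2.txt, since the list lookup scales with the number of lines and the set lookup does not.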
Even better, use with statements for your open files, and stream file1.txt rather than holding it in memory at all, and you end up with:
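with open("file2.txt") as origlines:
    # A set dedupes the lines and makes membership tests cheap
    lines2 = {line.rstrip('\n') for line in origlines}
with open("file1.txt") as newlines, open("output.txt", "w") as output:
    for line in newlines:
        line = line.rstrip('\n')
        # After rstrip, a blank line is '' and ''.isspace() is False,
        # so use strip() to skip empty and whitespace-only lines
        if line.strip() and line not in lines2:
            print(line, file=output)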
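Since this is something you do a lot, the same approach could be wrapped in a function. The sketch below is one way to do that; the name write_new_lines and its signature are illustrative only, and the demo writes the question's sample data to a temporary directory rather than touching real files. It checks for blank lines with line.strip(), because once the newline is stripped a blank line is the empty string, and ''.isspace() is False.

```python
import os
import tempfile


def write_new_lines(new_path, original_path, output_path):
    """Write lines of new_path that are absent from original_path to output_path.

    Returns the number of lines written. Name and signature are illustrative.
    """
    with open(original_path) as original:
        seen = {line.rstrip("\n") for line in original}
    written = 0
    with open(new_path) as new, open(output_path, "w") as output:
        for line in new:
            line = line.rstrip("\n")
            # Skip blank or whitespace-only lines and lines already in the original.
            if line.strip() and line not in seen:
                print(line, file=output)
                written += 1
    return written


# Demo with the sample data from the question, inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    f1 = os.path.join(tmp, "file1.txt")
    f2 = os.path.join(tmp, "file2.txt")
    out = os.path.join(tmp, "output.txt")
    with open(f1, "w") as f:
        f.write("Man\nDog\nAxe\nCat\nPotato\nFarmer\n")
    with open(f2, "w") as f:
        f.write("Man\nDog\nAxe\nCat")  # no trailing newline
    count = write_new_lines(f1, f2, out)
    with open(out) as f:
        result = f.read()

print(count, repr(result))
```

In real use you would call it as write_new_lines("file1.txt", "file2.txt", "output.txt").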
You can use numpy for a shorter solution. Here we are using these numpy methods:
np.loadtxt Docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
np.setdiff1d Docs: https://docs.scipy.org/doc/numpy-1.14.5/reference/generated/numpy.setdiff1d.html
np.savetxt Docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html
Note that np.setdiff1d returns the sorted unique values, so the output does not keep the order of file1.txt.
import numpy as np

# Lines in file1.txt that are not in file2.txt (sorted, duplicates removed)
arr = np.setdiff1d(np.loadtxt('file1.txt', dtype=str), np.loadtxt('file2.txt', dtype=str))
np.savetxt('output.txt', arr, fmt='%s')