
Compare two files for differences in Python

I want to compare two files (take each line from the first file and look it up in the whole second file) to see the differences between them, and append the lines of fileA.txt that are missing from fileB.txt to the end of fileB.txt. I am new to Python, so at first I thought about a simple program like this:

import difflib

file1 = "fileA.txt"
file2 = "fileB.txt"

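# Read both files fully, then print the line-by-line delta (Python 3 syntax)
with open(file1) as fa, open(file2) as fb:
    diff = difflib.ndiff(fa.readlines(), fb.readlines())
print(''.join(diff), end='')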

but as a result I get a combination of the two files, with a tag at the start of each line. I know that I can look for lines starting with the tag "-" and write them to the end of fileB.txt, but with a huge file (~100 MB) this method will be inefficient. Can somebody help me improve the program?

The file structure will be like this:

Input:

fileA.txt

Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)

fileB.txt

Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)

Output:

fileB_after.txt

Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)

Try this in bash:

cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt

sort -M: sorts by month. The key is an initial string consisting of any amount of whitespace followed by a month name abbreviation, which is folded to upper case and compared in the order 'JAN' < 'FEB' < ... < 'DEC'. Invalid names compare low to valid names. The LC_TIME locale determines the month spellings.

uniq: filters out adjacent repeated lines, which is why the input is sorted first.

|: passes the output of one command to another for further processing.

This takes the two files, sorts them as described above, keeps one copy of each line and stores the result in new_file.txt.

Note: This is not a Python solution, but you have tagged the question with linux, so I thought it might interest you. More detailed information about the commands used is in their man pages (man sort, man uniq).
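For comparison, the pipeline above can be approximated in Python. This is only a sketch under assumptions: the helper names (month_key, sort_unique, merge_files) are made up for illustration, the month table is hard-coded in English rather than taken from LC_TIME, and ties within a month are broken by plain string comparison, which may order lines differently from sort in some locales.

```python
# English month abbreviations ranked 1..12; anything else ranks 0 (lowest),
# mirroring how `sort -M` treats invalid month names.
MONTHS = {name: i for i, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

def month_key(line):
    """Rank a line by its leading month abbreviation, then by the whole line."""
    return (MONTHS.get(line.lstrip()[:3].upper(), 0), line)

def sort_unique(lines):
    """Drop duplicate lines, then order them by month_key."""
    return sorted(set(lines), key=month_key)

def merge_files(path_a, path_b, out_path):
    """Write the sorted, de-duplicated union of two text files to out_path."""
    lines = []
    for path in (path_a, path_b):
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f)
    with open(out_path, "w") as out:
        out.writelines(line + "\n" for line in sort_unique(lines))
```

Calling merge_files("fileA.txt", "fileB.txt", "new_file.txt") would play the role of the shell one-liner. As with uniq, two lines that differ only in leading whitespace are treated as different lines.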

Read in the two files and convert each one to a set.
Find the union of the two sets.
Sort the union by timestamp.
Join the sorted lines into one string separated by newlines.

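import datetime

file1 = "fileA.txt"
file2 = "fileB.txt"

def timestamp(line):
    # Parse the leading "Oct  9 13:25:31"; the log has no year, so strptime defaults to 1900
    return datetime.datetime.strptime(' '.join(line.split()[:3]), '%b %d %H:%M:%S')

# Strip each line so that copies differing only in indentation or line ending
# compare equal, and skip blank lines, which have no timestamp to parse
with open(file1) as f:
    sa = set(line.strip() for line in f if line.strip())
with open(file2) as f:
    sb = set(line.strip() for line in f if line.strip())

# Lines with the same timestamp come out in arbitrary order
print('\n'.join(sorted(sa.union(sb), key=timestamp)))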



Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
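If fileB.txt should keep its existing order and only receive the missing lines at the end, as the question describes, a set lookup avoids running a full diff. The following is a minimal sketch, not a tuned implementation: append_missing is a hypothetical helper name, lines are compared after stripping surrounding whitespace, and the whole of fileB.txt is assumed to fit in memory as a set of strings.

```python
def append_missing(path_a, path_b):
    """Append to path_b every line of path_a it lacks; return how many were added."""
    seen = set()
    last = "\n"  # stays "\n" if path_b is empty
    with open(path_b) as fb:
        for line in fb:
            seen.add(line.strip())
            last = line

    missing = []
    with open(path_a) as fa:
        for line in fa:
            key = line.strip()
            # Skip blank lines and anything already present (or already queued)
            if key and key not in seen:
                seen.add(key)
                missing.append(key)

    with open(path_b, "a") as fb:
        # Make sure the first appended line starts on its own line
        if missing and not last.endswith("\n"):
            fb.write("\n")
        fb.writelines(line + "\n" for line in missing)
    return len(missing)
```

Each membership test against the set takes roughly constant time, so the work grows with the total number of lines instead of with the cost of aligning the two sequences as difflib does. Calling append_missing("fileA.txt", "fileB.txt") on the sample input should leave fileB.txt matching fileB_after.txt above.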
