How to extract missing rows by comparing two different files in Linux?
I have two different files, and some rows are missing from one of them. I want to create a new file containing the rows that the two files do not have in common. As an example, I have the following files:
file1:
id1
id22
id3
id4
id43
id100
id433
file2:
id1
id2
id22
id3
id4
id8
id43
id100
id433
id21
I want to extract the rows that exist in file2 but not in file1:
new file:
id2
id8
id21
Any suggestions, please?
Use the comm utility (assumes bash as the shell):
comm -13 <(sort file1) <(sort file2)
Note that the input must be sorted for this to work, so your delta will be sorted, too.
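comm uses an (interleaved) 3-column layout: column 1 holds lines found only in the first file, column 2 holds lines found only in the second file, and column 3 holds lines common to both. -13 suppresses columns 1 and 3, so only the values exclusive to file2 are printed.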
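As a rough illustration of the column selection, the sketch below recreates the sample files from the question and runs comm -13 on them. The scratch directory and the .sorted file names are assumptions made for the example, and it sorts into intermediate files instead of using process substitution, so it should also work in shells other than bash.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)

# Sample data from the question.
printf '%s\n' id1 id22 id3 id4 id43 id100 id433 > "$tmpdir/file1"
printf '%s\n' id1 id2 id22 id3 id4 id8 id43 id100 id433 id21 > "$tmpdir/file2"

# sort and comm must agree on the collation order; LC_ALL=C pins it.
LC_ALL=C sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
LC_ALL=C sort "$tmpdir/file2" > "$tmpdir/file2.sorted"

# Suppress columns 1 and 3, leaving only lines unique to file2.
only_in_file2=$(LC_ALL=C comm -13 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
printf '%s\n' "$only_in_file2"
# id2
# id21
# id8

rm -rf "$tmpdir"
```

The result comes out in sorted order (id21 before id8), not in the original order of file2.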
Caveat: For lines to be recognized as common to both files, they must match exactly. Seemingly identical lines that differ in whitespace will not match (as is the case in the sample data in the question as of this writing, where the file1 lines have a trailing space).
cat -et is a command that visualizes line endings and control characters, which is helpful in diagnosing such problems.
For instance, cat -et file1 would output lines such as id1 $, where $ marks the end of the line, making it obvious that a trailing space precedes it.
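A small sketch of that diagnosis, using a throwaway file (the temporary file and its two lines are made up for the example): the first line carries a trailing space and the second does not, and cat -et makes the difference visible.

```shell
tmpfile=$(mktemp)

# First line ends with a space, second line does not.
printf 'id1 \nid1\n' > "$tmpfile"

# -e marks each line end with $, -t shows tabs as ^I.
visible=$(cat -et "$tmpfile")
printf '%s\n' "$visible"
# id1 $
# id1$

rm -f "$tmpfile"
```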
If, instead of cleaning up file1, you want to compare the files as-is, try:
comm -13 <(sed -E 's/ +$//' file1 | sort) <(sort file2)
A generalized solution that trims leading and trailing whitespace from the lines of both files:
comm -13 <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file1 | sort) \
<(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file2 | sort)
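Note: The above sed commands require either GNU or BSD sed, because they rely on the -E option (extended regular expressions), which older or strictly POSIX-only sed implementations may lack.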
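If neither GNU nor BSD sed is available, the same trimming can probably be done with basic regular expressions by splitting it into two substitutions. This is only a sketch, not part of the original answer; the trim function name is a hypothetical helper introduced for the example.

```shell
# Hypothetical helper: strip leading and trailing blanks (spaces and tabs)
# using two basic-regex substitutions, so the -E option is not needed.
trim() {
    sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//'
}

# A line padded with two leading spaces and a trailing space plus tab.
trimmed=$(printf '  id1 \t\n' | trim)
printf '[%s]\n' "$trimmed"
# [id1]
```

In the comm pipeline, each sed -E call could then be replaced by something like trim < file1 | sort.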
You can try sorting both files together, counting how often each line occurs, and selecting only the lines with a count of 1:
sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'
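One thing to keep in mind with the count-based approach: it selects every line that occurs exactly once across both inputs, so it reports lines unique to either file, and it assumes that no line is repeated within a single file and that lines contain no whitespace (awk prints only the second field). The sketch below uses two small made-up files, a and b, to show this; they are not the files from the question.

```shell
tmpdir=$(mktemp -d)

# Hypothetical inputs: id99 exists only in a, id2 exists only in b.
printf '%s\n' id1 id3 id99 > "$tmpdir/a"
printf '%s\n' id1 id2 id3 > "$tmpdir/b"

# Lines with a count of 1 come from either file, not just from b.
once=$(LC_ALL=C sort "$tmpdir/a" "$tmpdir/b" | uniq -c | awk '$1 == 1 {print $2}')
printf '%s\n' "$once"
# id2
# id99

rm -rf "$tmpdir"
```

For the sample data in the question this makes no difference, because every line of file1 also appears in file2.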