
How to extract some missing rows by comparing two different files in linux?

I have two different files, and some rows are missing from one of them. I want to create a new file containing the rows that the two files do not have in common. As an example, I have the following files:

file1:

id1 
id22 
id3 
id4 
id43 
id100 
id433 

file2:

id1
id2
id22
id3
id4
id8
id43
id100
id433
id21

I want to extract the rows that exist in file2 but not in file1:

new file:

 id2
 id8 
 id21

Any suggestions, please?

Use the comm utility (assumes bash as the shell):

comm -13 <(sort file1) <(sort file2)

Note how the input must be sorted for this to work, so your delta will be sorted, too.

comm uses an (interleaved) 3-column layout:

  • column 1: lines only in file1
  • column 2: lines only in file2
  • column 3: lines in both files

-13 suppresses columns 1 and 3, which prints only the values exclusive to file2.
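
To make the column layout concrete, here is a minimal sketch on a shortened version of the sample data. The file names a.txt and b.txt and the temporary directory are made up for the demo, LC_ALL=C is set only so that sort and comm agree on one ordering, and the input is pre-sorted into files so the script should also run in a plain sh without process substitution.

```shell
#!/bin/sh
# Use one collation order for both sort and comm (an assumption for reproducibility).
export LC_ALL=C

# Throwaway directory; a.txt and b.txt are hypothetical demo files.
tmpdir=$(mktemp -d)

# comm expects sorted input, so sort while writing the files.
printf '%s\n' id1 id22 id3 | sort > "$tmpdir/a.txt"
printf '%s\n' id1 id2 id22 id3 id8 | sort > "$tmpdir/b.txt"

# All three columns: column 2 is indented by one tab, column 3 by two tabs.
comm "$tmpdir/a.txt" "$tmpdir/b.txt"

# Suppress columns 1 and 3, leaving only the lines exclusive to b.txt.
only_in_b=$(comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$only_in_b"

rm -rf "$tmpdir"
```

The second command prints id2 and id8, the two lines that appear only in b.txt.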

Caveat: For lines to be recognized as common to both files, they must match exactly. Seemingly identical lines that differ in whitespace will not match (as is the case in the sample data in the question as of this writing, where the file1 lines have a trailing space).

cat -et is a command that visualizes line endings and control characters, which is helpful in diagnosing such problems.

For instance, cat -et file1 would output lines such as id1 $, where $ marks the end of the line; the space in front of it makes it obvious that the line has a trailing space.
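
A small sketch of that diagnosis, assuming a hypothetical two-line file sample.txt in which only the first line has a trailing space. The grep count is an extra illustration, not part of the original answer.

```shell
#!/bin/sh
# Throwaway directory; sample.txt is a hypothetical demo file.
tmpdir=$(mktemp -d)

# First line ends in a space, second line does not.
printf 'id1 \nid22\n' > "$tmpdir/sample.txt"

# -e marks each line end with $, -t shows tabs as ^I.
visible=$(cat -et "$tmpdir/sample.txt")
printf '%s\n' "$visible"

# Count the lines that end in a space or tab.
trailing=$(grep -c '[[:blank:]]$' "$tmpdir/sample.txt")
printf 'lines with trailing blanks: %s\n' "$trailing"

rm -rf "$tmpdir"
```

The cat output shows id1 $ for the first line and id22$ for the second, so the stray space stands out immediately.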


If instead of cleaning up file1 you want to compare the files as-is, try:

comm -13 <(sed -E 's/ +$//' file1 | sort) <(sort file2)

A generalized solution that trims leading and trailing whitespace from the lines of both files:

comm -13 <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file1 | sort) \
         <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file2 | sort)

Note: The above sed commands rely on the -E option (extended regular expressions), which both GNU sed and BSD sed support.
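
If a sed without -E has to be supported, the same trimming can be sketched with two basic regular expressions, one for each end of the line. The helper name trim, the temporary file names, and LC_ALL=C are assumptions made for this demo. Intermediate sorted files replace process substitution so the script is not tied to bash.

```shell
#!/bin/sh
# Use one collation order for both sort and comm (an assumption for reproducibility).
export LC_ALL=C

# Throwaway directory holding copies of the sample data from the question.
tmpdir=$(mktemp -d)

# file1 lines carry a trailing space, as in the question; file2 lines do not.
printf '%s \n' id1 id22 id3 id4 id43 id100 id433 > "$tmpdir/file1"
printf '%s\n' id1 id2 id22 id3 id4 id8 id43 id100 id433 id21 > "$tmpdir/file2"

# Hypothetical helper: strip leading and trailing blanks using basic regexes only.
trim() {
    sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//' "$1"
}

# Trim, then sort, because comm needs sorted input.
trim "$tmpdir/file1" | sort > "$tmpdir/file1.sorted"
trim "$tmpdir/file2" | sort > "$tmpdir/file2.sorted"

# Lines exclusive to file2.
missing=$(comm -13 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
printf '%s\n' "$missing"

rm -rf "$tmpdir"
```

With the sample data this prints id2, id21 and id8, in sorted rather than original order.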

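You can try sorting both files together, counting how many times each line occurs, and selecting only the lines whose count is 1. Note that this lists lines unique to either file, and it assumes that no line repeats within a file and that matching lines are identical down to the whitespace: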

sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'
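
The counting approach treats both files alike. If only the lines missing from file1 are wanted, in the order they appear in file2, one alternative is an awk lookup table. This is a sketch that assumes one whitespace-free id per line and a non-empty file1; it is not part of the original answers. Because awk splits on whitespace, the trailing spaces in file1 do not matter here.

```shell
#!/bin/sh
# Throwaway directory holding copies of the sample data from the question.
tmpdir=$(mktemp -d)

# file1 lines carry a trailing space, as in the question; file2 lines do not.
printf '%s \n' id1 id22 id3 id4 id43 id100 id433 > "$tmpdir/file1"
printf '%s\n' id1 id2 id22 id3 id4 id8 id43 id100 id433 id21 > "$tmpdir/file2"

# While reading the first file (NR == FNR), remember each id.
# While reading the second file, print the ids that were never seen.
missing=$(awk 'NR == FNR { seen[$1]; next } !($1 in seen) { print $1 }' \
    "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$missing"

rm -rf "$tmpdir"
```

This prints id2, id8 and id21, matching the order requested in the question, and needs no sorting.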


 