Compare column 1 in file 1 with column 1 in file 2, output the lines of file 2 whose column 1 does not exist in file 1

Below is my file 1 content:

123|yid|def|
456|kks|jkl|
789|mno|vsasd|

And this is my file 2 content:

123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|

I only want to compare column 1 between file 1 and file 2. Based on the files above, the output should contain only:

134|rst|uvw|

Line-by-line comparison is not the answer, since columns 2 and 3 contain different values in the two files; only column 1 contains exactly the same values in both.

How can I achieve this?

Currently I'm using this in my code:

#sort FILEs first before comparing

sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted

for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"

#for every oid in FILE 1, compare it with oid FILE 2 and output the difference

grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \  -f 2 > $FILE_1_tmp

You can do this in Awk very easily!

awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2

Awk works by processing input lines one at a time. It also provides two special clauses, BEGIN{} and END{}, which enclose actions to be run before and after the input is processed.

So the part BEGIN{FS=OFS="|"} runs before any file processing happens. FS and OFS are special variables in Awk that hold the input and output field separators. Since the file you provided is delimited by |, you need to parse it by setting FS="|", and to print it back with |, set OFS="|" as well.
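As a small illustration of what FS and OFS do, the sketch below feeds one sample line through Awk and rewrites its second field. The replacement value X and the variable name rebuilt are made up for the demo. Note that OFS only comes into play when Awk rebuilds a record after a field is assigned; the command in this answer prints lines unchanged, so it would most likely work with FS alone.

```shell
# One sample line from file 1, split on "|" (FS) and re-joined with "|" (OFS).
# Assigning to $2 forces awk to rebuild the record using OFS.
rebuilt=$(printf '%s\n' '123|yid|def|' | awk 'BEGIN{FS=OFS="|"} { $2 = "X"; print }')
echo "$rebuilt"
```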

The main part of the command comes after the BEGIN clause. The condition FNR==NR is true only while the first file argument is being read, because NR counts lines across all input files combined, while FNR restarts at 1 for each file. So for each line of the first file, $1 is stored as a key in the array called unique, and next skips to the following line. When the second file is processed, the condition !($1 in unique) prints only those lines whose $1 value is not in the array, which drops every line whose key also appears in the first file.
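To see the two passes in action, here is a minimal sketch that recreates the sample files from the question in a scratch directory and runs the same logic. The names tmpdir, seen and result are chosen for this demo only and are not part of the original command.

```shell
# Recreate the sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '123|yid|def|' '456|kks|jkl|' '789|mno|vsasd|' > "$tmpdir/file1"
printf '%s\n' '123|abc|def|' '456|ghi|jkl|' '789|mno|pqr|' '134|rst|uvw|' > "$tmpdir/file2"

# Print NR (running total) next to FNR (per-file counter) for every line,
# to show that they are equal only while the first file is being read.
awk -F'|' '{ print FILENAME, "NR=" NR, "FNR=" FNR, "key=" $1 }' "$tmpdir/file1" "$tmpdir/file2"

# Pass 1 (FNR==NR): remember each key from file1.
# Pass 2: print the file2 lines whose key was never seen.
result=$(awk -F'|' 'FNR==NR { seen[$1]; next } !($1 in seen)' "$tmpdir/file1" "$tmpdir/file2")
echo "$result"
```

With the sample data, the last command should print only the 134 line, since that is the one key in file2 that is missing from file1.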

Here is another one-liner that uses join, sort and grep:

join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
   grep -E -v '.*\|.*\|.*\|.*\|'

join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.

Since join requires input files to be sorted, we sort them.

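Finally, grep -v removes the paired lines from the output: a paired line carries the fields of both files and therefore contains at least four | characters, while an unmatched line from file2 contains only three.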
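As a possible simplification (not part of the original one-liner), join also has a -v option that prints only the unpairable lines of the given file, which would make the grep step unnecessary. The sketch below assumes the sample data from the question and uses temporary sorted files instead of process substitution so that it runs in a plain POSIX shell; tmpdir and join_result are names invented for the demo.

```shell
# Recreate the sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '123|yid|def|' '456|kks|jkl|' '789|mno|vsasd|' > "$tmpdir/file1"
printf '%s\n' '123|abc|def|' '456|ghi|jkl|' '789|mno|pqr|' '134|rst|uvw|' > "$tmpdir/file2"

# join needs both inputs sorted on the join field (column 1).
sort -t'|' -k1,1 "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort -t'|' -k1,1 "$tmpdir/file2" > "$tmpdir/file2.sorted"

# -v 2: print only the lines of the second file that have no match in the first.
join_result=$(join -t'|' -v 2 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
echo "$join_result"
```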
