
How to compare two files based on key and string matching + awk

I have two files to compare, using $1 and $4 as keys. The sample files are below:

File1.txt
ID_41088912_41091911    2999    4   BAD016,BAD036,BBD052    7
ID_73937477_73940042    2565    3   BAD016,BAD036,BAD052    7
ID_32904202_32912400    8198    4   BAD016,BAD036,BAD052    7

File2.txt 
ID_41088912_41091911    2998    4   BAD016  7
ID_73937477_73940042    2565    3   AAAD016 7
ID_32904202_32912400    8198    4   BAD036  7

Search with $1 as the key in both files. If the key matches, apply a second condition: if the string in $4 of File2 is not present in $4 of File1, remove the row from File1.

Output:
ID_41088912_41091911    2999    4   BAD016,BAD036,BBD052    7
ID_32904202_32912400    8198    4   BAD016,BAD036,BAD052    7

The second row of File1 is removed because "AAAD016" ($4 of File2) is not present in $4 of File1.

This matching can be done by populating one or more arrays with the relevant fields, indexed by record number, so that line N of file2 is compared with line N of file1. In the following script, the single entry in field four of file2 is matched, as a regular expression, against the comma-separated field four of file1, and field one is simply tested for equality.

NR == FNR {
    # Check that $4 can be used as a pattern; this check
    # can be omitted if the input is always valid.
    if ($4 !~ /^[[:alnum:]]+$/)
        exit 65; # EX_DATAERR
    a[NR] = $1;
    # Match $4 only as a whole comma-separated entry, so that
    # "BAD016" does not match inside "XBAD016".
    b[NR] = "(^|,)" $4 "(,|$)";
    next;
} $1 == a[FNR] && $4 ~ b[FNR]

The above script should be called with file2 first:

awk -f script file2 file1
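
If the two files are not guaranteed to list their IDs in the same order, a variant indexed by $1 rather than by record number may be more robust. The following is a minimal sketch under that assumption, not the script above: the array name `want` and the use of `index()` on comma-padded strings (instead of a regular expression) are choices made for illustration. Rows of File1 whose ID does not appear in File2 are dropped here, which is also an assumption, since the question does not cover that case. The sketch recreates the sample files in a temporary directory so it can be run as-is.

```shell
#!/bin/sh
# Recreate the sample data in a scratch directory (illustrative only).
dir=$(mktemp -d)
cat > "$dir/File1.txt" <<'EOF'
ID_41088912_41091911    2999    4   BAD016,BAD036,BBD052    7
ID_73937477_73940042    2565    3   BAD016,BAD036,BAD052    7
ID_32904202_32912400    8198    4   BAD016,BAD036,BAD052    7
EOF
cat > "$dir/File2.txt" <<'EOF'
ID_41088912_41091911    2998    4   BAD016  7
ID_73937477_73940042    2565    3   AAAD016 7
ID_32904202_32912400    8198    4   BAD036  7
EOF

# First pass (File2): remember the wanted entry for each ID.
# Second pass (File1): print the row only if its ID was seen and the
# wanted entry is one of the comma-separated entries in $4.
# Padding both strings with commas makes index() match whole entries only.
out=$(awk '
    NR == FNR { want[$1] = $4; next }
    ($1 in want) && index("," $4 ",", "," want[$1] ",")
' "$dir/File2.txt" "$dir/File1.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

With the sample data this prints the first and third rows of File1.txt, which is the output requested in the question. Because `index()` compares literal strings, no check that $4 is a valid pattern is needed in this variant.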

For large files, the same process can be applied while reading both files line by line using getline, without storing file2 in memory.

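BEGIN {
    if (ARGC != 3)
        exit 64; # EX_USAGE
    # Compare against > 0 so that a read error (-1) ends the loop.
    while ((getline < ARGV[1]) > 0) {
        # Check that $4 can be used as a pattern; this check
        # can be omitted if the input is always valid.
        if ($4 !~ /^[[:alnum:]]+$/)
            exit 65; # EX_DATAERR
        a = $1;
        # Match $4 only as a whole comma-separated entry.
        b = "(^|,)" $4 "(,|$)";
        # Stop if the second file runs out of lines.
        if ((getline < ARGV[2]) <= 0)
            break;
        if ($1 == a && $4 ~ b)
            print;
    }
    exit;
}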
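
Both scripts build a regular expression from $4 of file2 and match it against the comma-separated $4 of file1. As a small illustration (not part of the original answer), the sketch below contrasts a plain substring match with a pattern anchored on entry boundaries, `(^|,)` and `(,|$)`. The helper name `check` and the sample values are hypothetical.

```shell
#!/bin/sh
# Compare an unanchored substring test with a whole-entry test.
# The field values below are made up for illustration.
check() {
    # $1 = comma-separated list, $2 = entry to look for
    awk -v list="$1" -v entry="$2" 'BEGIN {
        # Pattern that only matches entry as a complete list item.
        whole = "(^|,)" entry "(,|$)"
        s = (list ~ entry) ? "yes" : "no"
        w = (list ~ whole) ? "yes" : "no"
        print "substring:" s, "whole:" w
    }'
}

r1=$(check "BAD016,BAD036" "BAD036")    # entry present as a whole item
r2=$(check "XBAD016,BAD036" "BAD016")   # only present inside another item
echo "$r1"   # substring:yes whole:yes
echo "$r2"   # substring:yes whole:no
```

The second call shows the case that the boundary anchors guard against: `BAD016` occurs inside `XBAD016`, so a substring test accepts it while the whole-entry pattern does not.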
