
Extract Information from multiple lines using AWK

I am trying to extract different details from multiple lines using awk. However, my attempt fails to run, and I also need to print the resulting output on one line.

The information is located in different blocks, and I need to extract details from inside each block.

awk '
     TRA TRB TRC
     /EKYC/{for(i=1; i<10; i++)
              {   (getline p )
                  if ( match(p,"TRA")) { print substr(p,4)}
                  if ( match(p,"TRB")) { print substr(p,4)}
                  if ( match(p,"TRC")) { print substr(p,4)}
              }
           }
    ' inputfile

Each block starts with an EKYC line, and the codes TRA, TRB and TRC appear between the EKYC lines.

A sample text file looks like this:

EKYC
TRA    onlyThisTRA1
TRB    onlyThisTRB1
THR    notThis
EKYC
TRA    onlyThisTRA2
TRB    onlyThisTRB2
TRC    onlyThisTRC2
EKYC
NOT
TRA    onlyThisTRA3
YEH    not this
TRC    onlyThisTRC3

Desired output, one line per block:

onlyThisTRA1 onlyThisTRB1 null
onlyThisTRA2 onlyThisTRB2 onlyThisTRC2
onlyThisTRA3 null         onlyThisTRC3

You can use this awk command:

awk '/EKYC/{if (n++) print tra, trb, trc; tra=trb=trc="null"; next}
$1=="TRA"{tra=$2} $1=="TRB"{trb=$2} $1=="TRC"{trc=$2}
END{if (n) print tra, trb, trc}' file

onlyThisTRA1 onlyThisTRB1 null
onlyThisTRA2 onlyThisTRB2 onlyThisTRC2
onlyThisTRA3 null onlyThisTRC3

Using awk multi-dimensional arrays:

awk '/EKYC/ { cnt++; next }
     cnt    { dat[cnt,$1] = $2 }
     END {
         n = split("TRA TRB TRC", names)
         for (i = 1; i <= cnt; i++)
             for (p = 1; p <= n; p++)
                 printf "%s%s", ((i,names[p]) in dat ? dat[i,names[p]] : "null"), (p < n ? "\t" : "\n")
     }' filename

Increment cnt each time EKYC is seen. Use the block counter and the code name in $1 as subscripts into a multi-dimensional array storing the second whitespace-delimited field. Finally, loop over the blocks and the names TRA, TRB and TRC to print the data, substituting null for any name missing from a block.
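As a side note on the dat[...] subscripts used above: awk has no true multi-dimensional arrays. Comma-separated subscripts are joined with SUBSEP into a single string key, and membership is tested with the (i, j) in array form. A minimal sketch, assuming a made-up two-block input (the variable names blk and out are illustrative only):

```shell
# Illustrative only: store values under (block, name) keys, then test
# membership with the "(i, j) in array" form.
out=$(printf 'EKYC\nTRA x1\nEKYC\nTRB y2\n' | awk '
    /EKYC/ { blk++; next }
           { dat[blk, $1] = $2 }
    END {
        if ((1, "TRA") in dat)    print "block 1 has TRA=" dat[1, "TRA"]
        if (!((1, "TRB") in dat)) print "block 1 has no TRB"
        if ((2, "TRB") in dat)    print "block 2 has TRB=" dat[2, "TRB"]
    }')
printf '%s\n' "$out"
```

Testing with in rather than comparing dat[i, j] to "" avoids creating empty elements as a side effect of the lookup.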

awk solution:

awk 'function pr(a){ 
         n="null"; tra=a["TRA"]; trb=a["TRB"]; trc=a["TRC"]; 
         printf "%s %s %s\n",(tra)? tra:n,(trb)? trb:n,(trc)? trc:n; delete a 
     }
     /EKYC/{ if(f){ pr(a); f=0 } }
     /^TR[ABC]/{ a[$1]=$2; f=1 }END{ pr(a) }' file

The output:

onlyThisTRA1 onlyThisTRB1 null
onlyThisTRA2 onlyThisTRB2 onlyThisTRC2
onlyThisTRA3 null onlyThisTRC3

Whenever you have name-to-value pairs as you have in your data, the best approach is to first create an array capturing that mapping ( n2v[] below) and then you can just reference the values by their names:

$ cat tst.awk
BEGIN  { OFS="\t" }
/EKYC/ { prt(); next }
       { n2v[$1] = $2 }
END    { prt() }
function prt() { if (length(n2v)) print v("TRA"), v("TRB"), v("TRC"); delete n2v }
function v(n)  { return (n in n2v ? n2v[n] : "null") }

$ awk -f tst.awk file
onlyThisTRA1    onlyThisTRB1    null
onlyThisTRA2    onlyThisTRB2    onlyThisTRC2
onlyThisTRA3    null    onlyThisTRC3

Notice that with the above, each name you're interested in appears exactly once, in either upper or lower case, and no variables are named after the values in your data. So if you need to print a new name (e.g. "THC"), you just add , v("THC") inside the prt() function. The default null value is also specified in only one place, so if you want a different default, or a different algorithm for determining the default, you just change the v() function.
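To make those two changes concrete, here is a minimal sketch, not the original script: it prints an extra name THC and takes the default from a hypothetical -v variable called dflt. It also assumes a made-up two-block input, and it uses a have flag instead of length(n2v), since length() on an array is not available in some older awks.

```shell
# Sketch only: dflt, have and the sample input are assumptions for illustration.
out=$(printf 'EKYC\nTRA a1\nTHC c1\nEKYC\nTRA a2\n' | awk -v dflt='-' '
    BEGIN  { OFS = "\t" }
    /EKYC/ { prt(); next }
           { n2v[$1] = $2; have = 1 }
    END    { prt() }
    # Adding a name to the output means adding one v("...") call here.
    function prt() { if (have) print v("TRA"), v("TRB"), v("THC"); delete n2v; have = 0 }
    # The default for a missing name is decided in this one place.
    function v(n)  { return (n in n2v ? n2v[n] : dflt) }
')
printf '%s\n' "$out"
```

With this input the first block prints a1, the default, and c1, and the second block prints a2 followed by the default twice, all tab-separated.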

It would actually be trivial to modify the script to accept a list of names to be printed on the command line:

$ cat tst.awk
BEGIN  { OFS="\t" }
/EKYC/ { prt(); next }
       { val=$0; sub(/^[^[:space:]]+[[:space:]]+/,"",val); n2v[$1] = val }
END    { prt() }
function prt(       nameList,nameNr,numNames) {
    if (length(n2v)) {
        numNames = split(names,nameList)
        for (nameNr=1; nameNr <= numNames; nameNr++) {
            printf "%s%s", v(nameList[nameNr]), (nameNr<numNames ? OFS : ORS)
        }
        delete n2v
    }
}
function v(n)  { return (n in n2v ? n2v[n] : "null") }

$ awk -v names='TRA TRB TRC' -f tst.awk file
onlyThisTRA1    onlyThisTRB1    null
onlyThisTRA2    onlyThisTRB2    onlyThisTRC2
onlyThisTRA3    null    onlyThisTRC3

$ awk -v names='TRA THR TRC YEH' -f tst.awk file
onlyThisTRA1    notThis null    null
onlyThisTRA2    null    onlyThisTRC2    null
onlyThisTRA3    null    onlyThisTRC3    not this

Note that in the second script I changed the way n2v[] is populated so that a value may contain spaces, since your YEH value (which I'm now printing above) has a space in it. If your values never contain spaces then that change isn't required, and if the separator is a tab then you can just set FS="\t" in the BEGIN section and again you don't need that modification.
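To illustrate the tab-separator remark, a minimal sketch assuming a tab-separated version of one input line (the sample line is made up for illustration). With FS set to a tab, $2 keeps its embedded space and no sub() call is needed:

```shell
# Sketch only: assumes the name and value are separated by a single tab.
out=$(printf 'YEH\tnot this\n' | awk 'BEGIN { FS = "\t" } { print $2 }')
printf '%s\n' "$out"
```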

