Extract Information from multiple lines using AWK

I am trying to extract different details from multiple lines using awk. However, I cannot get the tests to work, nor print the resulting output on one line.

The information is located in different blocks, and I need to extract details from inside each block.

awk '
     TRA TRB TRC
     /EKYC/{for(i=1; i<10; i++)
              {   (getline p )
                  if ( match(p,"TRA")) { print substr(p,4)}
                  if ( match(p,"TRB")) { print substr(p,4)}
                  if ( match(p,"TRC")) { print substr(p,4)}
              }
           }
    ' inputfile

Each block starts with an EKYC line, and the codes TRA, TRB and TRC are located between the EKYC lines.

The sample text file looks like this:

EKYC
TRA    onlyThisTRA1
TRB    onlyThisTRB1
THR    notThis
EKYC
TRA    onlyThisTRA2
TRB    onlyThisTRB2
TRC    onlyThisTRC2
EKYC
NOT
TRA    onlyThisTRA3
YEH    not this
TRC    onlyThisTRC3

Desired output, one line per block:

onlyThisTRA1 onlyThisTRB1 null
onlyThisTRA2 onlyThisTRB2 onlyThisTRC2
onlyThisTRA3 null         onlyThisTRC3

You can use this awk command:

awk '/EKYC/{if (tra != "") print tra, trb, trc; tra=trb=trc="null"; next}
$1=="TRA"{tra=$2} $1=="TRB"{trb=$2} $1=="TRC"{trc=$2}
END{print tra, trb, trc}' file

onlyThisTRA1 onlyThisTRB1 null
onlyThisTRA2 onlyThisTRB2 onlyThisTRC2
onlyThisTRA3 null onlyThisTRC3

Using awk multidimensional arrays:

awk '
/EKYC/       { cnt++; cnt1 = 0 }
$0 != "EKYC" {
    cnt1++
    if ($2 ~ "not") { $2 = "null" }
    dat[cnt,cnt1] = $2
}
END {
    for (i = 1; i <= cnt; i++) {
        for (p = 1; p <= cnt1; p++) { printf "%s\t", dat[i,p] }
        print ""
    }
}' filename

Increment cnt whenever EKYC is seen and reset cnt1. Use the two counters to index an array that stores the second space-delimited field of each line. Finally, loop through the multidimensional array to print the data. Note that values are stored by their position within the block rather than by code name, and the inner loop runs up to the last block's cnt1, so a given column does not always hold the same code.
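As a rough sketch of the same two-index idea, the second index can be the code name ($1) instead of a running counter, so each output column always corresponds to one code. The shell function name by_name and the file name sample.txt are made up for this illustration and are not part of the answer above.

```shell
# Recreate the sample input from the question (the file name is arbitrary).
cat > sample.txt <<'EOF'
EKYC
TRA    onlyThisTRA1
TRB    onlyThisTRB1
THR    notThis
EKYC
TRA    onlyThisTRA2
TRB    onlyThisTRB2
TRC    onlyThisTRC2
EKYC
NOT
TRA    onlyThisTRA3
YEH    not this
TRC    onlyThisTRC3
EOF

# by_name is a hypothetical helper name used only in this sketch.
by_name() {
    awk '
        /^EKYC/ { cnt++; next }            # a new block starts
        cnt     { dat[cnt, $1] = $2 }      # store the value under (block, code)
        END {
            n = split("TRA TRB TRC", want, " ")
            for (i = 1; i <= cnt; i++) {
                line = ""
                for (j = 1; j <= n; j++) {
                    key = i SUBSEP want[j]
                    val = (key in dat) ? dat[key] : "null"   # missing code -> null
                    line = (j == 1) ? val : line " " val
                }
                print line
            }
        }
    ' "$@"
}

by_name sample.txt
```

Because the lookup goes through the (block, code) key, codes that are absent from a block fall back to null without any pattern test on the value itself.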

awk solution:

awk 'function pr(a){ 
         n="null"; tra=a["TRA"]; trb=a["TRB"]; trc=a["TRC"]; 
         printf "%s %s %s\n",(tra)? tra:n,(trb)? trb:n,(trc)? trc:n; delete a 
     }
     /EKYC/{ if(f){ pr(a); f=0 } }
     /^TR[ABC]/{ a[$1]=$2; f=1 }END{ pr(a) }' file

The output:

onlyThisTRA1 onlyThisTRB1 null
onlyThisTRA2 onlyThisTRB2 onlyThisTRC2
onlyThisTRA3 null onlyThisTRC3

Whenever you have name-to-value pairs as you have in your data, the best approach is to first create an array capturing that mapping (n2v[] below) and then you can just reference the values by their names:

$ cat tst.awk
BEGIN  { OFS="\t" }
/EKYC/ { prt(); next }
       { n2v[$1] = $2 }
END    { prt() }
function prt() { if (length(n2v)) print v("TRA"), v("TRB"), v("TRC"); delete n2v }
function v(n)  { return (n in n2v ? n2v[n] : "null") }

$ awk -f tst.awk file
onlyThisTRA1    onlyThisTRB1    null
onlyThisTRA2    onlyThisTRB2    onlyThisTRC2
onlyThisTRA3    null    onlyThisTRC3

Notice that with the above, each name you're interested in appears exactly once, rather than once in upper case and again as a lower-case variable name, and no variables are named after values in your data. So if you need to print an additional name (e.g. "THC"), you just add , v("THC") inside the prt() function. The default null value is also specified in only one place, so if you want a different default, or a different algorithm for determining the default, you just change the v() function.
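To make those two changes concrete, here is a small sketch that prints one extra name and uses a different default. It uses THR as the extra name because that code exists in the sample data; the "-" default and the file names sample.txt and tst_thr.awk are arbitrary choices for this illustration. It assumes an awk whose length() accepts an array (gawk and recent mawk do).

```shell
# Recreate the sample input from the question (the file name is arbitrary).
cat > sample.txt <<'EOF'
EKYC
TRA    onlyThisTRA1
TRB    onlyThisTRB1
THR    notThis
EKYC
TRA    onlyThisTRA2
TRB    onlyThisTRB2
TRC    onlyThisTRC2
EKYC
NOT
TRA    onlyThisTRA3
YEH    not this
TRC    onlyThisTRC3
EOF

# tst_thr.awk is a hypothetical copy of the script above with two one-line changes.
cat > tst_thr.awk <<'EOF'
BEGIN  { OFS="\t" }
/EKYC/ { prt(); next }
       { n2v[$1] = $2 }
END    { prt() }
# Change 1: one extra v("THR") argument in the print statement.
function prt() { if (length(n2v)) print v("TRA"), v("TRB"), v("TRC"), v("THR"); delete n2v }
# Change 2: the default is "-" instead of "null".
function v(n)  { return (n in n2v ? n2v[n] : "-") }
EOF

awk -f tst_thr.awk sample.txt
```

Nothing else in the script needs to change: the rule that fills n2v[] already captures every name in the block, whether or not it is printed.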

It would actually be trivial to modify the script to accept, on the command line, a list of names to be printed:

$ cat tst.awk
BEGIN  { OFS="\t" }
/EKYC/ { prt(); next }
       { val=$0; sub(/^[^[:space:]]+[[:space:]]+/,"",val); n2v[$1] = val }
END    { prt() }
function prt(       nameList,nameNr,numNames) {
    if (length(n2v)) {
        numNames = split(names,nameList)
        for (nameNr=1; nameNr <= numNames; nameNr++) {
            printf "%s%s", v(nameList[nameNr]), (nameNr<numNames ? OFS : ORS)
        }
        delete n2v
    }
}
function v(n)  { return (n in n2v ? n2v[n] : "null") }

$ awk -v names='TRA TRB TRC' -f tst.awk file
onlyThisTRA1    onlyThisTRB1    null
onlyThisTRA2    onlyThisTRB2    onlyThisTRC2
onlyThisTRA3    null    onlyThisTRC3

$ awk -v names='TRA THR TRC YEH' -f tst.awk file
onlyThisTRA1    notThis null    null
onlyThisTRA2    null    onlyThisTRC2    null
onlyThisTRA3    null    onlyThisTRC3    not this

Note that I modified the way n2v[] is populated in that second script to allow spaces in the values, since your YEH value (which I'm now printing above) has a space in it. If there are no spaces, that change isn't required; and if the separator is a tab, you can just set FS="\t" in the BEGIN section, and again you don't need that modification.
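For the tab-separated case, a sketch could look like the following. The input file sample_tabs.txt is a made-up tab-delimited version of the last block of the sample, and tst_tabs.awk is a hypothetical file name. One detail to watch: once FS is a tab, split() without a third argument would also split names on tabs, so this sketch passes " " explicitly. It again assumes an awk whose length() accepts an array.

```shell
# Hypothetical tab-separated input: name, a tab, then a value that may contain spaces.
printf 'EKYC\nTRA\tonlyThisTRA3\nYEH\tnot this\nTRC\tonlyThisTRC3\n' > sample_tabs.txt

cat > tst_tabs.awk <<'EOF'
BEGIN  { FS=OFS="\t" }
/EKYC/ { prt(); next }
       { n2v[$1] = $2 }    # $2 is everything after the tab, spaces included
END    { prt() }
function prt(       nameList,nameNr,numNames) {
    if (length(n2v)) {
        # names is space-separated, so do not rely on FS (now a tab) here
        numNames = split(names,nameList," ")
        for (nameNr=1; nameNr <= numNames; nameNr++) {
            printf "%s%s", v(nameList[nameNr]), (nameNr<numNames ? OFS : ORS)
        }
        delete n2v
    }
}
function v(n)  { return (n in n2v ? n2v[n] : "null") }
EOF

awk -v names='TRA YEH TRC' -f tst_tabs.awk sample_tabs.txt
```

With a tab as the field separator, the sub() call that strips the name from $0 is no longer needed, because $2 already holds the whole value.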
