简体   繁体   中英

awk: how to extract 2 patterns from a single line and then concatenate them?

I want to find 2 patterns in each line and then print them with a dash between them as a separator. Here is a sample of lines:

20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
20200323: #5358 BULL_SPX_X10_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205556, IR=NRB, LN=BULL SPX X10 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193132, SG=250, SN=193132, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X10_NORDNET_D2, TY=W, UQ=1}
20200323: #5359 BULL_SPX_X12_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205630, IR=NRB, LN=BULL SPX X12 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193131, SG=250, SN=193131, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X12_NORDNET_D2, TY=W, UQ=1}

Given the above lines, my desired output after running a script should look like this:

BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630

The first alphanumeric value (eg BULL_SPX_X12_NORDNET_D2) is always in the 3rd position of a line. The second alphanumeric value (eg DK0061205630) can be at various positions but it's always preceded by "II=" and is always exactly 12 characters length.

I tried to implement my task with the following script:

 13 regex='II=.\{12\}'
 14 while IFS="" read -r line; do
 15     matchedString=`grep -o $regex littletest.txt | tr -d 'II=,'`
 16     awk /II=/'{print $3, " - ", $matchedString}' littletest.txt > temp.txt
 17 done <littletest.txt

My thought process and intentions/assumptions:

Line 13 defines a regex pattern to match the alphanumeric string preceded with "II="

In line 15 variable "matchedString" gets assigned a value that is extracted from a line via regex, with the preceding "II=" being deleted.

Line 16 uses awk expression to to detect all lines that contain "II=" and then print the third string that is found on every input file's line and also print the value of matched string pattern that was defined in the previous line of the script. So I expect that at this point a pair of extracted patterns (eg BEAR_SPX_X15_NORDNET_D1 - DK0061205473) should be transfered to temp.txt file.

Line 17 is taking an input file for a script to consume.

However, after running the script I did not get the desired output. Here is a sample of what I got:

BEAR_SPX_X15_NORDNET_D1
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}

How could I achieve my desired output that I described earlier?

$ awk -v OFS=' - ' 'match($0,/II=/){print $3, substr($0,RSTART+3,12)}' file
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630

只是在尝试 awk。

awk  'BEGIN{ FS="[II=, ]+" ; OFS=" - " } {print $3, $8}' file.txt

Using gawk (gnu awk) that supports regex as Field Seperator (FS) , and considering that each line in your file has exactly the same format / same number of fields, this works fine in my tests:

awk '{print $3,$9}' FS="[ ]|II=" OFS=" - " file1
#or FS="[[:space:]]+|II=|[,]" if you might have more than one space between fields

Results

BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630

Since the II= part could be anywhere, this trick could also work with a penalty of parsing the file twice:

paste -d "-" <(awk '{print $3}' file1) <(awk '/II/{print $2}' RS="[ ]" FS="=|," file1)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM