
awk: how to extract 2 patterns from a single line and then concatenate them?

I want to find 2 patterns in each line and then print them with a dash between them as a separator. Here is a sample of lines:

20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
20200323: #5358 BULL_SPX_X10_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205556, IR=NRB, LN=BULL SPX X10 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193132, SG=250, SN=193132, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X10_NORDNET_D2, TY=W, UQ=1}
20200323: #5359 BULL_SPX_X12_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205630, IR=NRB, LN=BULL SPX X12 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193131, SG=250, SN=193131, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X12_NORDNET_D2, TY=W, UQ=1}

Given the above lines, my desired output after running a script should look like this:

BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630

The first alphanumeric value (e.g. BULL_SPX_X12_NORDNET_D2) is always in the 3rd position of a line. The second alphanumeric value (e.g. DK0061205630) can be at various positions, but it is always preceded by "II=" and is always exactly 12 characters long.

I tried to implement my task with the following script:

 13 regex='II=.\{12\}'
 14 while IFS="" read -r line; do
 15     matchedString=`grep -o $regex littletest.txt | tr -d 'II=,'`
 16     awk /II=/'{print $3, " - ", $matchedString}' littletest.txt > temp.txt
 17 done <littletest.txt

My thought process and intentions/assumptions:

Line 13 defines a regex pattern to match the alphanumeric string preceded by "II=".

In line 15, the variable "matchedString" is assigned a value extracted from a line via the regex, with the preceding "II=" deleted.

Line 16 uses an awk expression to detect all lines that contain "II=", then prints the third string found on each line of the input file along with the value of the matched string that was assigned in the previous line of the script. So I expect that at this point a pair of extracted patterns (e.g. BEAR_SPX_X15_NORDNET_D1 - DK0061205473) should be written to the temp.txt file.

Line 17 supplies the input file for the script to consume.

However, after running the script I did not get the desired output. Here is a sample of what I got:

BEAR_SPX_X15_NORDNET_D1
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}

How can I achieve the desired output that I described earlier?
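A note on why the whole line shows up in the output, as a minimal sketch: inside single quotes the shell never expands `$matchedString`, so awk sees an awk variable named `matchedString` that is unset, and `$matchedString` then evaluates to `$0`, the entire line. Separately, `tr -d 'II=,'` deletes every `I`, `=` and `,` character rather than the literal prefix "II=". The toy input `a b c` and the awk variable name `m` below are made up for illustration.

```shell
# Stand-in value; in the real script this would come from grep.
matchedString='DK0061205473'

# Single quotes hide $matchedString from the shell. awk treats
# matchedString as an unset awk variable, so $matchedString is $0.
wrong=$(echo 'a b c' | awk '{print $3, $matchedString}')

# Passing the shell value in with -v makes it a real awk variable.
right=$(echo 'a b c' | awk -v m="$matchedString" '{print $3, m}')

printf '%s\n' "$wrong" "$right"
```

Even with `-v`, the loop would still run awk over the whole file once per input line, so a single awk pass is likely the simpler route.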

$ awk -v OFS=' - ' 'match($0,/II=/){print $3, substr($0,RSTART+3,12)}' file
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
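The one-liner above relies on `match()` setting `RSTART` to the position where "II=" begins, then takes the 12 characters after the 3-character prefix. A commented variant is sketched below; it uses `RLENGTH` instead of a hard-coded 12, under the assumption that the value never contains a comma or a closing brace. The function name `extract_pairs` is hypothetical.

```shell
# Hypothetical helper: reads lines on stdin, prints "<name> - <II value>".
extract_pairs() {
    awk -v OFS=' - ' '
        # match() sets RSTART (start index) and RLENGTH (match length).
        # The pattern covers "II=" plus everything up to the next "," or "}".
        match($0, /II=[^,}]+/) {
            # Skip the 3 characters of "II=" and keep the rest of the match.
            print $3, substr($0, RSTART + 3, RLENGTH - 3)
        }
    '
}
```

It could be run as `extract_pairs < littletest.txt`. Lines without "II=" are skipped, because `match()` returns 0 for them.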

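Just experimenting with awk. Note that `[II=, ]+` is a bracket expression, so it splits on any run of `I`, `=`, `,` or space characters; this works for the sample lines, but a name containing the letter `I` would be split as well.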

awk  'BEGIN{ FS="[II=, ]+" ; OFS=" - " } {print $3, $8}' file.txt

Using gawk (GNU awk), which supports a regex as the field separator (FS), and assuming that each line in your file has exactly the same format and the same number of fields, this works fine in my tests:

awk '{print $3,$9}' FS="[[:space:]]+|II=|[,]" OFS=" - " file1
# FS splits on runs of whitespace, on "II=" and on ","; adjacent separators
# (", " and " II=") yield empty fields, so the II value is field 9 in the sample lines.

Results

BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
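To see why the index is 9, it can help to print every field with its number. The sketch below does that for a shortened copy of the first sample line, using the separator set `[[:space:]]+|II=|[,]`; the helper name `show_fields` is made up, and an awk with POSIX character classes (such as gawk) is assumed.

```shell
# Hypothetical helper: prints each field as "<index>:[<value>]".
show_fields() {
    awk -F'[[:space:]]+|II=|[,]' '{
        for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i
    }'
}

# Shortened copy of the first sample line.
line='20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB}'
printf '%s\n' "$line" | show_fields
```

Fields 5, 7 and 8 come out empty because a comma, a space and "II=" follow each other directly, which is what pushes the II value to field 9. If "II=" moved to a different position within the braces, the index would change with it.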

Since the II= part could be anywhere, this trick could also work, at the cost of parsing the file twice (note that paste -d "-" joins the two columns with a bare "-", without the surrounding spaces):

paste -d "-" <(awk '{print $3}' file1) <(awk '/II/{print $2}' RS="[ ]" FS="=|," file1)
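If reading the file twice is a concern, a single-pass alternative is to split the `{...}` block into key/value pairs and look up `II` by name, wherever it sits. This is only a sketch: it assumes pairs are separated by a comma followed by a space and that no value contains that sequence (the quoted `TK` value in the sample has commas but no comma-space). The function name `print_name_and_ii` and the array names are hypothetical.

```shell
# Hypothetical helper: reads lines on stdin, prints "<name> - <II value>".
print_name_and_ii() {
    awk -v OFS=' - ' '{
        body = $0
        sub(/^[^{]*[{]/, "", body)    # drop everything up to and including "{"
        sub(/[}][^}]*$/, "", body)    # drop the last "}" and anything after it

        split("", kv)                 # portable way to clear the array
        n = split(body, pairs, ", ")  # one "KEY=value" string per element
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=") # position of the first "="
            if (eq) kv[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
        }

        if ("II" in kv) print $3, kv["II"]
    }'
}
```

It could be run as `print_name_and_ii < file1`. The same `kv` array would give access to any other key, such as `kv["CU"]`, without changing the parsing.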
