
AWK print all regex matches on every line

I have the following text input:

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

As seen in the text, the number of appearances of <?> is not fixed; it can appear zero or more times on the same line.

Using only awk, I need to output this:

<a> <b> <c>
<d> <e>
<f>

I tried this awk script:

awk '{
  match($0,/<[^>]+>/,a);           # fill array a with matches (3-argument match() is gawk-only)
  for (i in a) {
    if (match(i, /^[0-9]+$/) != 0) # ignore non numeric indices
      print a[i]
  }
}' somefile.txt

but this only outputs the first match on every line:

<a>
<d>
<f>

Is there some way of doing this with match() or any other built-in function?

With GNU awk you could use its built-in FPAT variable and try the following awk code.

awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
'  Input_file

Assuming there are no stray angle brackets, use either < or > as a field separator and print every second field:

awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
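
To see why the "no stray angle brackets" assumption matters, here is a small sketch that wraps the same one-liner in a shell function and feeds it a line containing a lone "<". The function name every_second_field and the sample line are made up for illustration only.

```shell
# Hypothetical wrapper around the one-liner above; reads stdin or named files.
every_second_field() {
  awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' "$@"
}

# The lone "<" shifts the field numbering, so "x" ends up in an
# odd-numbered field and is never printed.
printf '%s\n' 'if a < b then <x>' | every_second_field
```

On this input the line splits into four fields ("if a ", " b then ", "x" and an empty trailing field), so the even-numbered fields yield `< b then > <> ` instead of `<x>`.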

match() doesn't work the way you think it does: it finds a single match (with gawk's three-argument form, the array receives that one match and its parenthesized subexpressions, not every match on the line). To find a variable number of matches you need to match() the first occurrence, strip everything up to and including that match from the input, then match() again on the remainder, and repeat until there are no more matches in the current line, e.g.:

awk '
{ out=sep=""                                     # init variables for new line
  while (match($0,/<[^>]+>/)) {                  # find 1st match
        out=out sep substr($0,RSTART,RLENGTH)    # build up output line
        $0=substr($0,RSTART+RLENGTH)             # strip off 1st match and prep for next while() check
        sep=OFS                                  # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt
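
If the rest of the script still needs the original record, the same loop can run on a copy instead of overwriting $0. The following is a minimal sketch of that idea using only POSIX awk features; the shell function extract_tags and the awk function tags are names chosen here for illustration, not part of the answer above.

```shell
# Hypothetical wrapper; reads stdin or the files given as arguments.
extract_tags() {
  awk '
  function tags(s,    out, sep) {            # out and sep are local variables
    while (match(s, /<[^>]+>/)) {            # find the next match in what is left
      out = out sep substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)        # continue after the match just consumed
      sep = OFS
    }
    return out
  }
  { t = tags($0); if (t != "") print t }     # $0 itself is left untouched
  ' "$@"
}

printf '%s\n' 'lorem <a> ipsum <b> dolor <c> sit amet,' 'incididunt ut' | extract_tags
```

Because s is a function parameter, the assignment inside the loop only changes the local copy, so fields and $0 remain available after the call.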

Another idea uses the split() function, e.g.:

awk '
{ n=split($0,a,/[<>]/)                           # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {                         # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"                   # build output line
      sep=OFS 
  }
  if (out) print out
}
' somefile.txt

Both of these generate:

<a> <b> <c>
<d> <e>
<f>

Here is a simple gnu-awk alternative solution using split:

awk '
n = split($0, _, /<[^>]+>/, m) - 1 {
   for (i=1; i<=n; ++i)
      printf "%s", m[i] (i < n ? OFS : ORS)
}' file

<a> <b> <c>
<d> <e>
<f>

Here's a simple awk solution based on regexps:

awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'

edit: using NF instead of $0 != ""; thanks @EdMorton

For each line:

  • strip all chars from the left up to the first < (excluded), or up to the end-of-line when < isn't found.
  • strip all chars from the right up to the first > (excluded), or up to the start-of-line when > isn't found.
  • replace what's between each > and < pair with a space character.
  • print the result when it isn't empty.

Example:
lorem <a a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>
Output:
<a a> <b> <c>
<d> <e>
<f>
<g> <h> <i>
<j>
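
The steps listed above can be traced one at a time. This is only a sketch: it splits the first gsub into two separate sub calls (an assumption made here for readability, since each anchored pattern can match at most once per line), and show_steps is a made-up function name.

```shell
# Hypothetical helper that prints the line after each step.
show_steps() {
  awk '{
    print "input : " $0
    sub(/^[^<]*/, "")        # strip text before the first "<" (whole line if none)
    print "step 1: " $0
    sub(/[^>]*$/, "")        # strip text after the last ">" (whole line if none)
    print "step 2: " $0
    gsub(/>[^<]*</, "> <")   # replace text between ">" and the next "<" with one space
    print "step 3: " $0
  }' "$@"
}

printf '%s\n' 'lorem <a a> ipsum <b> dolor <c> sit amet,' | show_steps
```

A line such as h>ell<o ends up empty after step 2, which is why the one-liner needs the trailing NF test to suppress it.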

I would use GNU AWK for this task in the following way. Let the content of file.txt be

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

then

awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt

gives the output

<a> <b> <c>
<d> <e>
<f>

Explanation: I tell GNU AWK that a field is < followed by zero or more (*) characters that are not > ([^>]), followed by >. For each line I do $1=$1 to force the record to be rebuilt, so the line becomes the found fields joined by a space, which I then print.

(tested in gawk 4.2.1)

INPUT

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

CODE

 mawk 'gsub(">[^<>]*<", "> <", $.(NF = NF))^_*/./' \
      FS='<[0-9]+>|^([^<]+|[^<>]*$)|[^>]+$' OFS=

OUTPUT

<a> <b> <c>
<d> <e>
<f>

Another option is to use gnu awk with gensub. You can capture the angle-bracketed part, with optional surrounding spaces, and match the rest.

In the replacement, use group 1 surrounded by a single space on each side.

awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file

Output

<a> <b> <c>
<d> <e>
<f>
