AWK: print all regex matches on every line

Question

I have the following text input:

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

As the text shows, the <?> tokens are not fixed and can appear zero or more times on a single line.

Using only awk, I need to output this:

<a> <b> <c>
<d> <e>
<f>

I tried this awk script:

awk '{
  match($0,/<[^>]+>/,a);           # fill array a with matches
  for (i in a) {
    if (match(i, /^[0-9]+$/) != 0) # ignore non numeric indices
      print a[i]
  }
}' somefile.txt

But this only outputs the first match on each line:

<a>
<d>
<f>

Is there a way to do this with match() or any other built-in function?

Answer 1

With GNU awk you can use its built-in (out-of-the-box) variable FPAT. Try the following awk code.

awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
'  Input_file

Answer 2

Assuming there are no stray angle brackets, use < or > as the field separator and print every other field:

awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
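To see why the "no stray angle brackets" assumption matters, here is a minimal sketch that runs the same command on a well-formed line and on a line with a lone >. The wrapper name every_other_field and the second sample line are made up for illustration.

```shell
# Hypothetical wrapper around the command above, for illustration only.
every_other_field() {
  awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}'
}

# Well-formed line: fields 2, 4 and 6 hold the tag contents.
printf 'lorem <a> ipsum <b> dolor <c> sit amet,\n' | every_other_field

# A stray ">" shifts the field numbering, so the wrong text gets wrapped.
printf 'a > b <c>\n' | every_other_field
```

The second call wraps " b " and an empty trailing field instead of c. Note also that the command leaves a trailing space on every line and prints an empty line for input lines without any tag.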

Answer 3

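match() doesn't work the way you think. To find a variable number of matches you need to match() the first occurrence, strip it off, then match() against the rest of the input, and repeat until there are no more matches in the current line. For example: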

awk '
{ out=sep=""                                     # init variables for new line
  while (match($0,/<[^>]+>/)) {                  # find 1st match
        out=out sep substr($0,RSTART,RLENGTH)    # build up output line
        $0=substr($0,RSTART+RLENGTH)             # strip off 1st match and prep for next while() check
        sep=OFS                                  # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt

Another idea uses the split() function, for example:

awk '
{ n=split($0,a,/[<>]/)                           # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {                         # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"                   # build output line
      sep=OFS 
  }
  if (out) print out
}
' somefile.txt

Both of these generate:

<a> <b> <c>
<d> <e>
<f>
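If the original line is still needed afterwards, the match-and-strip loop can work on a copy instead of overwriting $0. The following is only a sketch of that variation: the names all_matches and matches are hypothetical, and it relies solely on POSIX awk features (match, substr, RSTART, RLENGTH).

```shell
# Hypothetical helper name; reads files given as arguments, or stdin.
all_matches() {
  awk '
  # Return every match of re in str, joined by OFS (empty string if none).
  # rest, out and sep are local variables, declared as extra parameters.
  function matches(str, re,    rest, out, sep) {
    rest = str
    out = sep = ""
    while (match(rest, re)) {
      out = out sep substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)   # continue after this match
      sep = OFS
    }
    return out
  }
  {
    found = matches($0, "<[^>]+>")
    if (found != "") print found              # $0 is left untouched
  }' "$@"
}

printf 'lorem <a> ipsum <b> dolor <c> sit amet,\nincididunt ut\n' | all_matches
```

The loop assumes the regex cannot match the empty string. With a pattern that can, rest would never shrink and the loop would not terminate.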

Answer 4

Here is a simple alternative gnu-awk solution using split:

awk '
n = split($0, _, /<[^>]+>/, m) - 1 {
   for (i=1; i<=n; ++i)
      printf "%s", m[i] (i < n ? OFS : ORS)
}' file

<a> <b> <c>
<d> <e>
<f>

Answer 5

Here is a simple regex-based awk solution:

awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'

(Edit: using NF instead of $0 != "". Thanks @EdMorton.)

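For each line:

Strip all characters from the left up to the first < (excluded), or up to the end of the line when no < is found.
Strip all characters from the right back to the last > (excluded), or back to the start of the line when no > is found.
Replace whatever lies between each > and the following < with a single space character.
Print the result when it is not empty.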

Example

lorem <a a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>

output

<a a> <b> <c>
<d> <e>
<f>
<g> <h> <i>
<j>
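To make the two substitutions easier to follow, here is a small sketch that prints the line after each gsub() call. The helper name trace_steps and the bracketed output format are made up for illustration.

```shell
# Hypothetical helper that shows the record after each gsub() step.
trace_steps() {
  awk '{
    print "input : [" $0 "]"
    gsub(/^[^<]*|[^>]*$/, "")      # trim before the first < and after the last >
    print "step 1: [" $0 "]"
    gsub(/>[^<]*</, "> <")         # squeeze text between tags to one space
    print "step 2: [" $0 "]"
  }'
}

printf 'lorem <a a> ipsum <b> dolor <c> sit amet,\n' | trace_steps
printf 'h>ell<o\n' | trace_steps
```

For the first line, step 1 leaves "<a a> ipsum <b> dolor <c>" and step 2 leaves "<a a> <b> <c>". For h>ell<o, step 1 already empties the line, which is why the final NF test suppresses it.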

Answer 6

I would use GNU AWK for this task as follows. Let the content of file.txt be

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

then

awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt

gives the output

<a> <b> <c>
<d> <e>
<f>

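Explanation: I tell GNU AWK that a field is a < followed by zero or more (*) characters that are not (^) > followed by a >. For each line I do $1=$1 to trigger a rebuild of the record, so the line now consists of the fields that were found, joined by spaces, and then I print it.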

(tested in gawk 4.2.1)
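The $1=$1 rebuild is not specific to FPAT. As a rough sketch in plain POSIX awk with the default field splitting, assigning to any field makes awk re-join the fields with OFS. The helper name rebuild and the sample input are made up for illustration.

```shell
# Hypothetical demo: the first argument is used as OFS.
rebuild() {
  awk -v OFS="$1" '{ $1 = $1; print }'
}

printf 'a   b \t c\n' | rebuild ' '    # fields re-joined with single spaces
printf 'a   b \t c\n' | rebuild '-'    # fields re-joined with "-"
```

The first call prints "a b c" and the second "a-b-c". With FPAT the same mechanism applies, except that the fields are the matched tags rather than whitespace-separated words.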

Answer 7

Input

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

Code

 mawk 'gsub(">[^<>]*<", "> <", $.(NF = NF))^_*/./' \
      FS='<[0-9]+>|^([^<]+|[^<>]*$)|[^>]+$' OFS=

Output

<a> <b> <c>
<d> <e>
<f>

Answer 8

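Another option is to use gnu awk with gensub. You can capture each angle-bracket token together with optional surrounding spaces, and match the rest of the text as an alternative.

In the replacement, use group 1 surrounded by a single space on each side.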

awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file

Output

<a> <b> <c>
<d> <e>
<f>
