AWK print all regex matches on every line
I have the following text input:
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
As seen in the text, the number of <?> occurrences is not fixed: they can appear zero or more times on the same line.
Using only awk, I need to output this:
<a> <b> <c>
<d> <e>
<f>
I tried this awk script:
awk '{
    match($0,/<[^>]+>/,a);                 # fill array a with matches
    for (i in a) {
        if (match(i, /^[0-9]+$/) != 0)     # ignore non numeric indices
            print a[i]
    }
}' somefile.txt
but this only outputs the first match on every line:
<a>
<d>
<f>
Is there some way of doing this with match() or any other built-in function?
With GNU awk you could use its built-in variable FPAT; try the following awk code.
awk -v FPAT='<[^>]*>' '
NF{
    val=""
    for(i=1;i<=NF;i++){
        val=(val?val OFS:"") $i
    }
    print val
}
' Input_file
Assuming there are no stray angle brackets, use either < or > as a field separator and print every second field. Note that, as written, this leaves a trailing space after the last match and prints an empty line for input lines that contain no match:
awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
match() doesn't work the way you think it does; to find a variable number of matches you would need to first match() the first pattern, strip off that pattern, then match() the remainder of the input for the next pattern, and repeat until there are no more matches in the current line; e.g.:
awk '
{ out=sep="" # init variables for new line
while (match($0,/<[^>]+>/)) { # find 1st match
out=out sep substr($0,RSTART,RLENGTH) # build up output line
$0=substr($0,RSTART+RLENGTH) # strip off 1st match and prep for next while() check
sep=OFS # set field separator for follow-on matches
}
if (out) print out
}' somefile.txt
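For reference, here is a minimal sketch of what a single match() call reports, which is why the loop above has to keep trimming the input. It uses only POSIX awk features; the wrapper name first_match is made up for illustration.

```shell
# first_match is a hypothetical wrapper name, used only for this illustration.
first_match() {
  awk '{
      pos = match($0, /<[^>]+>/)   # position of the leftmost match, 0 if none
      # RSTART and RLENGTH describe that one match only; later matches are not reported
      print pos, RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
  }'
}

printf '%s\n' 'lorem <a> ipsum <b>' | first_match
# prints: 7 7 3 <a>
```

The second tag, <b>, never shows up: match() stops at the leftmost match, so finding the rest means calling it again on what follows RSTART+RLENGTH.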
Another idea uses the split() function, e.g.:
awk '
{ n=split($0,a,/[<>]/) # split line on dual delimiters "<" and ">"
out=sep=""
for (i=2;i<=n;i=i+2) { # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
out=out sep "<" a[i] ">" # build output line
sep=OFS
}
if (out) print out
}
' somefile.txt
Both of these generate:
<a> <b> <c>
<d> <e>
<f>
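Both scripts assume well-formed input: the split() version says so in its comment, and the match() loop with <[^>]+> would also swallow a stray "<" into the following tag. As a rough sketch (not part of the original answer), the match() loop can be moved into a user-defined function that works on a copy of the line, so $0 stays intact, and given a stricter pattern that forbids "<" inside a tag. The names extract_tags and all_matches are hypothetical.

```shell
# extract_tags and all_matches are hypothetical names, not awk built-ins.
extract_tags() {
  awk '
  # Collect every match of re in str, joined by OFS, without modifying $0.
  function all_matches(str, re,    out, sep) {
      out = ""; sep = ""
      while (match(str, re)) {
          out = out sep substr(str, RSTART, RLENGTH)   # append the current match
          str = substr(str, RSTART + RLENGTH)          # continue after it
          sep = OFS
      }
      return out
  }
  {
      tags = all_matches($0, "<[^<>]+>")   # stricter than <[^>]+>: no "<" inside a tag
      if (tags != "") print tags
  }' "$@"
}

printf '%s\n' 'a < b <x> c > d <y>' | extract_tags
# prints: <x> <y>
```

Whether a lone "<" or ">" should be skipped like this depends on the data; the stricter pattern is only one possible choice.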
Here is a simple gnu-awk alternative solution using split:
awk '
n = split($0, _, /<[^>]+>/, m) - 1 {
for (i=1; i<=n; ++i)
printf "%s", m[i] (i < n ? OFS : ORS)
}' file
<a> <b> <c>
<d> <e>
<f>
Here's a simple awk solution based on regexps:
awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'
edit: using NF instead of $0 != ""; thanks @EdMorton
For each line:
1. Strip all characters from the left up to the first < (excluded), or up to the end-of-line when < isn't found.
2. Strip all characters from the right up to the first > (excluded), or up to the start-of-line when > isn't found.
3. Replace whatever is between each > and < pair with a space character.
Input:
lorem <a a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>
Output:
<a a> <b> <c>
<d> <e>
<f>
<g> <h> <i>
<j>
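The per-line steps described in this answer can also be written out one statement per step, which may be easier to follow than the one-liner. This is a sketch that should behave the same way on the sample input; the function name strip_to_tags is made up for illustration.

```shell
# strip_to_tags is a hypothetical wrapper name, used only for this illustration.
strip_to_tags() {
  awk '{
      sub(/^[^<]*/, "")         # strip everything before the first "<" (whole line if none)
      sub(/[^>]*$/, "")         # strip everything after the last remaining ">"
      gsub(/>[^<]*</, "> <")    # collapse the text between each ">" and "<" to one space
      if (NF) print             # skip lines that ended up empty
  }' "$@"
}

printf '%s\n' '<g>incididunt ut<h><i>' 'h>ell<o' | strip_to_tags
# prints: <g> <h> <i>
```

The line h>ell<o produces no output: the first sub() leaves "<o", which contains no ">", so the second sub() removes it entirely.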
I would harness GNU AWK for this task in the following way. Let the content of file.txt be
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
then
awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt
gives output
<a> <b> <c>
<d> <e>
<f>
Explanation: I inform GNU AWK that a field is < followed by zero or more (*) non-(^) > characters, followed by >. For each line I do $1=$1 to force a rebuild of the record, so the line now consists of the found fields joined by spaces, which I then print.
(tested in gawk 4.2.1)
INPUT
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
CODE
mawk 'gsub(">[^<>]*<", "> <", $.(NF = NF))^_*/./' \
     FS='<[0-9]+>|^([^<]+|[^<>]*$)|[^>]+$' OFS=
OUTPUT
<a> <b> <c>
<d> <e>
<f>
Another option is to use gnu awk with gensub. You can capture each angle-bracket tag together with optional surrounding spaces, and match everything else with the second alternative.
In the replacement, use group 1 surrounded by a single space on each side.
awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file
Output
<a> <b> <c>
<d> <e>
<f>