sed one-liner - Find delimiter pair surrounding keyword
I typically work with large XML files, and generally do word counts via grep to confirm certain statistics.
For example, I want to make sure I have at least five instances of widget in a single XML file via:
cat test.xml | grep -ic widget
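One caveat worth noting: grep -c counts matching lines, not individual occurrences, so a line containing widget twice is counted once. A minimal sketch of both counts follows; the function names are made up for illustration, and -o is assumed to be available (it is in GNU and BSD grep, but POSIX does not require it):

```shell
# count_lines: number of input lines containing the word (what grep -c reports)
count_lines() { grep -ic -- "$1"; }

# count_occurrences: -o prints each match on its own line, wc -l counts them;
# tr strips the padding some wc implementations add
count_occurrences() { grep -io -- "$1" | wc -l | tr -d ' '; }
```

Both read standard input, e.g. count_occurrences widget < test.xml.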
Additionally, I like to be able to log the lines that widget appears on, i.e.:
cat test.xml | grep -i widget > ~/log.txt
However, the key information I really need is the block of XML code that widget appears in. An example file may look like:
<test> blah blah
    blah blah blah
        widget
    blah blah blah
</test>
<formula>
    blah
    <details>
        widget
    </details>
</formula>
I am trying to get the following output from the sample text above, i.e.:
<test>widget</test>
<formula>widget</formula>
Effectively, I'm trying to get a single line with the highest level of markup tags that apply to the block of XML text/code surrounding the arbitrary string, widget.
Does anyone have any suggestions for implementing this via a command-line one-liner?
Thank you.
A non-elegant way using both sed and awk:
sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}' file.txt | awk 'NR%2==1 { sub(/^[ \t]+/, ""); search = $0 } NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }'
Results:
<test>widget</test>
<formula>widget</formula>
Explanation:
## The sed pipe:
sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}'
## The range starts at a line matching widget (ignoring case) and ends at the
## next closing tag that begins a line, i.e. the highest-level closing tag.
## The empty pattern // reuses the last regex tried, so only the two boundary
## lines are printed for each match
## Now the awk pipe:
NR%2==1 { sub(/^[ \t]+/, ""); search = $0 }
## This takes the first line (the widget pattern) and removes leading
## whitespace, saving the pattern in 'search'
NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }
## This finds the next line (which is even), and stores the markup tag in 'end'
## We then remove the slash from this tag and print it, the widget pattern, and
## the saved markup tag
HTH
sed -nr '/^(<[^>]*>).*/{s//\1/;h};/widget/{g;p}' test.xml
prints
<test>
<formula>
A sed-only one-liner would be more complex if it had to print the exact format you want.
EDIT:
You could use /widget/I instead of /widget/ for case-insensitive matches of widget in GNU sed; otherwise use [Ww] for every letter, as in the other answer.
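As a sketch of that substitution, here is the command above with the I modifier added, wrapped in a function that reads standard input. The name top_tag_ci is hypothetical. It assumes GNU sed (I on an address is a GNU extension) and input where top-level tags start in column one while nested content is indented:

```shell
top_tag_ci() {
  # Hold the most recent tag that starts a line; when a line matches
  # widget in any letter case, fetch the held tag and print it.
  sed -nr '/^(<[^>]*>).*/{s//\1/;h};/widget/I{g;p}'
}
```

For example, top_tag_ci < test.xml would also catch Widget or WIDGET.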
This might work for you (GNU sed):
sed '/^<[^/]/!d;:a;/^<\([^>]*>\).*<\/\1/!{$!N;ba};/^<\([^>]*>\).*\(widget\).*<\/\1/s//<\1\2<\/\1/p;d' file
Needs gawk, which allows a regexp in RS:
BEGIN {
    # make a stream of words
    RS="(\n| )"
}
# match </tag>
/<\// {
    s--
    next
}
# match <tag>
/</ {
    if (!s) {
        tag=substr($0, 2)
    }
    s++
}
$0=="widget" {
    print "<" tag $0 "</" tag
}
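For awks other than gawk, the same depth-counting idea can be sketched line by line instead of word by word. This is only an illustration under stated assumptions: the function name outer_tag_for is made up, tags are assumed not to span lines, and it is not a real XML parser (CDATA sections or a > inside a comment or attribute value would confuse it).

```shell
outer_tag_for() {
  # $1: keyword (matched literally, ignoring case); input: XML-like text on stdin
  awk -v kw="$1" '
    BEGIN { kw = tolower(kw) }
    {
      inside = (depth > 0)      # already within a top-level element?
      line = $0
      # Walk every tag on the line, adjusting the nesting depth.
      while (match(line, /<[^>]+>/)) {
        t = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)
        if (t ~ /^<[?!]/ || t ~ /\/>$/) continue   # declaration, comment, self-closing
        if (t ~ /^<\//) { depth--; continue }      # closing tag
        if (depth == 0) {                          # opening tag at top level
          top = t
          sub(/^</, "", top)
          sub(/[ \t>].*$/, "", top)                # keep only the tag name
          inside = 1
        }
        depth++
      }
      p = index(tolower($0), kw)
      if (p && inside) { print "<" top ">" substr($0, p, length(kw)) "</" top ">" }
    }
  '
}
```

Usage would look like outer_tag_for widget < test.xml. Tracking depth per tag rather than per line means nested tags do not need to be indented for the top-level tag to be found.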