Extract a specific tag from the HTML output of a Python script
I have a program whose output should be piped to the grep command. The output of my program is something like this:
<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=
and so on...
I run a Python script:
./python.py | grep -Po '(?<=<cite>)([^</cite>])'
in order to grep everything between the cite tags...
Can you help me?
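You need to make proper use of the lookaround feature: your lookbehind is fine, but what follows it is not a lookahead. [^</cite>] is a negated character class, so it matches a single character that is not one of <, /, c, i, t, e or >. Try this: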
grep -Po "(?<=<cite>).*?(?=</cite>)"
Example:
echo '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=' | grep -Po "(?<=<cite>).*?(?=</cite>)"
www.site.com/sdds/ass
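To see why the original pattern fails and the lazy match works, here is a minimal sketch using Python's re module, which accepts the same lookaround syntax as grep -P for these two patterns. The variable names are only for illustration.

```python
import re

# Sample input copied from the question.
html = '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href='

# Original attempt: [^</cite>] is a negated character class, so it matches
# exactly ONE character that is not '<', '/', 'c', 'i', 't', 'e' or '>'.
original = re.findall(r'(?<=<cite>)([^</cite>])', html)

# Corrected pattern: a lazy match bounded by a lookbehind and a lookahead.
fixed = re.findall(r'(?<=<cite>).*?(?=</cite>)', html)

print(original)  # ['w']
print(fixed)     # ['www.site.com/sdds/ass']
```

The original pattern stops after the first character following <cite>, which is why the grep command in the question does not return the full URL.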
Disclaimer: It's bad practice to parse XML/HTML with regex. You should probably use a parser like xmllint instead.
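Since the output comes from a Python script anyway, a parser-based alternative can be sketched with the standard library's html.parser module. This is only one possible approach, not the one the answer uses; the names CiteExtractor and extract_cites are hypothetical helpers invented for this example.

```python
from html.parser import HTMLParser


class CiteExtractor(HTMLParser):
    """Collects the text found inside every <cite>...</cite> element."""

    def __init__(self):
        super().__init__()
        self._in_cite = False   # True while we are between <cite> and </cite>
        self._buf = []          # text pieces of the current <cite> element
        self.cites = []         # finished results

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self._in_cite = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "cite" and self._in_cite:
            self.cites.append("".join(self._buf))
            self._in_cite = False

    def handle_data(self, data):
        if self._in_cite:
            self._buf.append(data)


def extract_cites(html):
    """Return the text content of every <cite> element in html."""
    parser = CiteExtractor()
    parser.feed(html)
    parser.close()
    return parser.cites


# Sample input copied from the question (note that it ends mid-tag).
sample = '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href='
print(extract_cites(sample))
```

Unlike the regex, this keeps working when the cite element carries attributes or when several cite elements appear on one line.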
You could also use sed. But it's bad practice to parse XML/HTML with regex.
sed -r 's/^<cite>([^<]*)<\/cite>.*/\1/g' file
Output:
www.site.com/sdds/ass
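Note that the sed expression is anchored with ^, so it only rewrites lines that begin with <cite>. The sketch below mirrors the substitution with Python's re.sub to illustrate that limitation; sed_like is a hypothetical helper name, and the second input line is made up for the demonstration.

```python
import re

# Python equivalent of the sed substitution, applied to one line at a time.
pattern = re.compile(r'^<cite>([^<]*)</cite>.*')


def sed_like(line):
    """Replace the whole line with the <cite> text if the line starts with <cite>."""
    return pattern.sub(r'\1', line)


# Line starting with <cite>: the whole line is replaced by the captured text.
print(sed_like('<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc">'))

# Line with leading markup: the anchor fails and the line is left unchanged.
print(sed_like('<li><cite>www.site.com/sdds/ass</cite></li>'))
```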