Extracting a specific tag from the HTML output of a Python script

Question

I have a program whose output should be piped to grep. The output of my program looks something like this:

<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=

and so on...

I run the Python script like this:

./python.py | grep -Po '(?<=<cite>)([^</cite>])'

in order to grep everything between the cite tags, but it does not give me what I expect.

Can you help me?

Answer 1

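You need to make proper use of the lookaround feature. Your lookbehind is fine, but the lookahead is not: [^</cite>] is a character class that matches a single character other than <, /, c, i, t, e or >, so it is not a lookahead at all. Try this: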

grep -Po "(?<=<cite>).*?(?=</cite>)"

Example:

 echo '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=' | grep -Po "(?<=<cite>).*?(?=</cite>)"
 www.site.com/sdds/ass
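
For comparison, the same lookaround pattern can be tried with Python's re module, which may be convenient if the extraction is moved into the script itself instead of a grep pipe. This is only a sketch: the sample string is the one from the question, and the names CITE_RE, BROKEN_RE and extract_cites are made up for illustration. Note that . does not match newlines by default, so a cite element spanning several lines would be missed.

```python
import re

# Lookbehind anchors the match right after "<cite>"; the lazy .*? stops
# at the first position that is followed by "</cite>".
CITE_RE = re.compile(r"(?<=<cite>).*?(?=</cite>)")

# The pattern from the question, kept only to show what goes wrong:
# the character class consumes exactly one character.
BROKEN_RE = re.compile(r"(?<=<cite>)([^</cite>])")


def extract_cites(html):
    """Return the text between each <cite>...</cite> pair (hypothetical helper)."""
    return CITE_RE.findall(html)


sample = ('<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc">'
          '<div class="sb_tlst"><h3><a href=')

print(extract_cites(sample))      # ['www.site.com/sdds/ass']
print(BROKEN_RE.findall(sample))  # ['w']
```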

Disclaimer: It's bad practice to parse XML/HTML with regex. You should probably use a parser like xmllint instead.
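
Since the output already comes from a Python script, one way to follow the advice above is to use the standard library's html.parser module rather than a regex. The following is a minimal sketch under that assumption, not the answerer's method; the class name CiteExtractor and the function extract_cites_parsed are hypothetical.

```python
from html.parser import HTMLParser


class CiteExtractor(HTMLParser):
    """Collect the text content of every <cite> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_cite = False   # True while we are inside <cite>...</cite>
        self._buf = []          # text chunks of the current <cite>
        self.cites = []         # finished results

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self._in_cite = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "cite" and self._in_cite:
            self.cites.append("".join(self._buf))
            self._in_cite = False

    def handle_data(self, data):
        if self._in_cite:
            self._buf.append(data)


def extract_cites_parsed(html):
    """Feed the markup to the parser and return the collected cite texts."""
    parser = CiteExtractor()
    parser.feed(html)
    parser.close()
    return parser.cites


print(extract_cites_parsed("<p><cite>www.site.com/sdds/ass</cite> and more</p>"))
```

The parser tolerates messy markup, and unlike the single-line regex it also handles a cite element whose text is split across lines.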

Answer 2

You could also use sed. But it's bad practice to parse XML/HTML with regex.

 sed -r 's/^<cite>([^<]*)<\/cite>.*/\1/g' file

Output:

www.site.com/sdds/ass
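
The capture-group idea behind the sed expression can also be sketched in Python. The names LINE_RE and cites_from_lines are hypothetical. One difference to be aware of: sed prints lines that do not match unchanged, whereas this sketch skips them.

```python
import re

# Like the sed expression: the line must start with <cite>, and the group
# captures everything up to the next "<".
LINE_RE = re.compile(r"^<cite>([^<]*)</cite>")


def cites_from_lines(lines):
    """Return the captured cite text for every line that starts with <cite>."""
    found = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            found.append(match.group(1))
    return found


lines = [
    '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc">',
    '<div class="sb_tlst"><h3><a href=',
]
print(cites_from_lines(lines))  # ['www.site.com/sdds/ass']
```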

2 answers

Answer 1
1, accepted, 2014-05-22 09:40:17

Answer 2
1, 2014-05-22 10:35:59
