How to remove HTML elements from a string but exclude a specific element with regex
I have a string:
'<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
I need to remove the HTML tags and keep the text:
import re
p = re.compile(r'\s*<[^>]+>\s*')
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)
OUTPUT: TEST1TEST2TEST3
But this removes every HTML element. How should I change the regex so that the output looks like this:
OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>
You can use so-called negative lookaheads. In your case, you can exclude <a and </a> from the match:
(?!<a )(?!<\/a>)<[^>]+>
Note the space in "<a " and the closing angle bracket in "</a>": they ensure the lookaheads match only the opening and closing tags of an <a> element, and not any other tag whose name begins with a.
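As a minimal sketch of how the lookahead pattern could be used from Python, the snippet below applies it to the string from the question. The function names are hypothetical and chosen only for illustration. The surrounding \s* from the original pattern is dropped here so the spaces between the words survive. The second pattern is an assumed variation, not part of the answer above: it uses a word boundary instead of a space, so a bare <a> with no attributes is also kept.

```python
import re

# Pattern from the answer: skip "<a ..." and "</a>", match every other tag.
KEEP_A = re.compile(r'(?!<a )(?!</a>)<[^>]+>')

# Assumed variation: \b keeps "<a>" (no attributes) as well,
# while still removing tags such as <abbr> whose names merely start with "a".
KEEP_A_BOUNDARY = re.compile(r'<(?!/?a\b)[^>]+>')


def strip_tags_except_a(html):
    """Remove all tags except <a ...> and </a> (hypothetical helper)."""
    return KEEP_A.sub('', html)


def strip_tags_except_a_boundary(html):
    """Same idea, using the word-boundary variation (hypothetical helper)."""
    return KEEP_A_BOUNDARY.sub('', html)


html = '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
print(strip_tags_except_a(html))
# TEST1 TEST2 <a href="#">TEST3</a>
```

With the first pattern, an opening <a> that has no attributes would still be removed, because it does not contain the space the lookahead checks for. The word-boundary variation is one way to cover that case. For anything beyond simple, well-formed snippets, an HTML parser is likely a safer choice than a regex.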