How to remove HTML elements from a string but exclude a specific element with regex

I have a string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'.

I need to remove the HTML tags and keep the text:

import re
p = re.compile(r'\s*<[^>]+>\s*')
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)

OUTPUT: TEST1TEST2TEST3

But this removes every HTML element. How should I change the regex so that the output looks like this:

OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>

You can use so-called negative lookaheads.

In your case, you can exclude <a and </a> from the match:

(?!<a )(?!<\/a>)<[^>]+>

Note the space in <a and the closing angle bracket in </a>: they ensure that only the opening and closing tags of an <a> element are skipped, and not other tags whose names merely begin with a (such as <abbr>).
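As a minimal sketch, the pattern can be dropped into the code from the question like this. The leading and trailing \s* are left out here so that the spaces between the words survive. The names strip_tags_except_a and strip_tags_except_a_loose are made up for illustration, and the second pattern is not part of the answer: it is an assumed variant that also keeps a bare <a> tag without attributes, which the first pattern would remove.

```python
import re

# Pattern from the answer: match any tag unless it starts with "<a " or is "</a>".
TAG_EXCEPT_A = re.compile(r'(?!<a )(?!</a>)<[^>]+>')

# Assumed variant (not from the answer): \b makes sure the tag name is exactly "a",
# so "<a>" and "</a>" are kept while "<abbr>" is still removed.
TAG_EXCEPT_A_LOOSE = re.compile(r'<(?!/?a\b)[^>]+>')


def strip_tags_except_a(html):
    """Remove every tag except <a ...> and </a> (hypothetical helper)."""
    return TAG_EXCEPT_A.sub('', html)


def strip_tags_except_a_loose(html):
    """Same idea, but also keeps an attribute-less <a> (hypothetical helper)."""
    return TAG_EXCEPT_A_LOOSE.sub('', html)


source = '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
result = strip_tags_except_a(source)
print(result)  # TEST1 TEST2 <a href="#">TEST3</a>

loose_result = strip_tags_except_a_loose('<abbr>x</abbr> <a>y</a>')
print(loose_result)  # x <a>y</a>
```

Regular expressions only handle simple, well-formed fragments like this one reliably. For arbitrary HTML (comments, attributes containing ">", and so on) an HTML parser is usually the safer choice.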
