
How to remove HTML elements from a string but exclude a specific element with regex

I have the string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'.

I need to remove the HTML tags and keep the text:

import re
p = re.compile(r'\s*<[^>]+>\s*')  # raw string, so \s is not treated as a string escape
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)

OUTPUT: TEST1TEST2TEST3

But this removes every HTML element. How should I change the regex so that the output looks like this:

OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>

You can use so-called "negative lookaheads".

In your case, you can exclude the opening <a and the closing </a> from the match:

(?!<a )(?!<\/a>)<[^>]+>

Note the trailing space in "<a " and the closing angle bracket in "</a>": they ensure that only the opening and closing tags of an <a> element are excluded, and not other tags whose names merely begin with an a, such as <abbr>.
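
As a minimal sketch, here is how the lookahead pattern could be dropped into the code from the question. The names TAG_EXCEPT_ANCHOR and strip_tags_keep_anchor are made up for this example. The \s* parts of the original pattern are left out on purpose; otherwise the spaces between the words would be consumed along with the tags.

```python
import re

# Matches any tag except the opening "<a " and the closing "</a>".
# Hypothetical name, chosen only for this illustration.
TAG_EXCEPT_ANCHOR = re.compile(r'(?!<a )(?!<\/a>)<[^>]+>')


def strip_tags_keep_anchor(html):
    """Remove every tag except <a ...> and </a>, leaving the text in place."""
    return TAG_EXCEPT_ANCHOR.sub('', html)


source = '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
print(strip_tags_keep_anchor(source))
# OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>
```

Because the first lookahead requires a space after <a, a bare <a> without attributes would still be removed, so this only fits input where links always carry at least one attribute such as href.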
