How to remove html elements from a string but exclude a specific element with regex

Question

I have a string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'

I need to remove html tags and leave the text

import re
p = re.compile(r'\s*<[^>]+>\s*')  # raw string, so \s is not treated as a string escape
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)

OUTPUT: TEST1TEST2TEST3

But this removes every HTML tag. How should I change the regex so that the <a> element is kept and the output looks like this:

OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>

Answer 1

You can use a so-called "negative lookahead".

In your case, you can exclude the opening <a and the closing </a> from the match:

(?!<a )(?!<\/a>)<[^>]+>

Note the space after <a and the closing angle bracket in </a>: they ensure that only the opening and closing tags of an <a> element are excluded from the match, and not other tags whose names merely begin with an a (such as <abbr>).
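
For illustration, here is a minimal sketch of the lookahead pattern applied to the string from the question. The names KEEP_A, KEEP_A_STRICT and strip_tags_except_a are hypothetical and chosen only for this example. KEEP_A_STRICT is an assumed variant, not part of the original answer: it uses a word boundary so that a bare <a> without attributes is also preserved.

```python
import re

# Pattern from the answer: do not match tags that start with "<a " or are "</a>".
KEEP_A = re.compile(r'(?!<a )(?!</a>)<[^>]+>')

# Assumed variant (not from the answer): \b also keeps a bare "<a>",
# while still removing tags such as <abbr> that only begin with "a".
KEEP_A_STRICT = re.compile(r'<(?!/?a\b)[^>]+>')

def strip_tags_except_a(html, pattern=KEEP_A):
    """Remove every tag matched by `pattern`, leaving text and <a> tags in place."""
    return pattern.sub('', html)

html = '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
print(strip_tags_except_a(html))
# prints: TEST1 TEST2 <a href="#">TEST3</a>
```

Unlike the pattern in the question, this one has no surrounding \s*, so the spaces between the elements survive. Regular expressions of this kind are only suitable for simple, well-formed fragments; for arbitrary HTML a real parser is usually the safer choice.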
