Python regex: remove certain HTML tags and the contents in them

Question

If I have a string that contains this:

<p><span class=love><p>miracle</p>...</span></p><br>love</br>

And I want to remove the string:

<span class=love><p>miracle</p>...</span>

and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.

The result should be like this:

<p></p><br>love</br>

How can I do this using a regex pattern? What I have tried:

r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)

but it will leave the

</span>

Can you help me do this using the re module for now? I will learn an HTML parser next.

Answer 1

First things first: don't parse HTML using regular expressions.

That being said, if there is no additional span tag within that span tag, then you could do it like this:

import re
text = re.sub(r'<span class=love>.*?</span>', '', text)
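
As a quick check, here is a minimal sketch that applies this substitution to the string from the question. The names s, pattern, cleaned and nested are only for illustration, and the nested string is a made-up input to show the caveat about a span inside the span.

```python
import re

# Sample string from the question.
s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Non-greedy match from the opening span up to and including
# the first closing span that follows it.
pattern = re.compile(r'<span class=love>.*?</span>')

cleaned = pattern.sub('', s)
print(cleaned)  # <p></p><br>love</br>

# Caveat: with a nested span the match ends at the FIRST </span>,
# so the rest of the outer span is left behind.
nested = '<p><span class=love>a<span>b</span>c</span></p>'
print(pattern.sub('', nested))  # <p>c</span></p>
```

Note that . does not match newlines by default, so if the span contents can run over several lines the pattern would need the re.DOTALL flag.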

On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).

The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) never consumes what it looks ahead for, so the match stops immediately before the closing span tag. You could now manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: .*? is a non-greedy expression that tries to match as little as possible. So in .*?</span>, the .*? only matches up to the first closing span, where it stops immediately.
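
The difference between the three variants can be seen side by side in the following sketch, which again uses the string from the question. The variable names are only illustrative.

```python
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Lookahead only: checks that </span> follows, but does not consume it,
# so the closing tag stays in the result.
with_lookahead = re.sub(r'<span class=love>.*?(?=</span>)', '', s)
print(with_lookahead)  # <p></span></p><br>love</br>

# Lookahead plus an explicit closing tag: works, but the lookahead is redundant.
explicit = re.sub(r'<span class=love>.*?(?=</span>)</span>', '', s)

# Plain non-greedy match that includes the closing tag.
simple = re.sub(r'<span class=love>.*?</span>', '', s)

print(explicit == simple)  # True
print(simple)  # <p></p><br>love</br>
```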

1 answer

solution1
7 ACCEPTED 2013-07-05 12:27:40
