
Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:

<p><span class=love><p>miracle</p>...</span></p><br>love</br>

And I want to remove the string:

<span class=love><p>miracle</p>...</span>

and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.

The result should be like this:

<p></p><br>love</br>

How can I do this with a regex pattern? This is what I have tried:

r = re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('', s)

but it will leave the

</span>

Can you help me do this with the re module this time? I will learn an HTML parser next.

First things first: don't parse HTML using regular expressions.

That being said, if there is no additional span tag within that span tag, then you could do it like this:

text = re.sub(r'<span class=love>.*?</span>', '', text)

On a side note: paragraph tags are not supposed to go inside span tags (only phrasing content is allowed there).
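
As a rough sketch of the substitution above applied to the string from the question, and of the nested-span caveat: the variable names and the nested sample string are made up for illustration, and re.DOTALL is an added assumption so that the pattern would also work if the span contained line breaks.

```python
import re

# The string from the question.
s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Non-greedy match from the opening tag through the first closing </span>.
# re.DOTALL (an assumption, not in the original answer) lets '.' match newlines.
pattern = re.compile(r'<span class=love>.*?</span>', re.DOTALL)

result = pattern.sub('', s)
print(result)  # <p></p><br>love</br>

# Hypothetical input with a span nested inside the target span:
# the match ends at the FIRST </span>, so part of the content is left behind.
nested = '<span class=love>a<span>b</span>c</span>d'
nested_result = pattern.sub('', nested)
print(nested_result)  # c</span>d
```

The second call shows why the answer is conditional on there being no inner span: a regex like this cannot tell which closing tag belongs to which opening tag.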


The expression you tried, <span class=love>.*?(?=</span>) , is already quite close. The problem is that a lookahead such as (?=</span>) only checks that the text follows; it never consumes what it looks ahead for. So the match stops immediately before the closing span tag, and the substitution leaves </span> behind. You could add a closing span after the lookahead, i.e. <span class=love>.*?(?=</span>)</span> , but then the lookahead is redundant: .*? is non-greedy, so it matches as little as possible. In .*?</span> the .*? expands only until the first closing span tag is found and stops there.
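
To make the difference concrete, here is a small sketch comparing the three patterns discussed in this answer on the string from the question (the variable names are chosen only for this illustration):

```python
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Lookahead only: the match ends just before </span>, which is not consumed.
with_lookahead = re.sub(r'<span class=love>.*?(?=</span>)', '', s)
print(with_lookahead)      # <p></span></p><br>love</br>

# Lookahead followed by the literal closing tag: works, but is redundant.
lookahead_plus_tag = re.sub(r'<span class=love>.*?(?=</span>)</span>', '', s)
print(lookahead_plus_tag)  # <p></p><br>love</br>

# Plain non-greedy match that consumes the closing tag itself.
plain = re.sub(r'<span class=love>.*?</span>', '', s)
print(plain)               # <p></p><br>love</br>
```

The last two patterns give the same result here, which is why the simpler one without the lookahead is usually preferable.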

