
Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:

<p><span class=love><p>miracle</p>...</span></p><br>love</br>

And I want to remove the string:

<span class=love><p>miracle</p>...</span>

and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.

The result should be like this:

<p></p><br>love</br>

How can I do this with a regex pattern? What I have tried:

import re
r = re.compile(r'<span class=love>.*?(?=</span>)')
s = r.sub('', s)

but it leaves the closing tag behind:

</span>

Can you help me do this with the re module for now? I will learn an HTML parser next.

First things first: Don't parse HTML using regular expressions.

That being said, if there is no additional span tag within that span tag, then you could do it like this:

text = re.sub('<span class=love>.*?</span>', '', text)
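
As a minimal sketch of that one-liner in context, the snippet below wraps it in a helper and runs it on the string from the question. The function name remove_love_spans and the re.DOTALL flag are assumptions added for illustration (the flag only matters if the span contains newlines). The second call shows the nested-span caveat mentioned above.

```python
import re

def remove_love_spans(html):
    # Non-greedy match from the opening tag to the first closing </span>.
    # re.DOTALL is an addition here so that "." also matches newlines.
    return re.sub(r'<span class=love>.*?</span>', '', html, flags=re.DOTALL)

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'
print(remove_love_spans(s))  # <p></p><br>love</br>

# Caveat: a nested span ends the match too early and leaves debris behind.
nested = '<p><span class=love>a<span>b</span>c</span></p>'
print(remove_love_spans(nested))  # <p>c</span></p>
```

The nested case is the reason regular expressions are a poor fit for HTML in general: the pattern cannot tell which closing tag belongs to which opening tag.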

On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).


The expression you have tried, <span class=love>.*?(?=</span>) , is already quite good. The problem is that the lookahead (?=</span>) is a zero-width assertion: it checks that </span> follows but never consumes it. So the match stops immediately before the closing span tag, and that tag is left in the result. You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span> , but that's not really necessary: .*? is a non-greedy expression, so it tries to match as little as possible. In .*?</span> the .*? only matches up to the first closing span tag and stops there.
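
The difference can be checked directly. The following is a small sketch using the string from the question; the variable names and the second test string s2 are made up for illustration.

```python
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Lookahead version: the match ends just before </span>, so the tag survives.
with_lookahead = re.compile(r'<span class=love>.*?(?=</span>)')
print(with_lookahead.search(s).group())  # <span class=love><p>miracle</p>...
print(with_lookahead.sub('', s))         # <p></span></p><br>love</br>

# Consuming version: </span> is part of the match and is removed with it.
consuming = re.compile(r'<span class=love>.*?</span>')
print(consuming.search(s).group())  # <span class=love><p>miracle</p>...</span>
print(consuming.sub('', s))         # <p></p><br>love</br>

# Greedy vs. non-greedy on a hypothetical string with a second span.
s2 = '<span class=love>a</span><b>keep</b><span>x</span>'
greedy = re.compile(r'<span class=love>.*</span>')
print(repr(greedy.sub('', s2)))  # '' (runs to the last </span>)
print(consuming.sub('', s2))     # <b>keep</b><span>x</span>
```

The greedy .* runs to the last closing span in the string, which is why the non-greedy .*? is the right choice here.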
