Python regex: remove certain HTML tags and the contents in them
If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags. At the same time, the other tags and the contents in them should be preserved.
The result should be like this:
<p></p><br>love</br>
I want to know how to do this using a regex pattern. What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it leaves the closing tag behind:
</span>
Can you help me do this using the re module this time? I will learn an HTML parser next.
First things first: Don't parse HTML using regular expressions.
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
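As a quick check, here is a minimal sketch that applies the substitution above to the sample string from the question. The variable names and the re.DOTALL flag are additions for illustration (the flag only matters if the span's contents contain newlines); the second case shows the nested-span limitation mentioned above.

```python
import re

# Sample string from the question.
s = "<p><span class=love><p>miracle</p>...</span></p><br>love</br>"

# Non-greedy match from the opening tag through the first closing </span>.
# re.DOTALL is an assumption added here so that '.' also matches newlines.
pattern = re.compile(r"<span class=love>.*?</span>", re.DOTALL)

result = pattern.sub("", s)
print(result)  # <p></p><br>love</br>

# Limitation: with a nested span, the match ends at the inner </span>,
# so part of the outer element is left behind.
nested = "<span class=love>a<span>b</span>c</span>"
nested_result = pattern.sub("", nested)
print(nested_result)  # c</span>
```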
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is allowed there).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that a lookahead such as (?=</span>) only checks that the text follows; it never consumes what it looks ahead for. So the match stops immediately before the closing span tag, and that tag is left behind. You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: .*? is a non-greedy expression, so it tries to match as little as possible. In .*?</span>, the .*? therefore only matches up to the first closing span, where it stops immediately.
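The difference between the lookahead and the consuming version can be seen by inspecting what each pattern actually matches. This is only an illustrative sketch. The variable names are made up here, and the greedy pattern at the end is added purely for contrast.

```python
import re

s = "<p><span class=love><p>miracle</p>...</span></p><br>love</br>"

# Lookahead: asserts that </span> follows but does not include it in the match.
lookahead = re.search(r"<span class=love>.*?(?=</span>)", s)
print(lookahead.group())  # <span class=love><p>miracle</p>...

# Consuming version: the closing tag is part of the match, so sub() removes it too.
consuming = re.search(r"<span class=love>.*?</span>", s)
print(consuming.group())  # <span class=love><p>miracle</p>...</span>

# Greedy vs. non-greedy on a string with two closing span tags.
two_spans = "<span class=love>a</span>b<span>c</span>"
non_greedy = re.search(r"<span class=love>.*?</span>", two_spans).group()
greedy = re.search(r"<span class=love>.*</span>", two_spans).group()
print(non_greedy)  # <span class=love>a</span>
print(greedy)      # <span class=love>a</span>b<span>c</span>
```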