What's the fastest way to strip certain HTML tags in a Python string?
I'd like to strip all HTML / JavaScript except for:
<b></b>
<ul></ul>
<li></li>
<a></a>
Thanks.
Do you want a way that's fast or a way that's correct? A regex-based approach is unlikely to be correct and may open you up to XSS attacks. You should use an HTML parser like Beautiful Soup or even htmllib (Python 2 only; in Python 3 the standard-library equivalent is html.parser).
Also, <a> can contain javascript: hrefs, and there are also the various on* attributes (onclick, onload, and so on), which are JavaScript. You probably want to strip all of those out. In general, a whitelist approach is best: only keep attributes (and attribute values) you know are safe.
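As a rough illustration of the whitelist idea, here is a minimal sketch built on the standard-library html.parser module. The names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES, is_safe_url, WhitelistSanitizer and sanitize are all made up for this example, and the policy (the four tags from the question, href as the only allowed attribute, http/https/mailto or relative URLs) is an assumption rather than a vetted security configuration.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed policy for illustration: the four tags from the question,
# with href on <a> as the only permitted attribute.
ALLOWED_TAGS = {"b", "ul", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = {"http", "https", "mailto"}
SKIP_CONTENT = {"script", "style"}  # drop the text inside these as well
SCHEME_RE = re.compile(r"([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_safe_url(url):
    # Remove whitespace and control characters, which browsers may
    # ignore inside a scheme (e.g. "java\tscript:").
    cleaned = "".join(ch for ch in url if ch > " ")
    match = SCHEME_RE.match(cleaned)
    if match is None:
        return True  # no scheme, so treat it as a relative URL
    return match.group(1).lower() in SAFE_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops on* handlers and anything not listed
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript: and other unknown schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling sanitize() on a fragment keeps the whitelisted tags, drops every other tag while keeping its text, and discards the contents of script and style elements. It does not try to balance unclosed tags, so for untrusted input in production a maintained sanitizing library is probably the safer choice.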
While I agree with Laurence, there are occasions where a quick and dirty 99% approach gets the job done without creating other problems.
Here's an example that demonstrates a regex-based approach:
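import re

# Matches any tag; group 1 is the optional closing slash, group 2 the tag body.
CLEANBODY_RE = re.compile(r'<(/?)(.+?)>', re.M)

def _repl(match):
    # The tag name is everything up to the first space in the tag body.
    tag = match.group(2).split(' ')[0].lower()
    if tag == 'p':
        # Keep <p> and </p>, but drop any attributes.
        return '<%sp>' % match.group(1)
    elif tag in ('a', 'br', 'ul', 'li', 'b', 'strong', 'em', 'i'):
        # Kept verbatim, attributes included (so on* handlers and
        # javascript: hrefs pass through unchanged).
        return match.group(0)
    # Any other tag is removed; its text content stays.
    return ''

def cleanbody(html):
    return CLEANBODY_RE.sub(_repl, html)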
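Replace the elements you want to keep with placeholder values, then run a regex over all the remaining <.*> tags to remove them, and finally replace the placeholders with the corresponding HTML elements.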
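The placeholder idea could be sketched roughly as follows. This is only one possible reading of the suggestion: the function name strip_with_placeholders, the NUL-byte sentinel, and the decision to drop all attributes from kept tags are assumptions made for the example. Like any regex approach, it can be fooled by a ">" inside an attribute value, and it leaves the text inside script elements in place.

```python
import re

# Tags to keep, taken from the question.
KEEP = ("b", "ul", "li", "a")

# An opening or closing kept tag, with or without attributes.
KEEP_RE = re.compile(r"<(/?)(%s)(?=[\s/>])[^>]*>" % "|".join(KEEP), re.I)
# Anything else that looks like a tag.
ANY_TAG_RE = re.compile(r"<[^>]*>")
# Placeholder form: NUL, optional slash, tag name, NUL.
PLACEHOLDER_RE = re.compile(r"\x00(/?)(\w+)\x00")


def strip_with_placeholders(html):
    # Make sure the sentinel character cannot already be in the input.
    html = html.replace("\x00", "")
    # Step 1: swap kept tags for placeholders (attributes are dropped).
    html = KEEP_RE.sub(
        lambda m: "\x00%s%s\x00" % (m.group(1), m.group(2).lower()), html
    )
    # Step 2: remove every remaining tag.
    html = ANY_TAG_RE.sub("", html)
    # Step 3: turn the placeholders back into bare tags.
    return PLACEHOLDER_RE.sub(r"<\1\2>", html)
```

Because step 1 discards attributes, links come back as bare <a> tags; keeping href would need the same kind of value checking discussed in the parser-based answer.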