What is the fastest way to strip certain HTML tags from a Python string?

Question

I'd like to strip all HTML / JavaScript except for:

<b></b>
<ul></ul>
<li></li>
<a></a>

Thanks.

Answer 1

Do you want a way that's fast or a way that's correct? A regex-based approach is unlikely to be correct and may open you up to XSS attacks.

You should use an HTML parser like Beautiful Soup or even htmllib (removed in Python 3, where the standard library offers html.parser instead).

Also, <a> can contain javascript: hrefs, and there are the various on* attributes, which are JavaScript too. You probably want to strip all of those out. In general, a whitelist approach is best: only keep attributes (and attribute values) you know are safe.
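
As a rough illustration of the whitelist idea, here is a minimal sketch built on the standard library's html.parser. The names WhitelistSanitizer, sanitize and is_safe_url, the choice to keep only href on <a>, and the list of allowed URL schemes are all assumptions made for this example, not something the question specified. For untrusted input in production, a maintained sanitizing library is likely the safer choice.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: the four tags from the question, keeping only href on <a>.
ALLOWED_TAGS = {"b", "ul", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}  # assumption for this sketch
SKIP_CONTENT = {"script", "style"}  # the text inside these is dropped as well

SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_safe_url(url):
    # Drop whitespace and control characters first, since browsers may
    # ignore them inside a scheme such as "java\tscript:".
    cleaned = "".join(ch for ch in url if ch > " ")
    match = SCHEME_RE.match(cleaned)
    # No scheme means a relative URL; otherwise the scheme must be whitelisted.
    return match is None or match.group(1).lower() in ALLOWED_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # on* handlers and everything else fall out here
            if name == "href" and not is_safe_url(value):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped so it cannot reintroduce markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Calling sanitize('<b onclick="x()">hi</b>') gives '<b>hi</b>', and a javascript: href is dropped while the <a> element itself is kept. Comments and doctype declarations are discarded because the sketch does not override those handlers.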

Answer 2

While I agree with Laurence, there are occasions where a quick-and-dirty 99% approach gets the job done without creating other problems.

Here's an example that demonstrates a regex-based approach:

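import re

# group(1) is the optional closing slash, group(2) is everything else inside the tag.
CLEANBODY_RE = re.compile(r'<(/?)(.+?)>', re.M)

def _repl(match):
    # The tag name is the first space-separated token of the tag body.
    tag = match.group(2).split(' ')[0]
    if tag == 'p':
        # Keep <p> and </p>, but drop any attributes.
        return '<%sp>' % match.group(1)
    elif tag in ('a', 'br', 'ul', 'li', 'b', 'strong', 'em', 'i'):
        # Whitelisted tags are kept verbatim, attributes included.
        return match.group(0)
    # Every other tag is removed; the text between tags is left alone.
    return ''

def cleanbody(html):
    return CLEANBODY_RE.sub(_repl, html)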
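
Note that the snippet above returns whitelisted tags unchanged, so an onclick attribute or a javascript: href on an <a> passes straight through. One possible tightening, sketched here under the assumption that losing every attribute (including href) is acceptable, is to rebuild each allowed tag from its name alone. The names TAG_RE, ALLOWED and cleanbody_strict are made up for this example.

```python
import re

# Optional closing slash, then the tag body (anything up to the next bracket).
TAG_RE = re.compile(r"<(/?)([^<>]+)>")
ALLOWED = {"a", "br", "ul", "li", "b", "strong", "em", "i", "p"}


def _repl(match):
    parts = match.group(2).split()
    # Normalize the name: lowercase, and drop the slash of forms like <br/>.
    name = parts[0].rstrip("/").lower() if parts else ""
    if name in ALLOWED:
        # Re-emit a bare tag; all attributes are discarded.
        return "<%s%s>" % (match.group(1), name)
    return ""


def cleanbody_strict(html):
    return TAG_RE.sub(_repl, html)
```

This is still a quick-and-dirty approach: the text inside a removed <script> element stays in the output as plain text, and an attribute value containing > will confuse the pattern.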

Answer 3

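Replace the elements you want to keep with placeholder values, then run a regex over all remaining <.*> tags to remove them, and finally replace the placeholders with the corresponding HTML elements.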
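
The three steps above could look roughly like the following sketch. The function name strip_with_placeholders, the NUL-delimited placeholder format and the KEEP tuple are assumptions for illustration. It uses <[^<>]*> rather than a greedy <.*>, which would otherwise swallow everything between the first < and the last >.

```python
import re

KEEP = ("b", "ul", "li", "a")  # the tags listed in the question

# Opening or closing tag whose name is exactly one of KEEP.
KEEP_RE = re.compile(r"</?(?:%s)(?=[\s/>])[^<>]*>" % "|".join(KEEP), re.I)
ANY_TAG_RE = re.compile(r"<[^<>]*>")
PLACEHOLDER_RE = re.compile(r"\x00(\d+)\x00")


def strip_with_placeholders(html):
    # Remove NUL characters so the input cannot forge a placeholder.
    html = html.replace("\x00", "")
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    # 1. Swap the tags to keep for numbered placeholders.
    html = KEEP_RE.sub(stash, html)
    # 2. Strip every remaining tag.
    html = ANY_TAG_RE.sub("", html)
    # 3. Put the saved tags back.
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], html)
```

Kept tags are restored verbatim, attributes and all, so the warning in Answer 1 about javascript: hrefs and on* attributes applies to this sketch as well.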

3 answers

Answer 1: 4, accepted, 2010-12-12 00:04:57

Answer 2: 1, 2011-11-11 06:01:02

Answer 3: 0, 2010-12-11 23:28:48
