Sanitizing untrusted HTML5

Question

I want to be able to accept HTML from untrusted users and sanitize it so that I can safely include it in pages on my website. By this I mean that markup should not be stripped or escaped, but should be passed through essentially unchanged unless it contains dangerous tags such as <script> or <iframe>, dangerous attributes such as onload, or dangerous CSS properties such as background URLs. (Apparently some older IEs will execute javascript URLs in CSS?)

Serving the content from a different domain, enclosed in an iframe, is not a good option because there is no way to tell in advance how tall the iframe has to be, so it will always look ugly for some pages.

I looked into HTML Purifier, but it looks like it doesn't support HTML5 yet. I also looked into Google Caja, but I'm looking for a solution that doesn't use scripts.

Does anyone know of a library that will accomplish this? PHP is preferred, but beggars can't be choosers.

Answer 1

The blacklisting approach puts you under upgrade pressure: each time browsers start to support new standards, you MUST bring your sanitizing tool up to the same level. Such changes happen more often than you think.

Whitelisting (which is achieved by strip_tags with well-defined exceptions) of course shrinks the options for your users, but puts you on the safe side.

On my own sites my policy is to apply blacklisting on pages for very trusted users (such as admins) and whitelisting on all other pages. That puts me in a position where I don't have to put much effort into the blacklisting. With more mature role and permission concepts you can even fine-tune your blacklists and whitelists.

UPDATE: I guess you are looking for this:

I get the point that strip_tags whitelists at the tag level but accepts everything at the attribute level. Interestingly, HTML Purifier seems to do the whitelisting at the attribute level as well. Thanks, I learned something here.
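To make the tag-level versus attribute-level distinction concrete, here is a minimal sketch in Python (the language Answer 6 uses), standard library only. The ALLOWED table, the AllowlistSanitizer class, and the sanitize function are hypothetical names invented for this illustration; this is not how strip_tags or HTML Purifier work internally. It also does not validate URL values such as href, which a real policy would need to do.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: allowed tags and, per tag, allowed attributes.
ALLOWED = {
    "p": set(),
    "b": set(),
    "i": set(),
    "a": {"href", "title"},
}


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; any text inside is still escaped and kept
        kept = []
        for name, value in attrs:
            if name in ALLOWED[tag]:
                kept.append(' {}="{}"'.format(name, escape(value or "", quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(sanitize('<p onload="x()">hi <b>there</b></p><script>alert(1)</script>'))
```

Because the output is rebuilt from the parsed events, anything not named in the policy never reaches the page, which is the property that makes whitelisting robust against new browser features.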

Answer 2

You might be able to do something along the lines of:

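// Assign the result: preg_replace returns the new string and does not modify $html in place.
// \b[^>]* also matches tags without attributes; the s flag lets .*? span line breaks.
$html = preg_replace('/<\s*iframe\b[^>]*>.*?<\s*\/\s*iframe\s*>/is', '', $html);
$html = preg_replace('/<\s*script\b[^>]*>.*?<\s*\/\s*script\s*>/is', '', $html);
$html = preg_replace('/\s+onload\s*=\s*"[^"]*"/i', '', $html);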

... but then again: now that you have regexes, you have two problems. This might remove more than wanted and leave more than wanted as well.
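As a small illustration of "leaving more than wanted", here is a sketch in Python (standard library only) with a pattern written in the spirit of the ones above. SCRIPT_RE and strip_scripts are names made up for this example. A single removal pass can be tricked into assembling a new script tag from the pieces around the match:

```python
import re

# A naive blocklist pattern in the spirit of the regexes above (illustrative only).
SCRIPT_RE = re.compile(r"<\s*script\b[^>]*>.*?<\s*/\s*script\s*>", re.I | re.S)


def strip_scripts(html_text):
    """Remove <script>...</script> blocks with a single regex pass."""
    return SCRIPT_RE.sub("", html_text)


# The straightforward case works as intended.
plain = strip_scripts('<p>ok</p><script>alert(1)</script>')

# Removing the inner block glues "<scr" and "ipt>" together into a fresh tag.
nested = strip_scripts('<scr<script>x</script>ipt>alert(1)</script>')

print(plain)
print(nested)
```

The second result still contains a complete script element, which is why parsing the markup and rebuilding it from an allowed set tends to be safer than deleting known-bad substrings.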

But since HTML Purifier is probably the most modern and best-suited (and open source) project, you should still use that one and make adjustments if you really need them.

You can check out one of the following as well:

kses - de facto standard, found its way into WordPress as well
htmLawed - a further-developed kses
PHP Input Filter - can filter tags and attributes

You also have to make sure that unclosed tags in the results don't break your own page layout when you include them.
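One way to guard against unclosed tags is to re-serialize the markup while tracking which elements are still open. The following is only a sketch in Python (standard library), assuming it runs on markup that has already been sanitized; TagBalancer, balance, and the VOID set are hypothetical names for this illustration, and the VOID set is deliberately incomplete.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration).
VOID = {"br", "hr", "img", "input"}


class TagBalancer(HTMLParser):
    """Re-emit markup and remember which elements are still open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<{}{}>".format(tag, rendered))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close everything opened after the matching start tag, then the tag itself.
        while self.stack:
            top = self.stack.pop()
            self.out.append("</{}>".format(top))
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(html_text):
    parser = TagBalancer()
    parser.feed(html_text)
    parser.close()
    while parser.stack:  # close whatever the input left open
        parser.out.append("</{}>".format(parser.stack.pop()))
    return "".join(parser.out)


print(balance("<div><b>bold <i>text</div> tail"))
```

With this in place, a user's forgotten closing tag ends inside their own content instead of swallowing the rest of the page.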

Answer 3

Maybe it's better to take a different approach? How about telling them what they can use?

In that case you can use strip_tags. It will be easier and a lot more controllable this way, and very easy to extend in the future as well.

Answer 4

In Ruby I'm using Nokogiri (there is a PHP version) to parse HTML content. You can parse the user's data, remove unnecessary tags or attributes, and then convert it back to text.

phpQuery is another parser.

And in PHP there is the strip_tags function.

Or you can manually remove attributes, for example every style attribute:

$dom = new DOMDocument;
$dom -> loadHTML( $html );
$xpath = new DOMXPath( $dom );
$nodes = $xpath -> query( "//*[@style]" ); // all elements with style attribute
foreach ( $nodes as $node ) {
    // remove or do what you want
    $node -> removeAttribute( "style" );
}
echo $dom -> saveHTML();
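The same DOM-walking idea can be extended from one attribute name to a rule. Below is a rough sketch in Python using xml.dom.minidom from the standard library; it assumes the fragment is well-formed XHTML, which real user input often is not, so a proper HTML parser would be needed in practice. The names strip_attributes and unwanted are made up for this example.

```python
from xml.dom import minidom


def strip_attributes(xhtml_fragment, is_unwanted):
    """Remove every attribute whose lowercased name satisfies is_unwanted."""
    # Wrap the fragment so it has a single root element, as XML requires.
    doc = minidom.parseString("<root>" + xhtml_fragment + "</root>")
    for element in doc.getElementsByTagName("*"):
        # Copy the names first, since attributes are removed while looping.
        for name in list(element.attributes.keys()):
            if is_unwanted(name.lower()):
                element.removeAttribute(name)
    # Serialize only the children of the artificial root.
    return "".join(child.toxml() for child in doc.documentElement.childNodes)


def unwanted(name):
    # Example rule: inline styles and all on* event handlers.
    return name == "style" or name.startswith("on")


print(strip_attributes('<p style="color:red" onclick="x()" title="t">hi</p>', unwanted))
```

Passing the rule in as a function keeps the tree walk separate from the policy, so the same loop can serve a blacklist like this one or an allowlist check.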

Answer 5

See the WdHTMLParser class. I use this class for my forum.

Sample with WdHTMLParser:

This class parses the HTML into an array:

<div>
    <span>
        <br />
        <span>
        un bout de texte
        </span>
        <input type="text" />
    </span>
</div>

Array:

Array (
 [0] => Array (
  [name] => div
  [args] => Array ()
  [children] => Array (
   [0] => Array (
    [name] => span
    [args] => Array ()
    [children] => Array (
     [0] => Array (
      [name] => br
      [args] => Array ()
     )
     [1] => Array (
      [name] => span
      [args] => Array ()
      [children] => Array (
       [0] => un bout de texte
      )
     )
     [2] => Array (
      [name] => input
      [args] => Array (
       [type] => text
      )
     )
    )
   )
  )
 )
)

WdHTMLParser array to HTML

I use this class on my website to convert the array back to HTML.

voyageWdHTML_allowattr: these attributes will be allowed.
voyageWdHTML_allowtag: these tags will be allowed.
voyageWdHTML_special: make your own rules. Currently, I add target="_blank" to each link and replace <br> with a newline (\n) inside pre tags.
fix_javascript: you can enable or disable this function, but it is not really needed.

Sample PHP:

<?php
include "WdHTMLParser.php";
include "parser.php";

list($erreur, $message) = (new Parser())->parseBadHTML("<div>
    <span>
        <a onclick=\"alert('Hacked ! :'(');\">Check javascript</a>
        <script>alert(\"lol\");</script>
    </span>
</div>");

if ($erreur) {
    die("Error : ".$message);
}

echo $message;

Output:

<div>
    <span>
        <a target="_blank">Check javascript</a>
        <pre>alert("lol");</pre>
    </span>
</div>

My Parser class:

<?php
class Parser {
    //private function fix_javascript(&$message) { }

    private function voyageWdHTML_args($tab_args, $objname) {
        $html = "";
        foreach ($tab_args as $attr => $valeur) {
            if ($valeur !== null && $this->voyageWdHTML_allowattr($attr)) {
                $html .= " $attr=\"".htmlentities($valeur)."\"";
            }
        }
        return $html;
    }

    private function voyageWdHTML_allowattr($attr) {
        return in_array($attr, array("align", "face", "size", "href", "title", "target", "src", "color", "style",
                                    "data-class", "data-format"));
    }

    private function voyageWdHTML_allowtag($name) {
        return in_array($name, array("br", "b", "i", "u", "strike", "sub", "sup", "div", "ol", "ul", "li", "font", "span", "code",
                                    "hr", "blockquote", "cite", "a", "img", "p", "pre", "h6", "h5", "h4", "h3", "h2", "h1"));
    }

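    private function voyageWdHTML_special(&$obj) {
        if ($obj["name"] == "a") { $obj["args"]["target"] = "_blank"; }
        if ($obj["name"] == "pre") {
            // Keep text children, turn <br> into a newline, drop everything else.
            // (array_filter returns a new array and cannot rewrite values, so build the list by hand.)
            $children = array();
            foreach ($obj["children"] as $var) {
                if (is_string($var)) { $children[] = $var; }
                elseif ($var["name"] == "br") { $children[] = "\n"; }
            }
            $obj["children"] = $children;
        }
    }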

    private function voyageWdHTML($tableau, $lvl = 0) {
        $html = "";
        foreach ($tableau as $obj) {
            if (is_array($obj)) {
                if (!$this->voyageWdHTML_allowtag($obj["name"])) {
                    $obj["name"] = "pre";
                    if (!isset($obj["children"])) {
                        $obj["children"] = array();
                    }
                }
                if (isset($obj["children"])) {
                    $this->voyageWdHTML_special($obj);
                    $html .= "<{$obj["name"]}{$this->voyageWdHTML_args($obj["args"], $obj["name"])}>{$this->voyageWdHTML($obj["children"], $lvl+1)}</{$obj["name"]}>";
                } else {
                    $html .= "<{$obj["name"]}>";
                }
            } else {
                $html .= $obj;
            }
        }
        return $html;
    }

    public function parseBadHTML($message) {
        $WdHTMLParser = new WdHTMLParser();
        $message = str_replace(array("<br>", "<hr>"), array("<br/>", "<hr/>"), $message);
        $tableau = $WdHTMLParser->parse($message);

        if ($WdHTMLParser->malformed) {
            $retour = $WdHTMLParser->error;
        } else {
            $retour = $this->voyageWdHTML($tableau);

            //$this->fix_javascript($retour); // optional extra pass, to be sure
        }

        return array($WdHTMLParser->malformed, $retour);
    }
}

WdHTMLParser class

<?php
class WdHTMLParser {
    private $encoding;
    private $matches;
    private $escaped;
    private $opened = array();
    public $malformed;
    public function parse($html, $namespace = NULL, $encoding = 'utf-8') {
        $this->malformed = false;
        $this->encoding  = $encoding;
        $html            = $this->escapeSpecials($html);
        $this->matches   = preg_split('#<(/?)' . $namespace . '([^>]*)>#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
        $tree            = $this->buildTree();
        if ($this->escaped) {
            $tree = $this->unescapeSpecials($tree);
        }
        return $tree;
    }
    private function escapeSpecials($html) {
        $html = preg_replace_callback('#<\!--.+-->#sU', array($this, 'escapeSpecials_callback'), $html);
        $html = preg_replace_callback('#<\?.+\?>#sU', array($this, 'escapeSpecials_callback'), $html);
        return $html;
    }
    private function escapeSpecials_callback($m) {
        $this->escaped = true;
        $text          = $m[0];
        $text          = str_replace(array('<', '>'), array("\x01", "\x02"), $text);
        return $text;
    }
    private function unescapeSpecials($tree) {
        return is_array($tree) ? array_map(array($this, 'unescapeSpecials'), $tree) : str_replace(array("\x01", "\x02"), array('<', '>'), $tree);
    }
    private function buildTree() {
        $nodes = array();
        $i     = 0;
        $text  = NULL;
        while (($value = array_shift($this->matches)) !== NULL) {
            switch ($i++ % 3) {
                case 0: {
                    if (trim($value)) {
                        $nodes[] = $value;
                    }
                }
                    break;
                case 1: {
                    $closing = ($value == '/');
                }
                    break;
                case 2: {
                    if (substr($value, -1, 1) == '/') {
                        $nodes[] = $this->parseMarkup(substr($value, 0, -1));
                    } else if ($closing) {
                        $open = array_pop($this->opened);
                        if ($value != $open) {
                            $this->error($value, $open);
                        }
                        return $nodes;
                    } else {
                        $node             = $this->parseMarkup($value);
                        $this->opened[]   = $node['name'];
                        $node['children'] = $this->buildTree($this->matches);
                        $nodes[]          = $node;
                    }
                }
            }
        }
        return $nodes;
    }
    public function parseMarkup($markup) {
        preg_match('#^[^\s]+#', $markup, $matches);
        $name = $matches[0];
        preg_match_all('#\s+([^=]+)\s*=\s*"([^"]+)"#', $markup, $matches, PREG_SET_ORDER);
        $args = array();
        foreach ($matches as $m) {
            $args[$m[1]] = html_entity_decode($m[2], ENT_QUOTES, $this->encoding);
        }
        return array('name' => $name, 'args' => $args);
    }
    public function error($markup, $expected) {
        $this->malformed = true;
        printf('unexpected closing markup "%s", should be "%s"', $markup, $expected);
    }
}

To be sure, you can also use this function (from mybb.com):

<?php
class Parser {
    private function fix_javascript(&$message) {
        $js_array = array(
            "#(&\#(0*)106;?|&\#(0*)74;?|&\#x(0*)4a;?|&\#x(0*)6a;?|j)((&\#(0*)97;?|&\#(0*)65;?|a)(&\#(0*)118;?|&\#(0*)86;?|v)(&\#(0*)97;?|&\#(0*)65;?|a)(\s)?(&\#(0*)115;?|&\#(0*)83;?|s)(&\#(0*)99;?|&\#(0*)67;?|c)(&\#(0*)114;?|&\#(0*)82;?|r)(&\#(0*)105;?|&\#(0*)73;?|i)(&\#112;?|&\#(0*)80;?|p)(&\#(0*)116;?|&\#(0*)84;?|t)(&\#(0*)58;?|\:))#i",
            "#(o)(nmouseover\s?=)#i",
            "#(o)(nmouseout\s?=)#i",
            "#(o)(nmousedown\s?=)#i",
            "#(o)(nmousemove\s?=)#i",
            "#(o)(nmouseup\s?=)#i",
            "#(o)(nclick\s?=)#i",
            "#(o)(ndblclick\s?=)#i",
            "#(o)(nload\s?=)#i",
            "#(o)(nsubmit\s?=)#i",
            "#(o)(nblur\s?=)#i",
            "#(o)(nchange\s?=)#i",
            "#(o)(nfocus\s?=)#i",
            "#(o)(nselect\s?=)#i",
            "#(o)(nunload\s?=)#i",
            "#(o)(nkeypress\s?=)#i"
        );

        $message = preg_replace($js_array, "$1<b></b>$2$4", $message);
    }
}
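The long first pattern above tries to enumerate entity-encoded spellings of "javascript:". An alternative worth considering is to decode the value first and then compare its scheme against a short allowlist. The following is a sketch in Python (standard library only), not MyBB's approach; ALLOWED_SCHEMES and is_safe_url are hypothetical names, and the policy shown is just an example.

```python
import html
import re
from urllib.parse import urlsplit

# Hypothetical policy for illustration.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# ASCII whitespace and control characters, which browsers may ignore inside a URL.
_IGNORED = re.compile(r"[\x00-\x20\x7f]")


def is_safe_url(raw_value):
    """Return True for relative URLs and for URLs with an allowlisted scheme."""
    value = html.unescape(raw_value)   # turn &#106; and friends into characters
    value = _IGNORED.sub("", value)    # defeat spellings like "java<TAB>script:"
    try:
        scheme = urlsplit(value).scheme.lower()
    except ValueError:
        return False                   # unparsable input is rejected outright
    return scheme == "" or scheme in ALLOWED_SCHEMES


for candidate in ("https://example.com/", "/relative/path",
                  "javascript:alert(1)", "&#106;avascript:alert(1)"):
    print(candidate, is_safe_url(candidate))
```

Checking what the scheme is, instead of listing every way a bad one can be written, follows the same whitelist reasoning as the tag and attribute lists in the Parser class.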

Answer 6

I decided to just use html5lib-python. This is what I came up with:

#!/usr/bin/env python
import sys
from xml.dom.minidom import Node
import html5lib
from html5lib import (HTMLParser, sanitizer, serializer, treebuilders,
                     treewalkers)

parser = HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                    tree=treebuilders.getTreeBuilder("dom"))
serializer = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False)

document = parser.parse(sys.stdin.read(), encoding="utf-8")
# find the <html> node
for child in document.childNodes:
    if child.nodeType == Node.ELEMENT_NODE and child.nodeName == 'html':
        htmlNode = child 
# find the <body> node
for child in htmlNode.childNodes:
    if child.nodeType == Node.ELEMENT_NODE and child.nodeName == 'body':
        bodyNode = child
# serialize all children of the <body> node
for child in bodyNode.childNodes:
    stream = treewalkers.getTreeWalker("dom")(child)
    sys.stdout.write(serializer.render(stream, encoding="utf-8"))

Example input:

<script>alert("hax")</script>
<p onload="alert('this is a dangerous attribute')"><b>hello,</b> world</p>

Example output:

&lt;script&gt;alert("hax")&lt;/script&gt;
<p><b>hello,</b> world</p>

Answer 7

I personally use HTML Purifier for this exact purpose:

http://htmlpurifier.org/docs

It works well and allows you to customize down to every tag and attribute. So far I have had no security issues with this library.

7 answers

Answer 1: 6, 2014-07-04 12:40:17
Answer 2: 2, 2014-07-02 13:32:04
Answer 3: 2, 2014-07-02 13:37:05
Answer 4: 2, 2014-07-07 11:31:31
Answer 5: 1, 2014-07-08 18:09:50
Answer 6: 0, 2014-07-08 21:33:06
Answer 7: 0, 2014-07-08 22:12:31
