
How to remove unwanted HTML tags from user input but keep text inside the tags in PHP using DOMDocument

I have about 2 million HTML pages stored in S3 that contain various HTML. I'm trying to extract only the content from those stored pages, but I wish to retain the HTML structure with certain constraints. This HTML is all user-supplied input and should be considered unsafe. So for display purposes, I want to retain only some of the HTML tags, with constraints on attributes and attribute values, but still retain all of the properly encoded text content, even inside disallowed tags.

For example, I'd like to allow only specific tags like <p>, <h1>, <h2>, <h3>, <ul>, <ol>, <li>, etc. But I also want to keep whatever text is found inside disallowed tags and maintain its structure. I also want to be able to restrict the attributes on each tag or force certain attributes onto specific tags.

For example, given the following HTML...

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>

I'd like the result to be...

  Some text...
  <p>Hello PHP!</p>

This strips out the unwanted <div> and <span> tags and the unwanted attributes of all tags, while still keeping the text inside <div> and <span>.

Simply using strip_tags() won't work here. So I tried doing the following with DOMDocument.

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach($dom->childNodes as $node) {
    if ($node->nodeName != "p") { // only allow paragraph tags
        $text = $node->nodeValue;
        $node->parentNode->nodeValue .= $text;
        $node->parentNode->removeChild($node);
    }
}

echo $dom->saveHTML();

This works in simple cases where there are no nested tags, but obviously fails when the HTML is complex.

I can't simply call this function recursively on each of the node's child nodes, because if I delete the node I lose all of its nested children. Even if I defer node deletion until after the recursion, the order of text insertion becomes tricky: I go deep, return all valid nodes, and then start concatenating the values of the invalid child nodes together, and the result is really messy.

For example, let's say I want to allow <p> and <em> in the following HTML:

<p>Hello <strong>there <em>PHP</em>!</strong></p>

But I don't want to allow <strong>. If the <strong> has a nested <em>, my approach gets really confusing, because I'd get something like...

<p>Hello there !<em>PHP</em></p>

Which is obviously wrong. I realized that taking the entire nodeValue is a bad way of doing this, so I started looking into other ways to go through the entire tree one node at a time. I'm just finding it very difficult to generalize this solution so that it works sanely every time.
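To make the nodeValue problem concrete, here is a minimal sketch (the variable names are only for illustration). It shows that an element's nodeValue is the flattened text of its whole subtree, so the nested <em> is lost, while childNodes still holds the separate pieces in document order:

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
$strong = $dom->getElementsByTagName('strong')->item(0);

// nodeValue concatenates all descendant text; the <em> element is gone
$flat = $strong->nodeValue;
echo $flat, "\n"; // there PHP!

// The children are still distinct nodes, in document order
$kinds = [];
foreach ($strong->childNodes as $child) {
    $kinds[] = $child->nodeName;
}
echo implode(', ', $kinds), "\n"; // #text, em, #text
```

So any approach that keeps the structure has to move the child nodes themselves rather than copy their text.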

Update

A solution using strip_tags(), or the answer provided here, isn't helpful for my use case, because the former does not allow me to control the attributes and the latter removes any tag that has attributes. I don't want to remove any tag that has an attribute. I want to explicitly allow certain tags but still have extensible control over which attributes can be kept or modified in the HTML.
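As a quick illustration of the attribute problem with strip_tags() (the onclick attribute is added here purely for demonstration), note that an allowed tag is passed through with all of its attributes intact:

```php
<?php
$html = '<p class="someclass" onclick="alert(1)">Hello '
      . '<span style="color: purple;">PHP</span>!</p>';

// Only <p> is allowed, so <span> is dropped but its text is kept
$result = strip_tags($html, '<p>');

echo $result, "\n";
// <p class="someclass" onclick="alert(1)">Hello PHP!</p>
```

The text is preserved as desired, but the class and onclick attributes survive, which is not acceptable for untrusted input.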

It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.

First, Walking the DOM Tree

To get to a working solution, I found I needed a sensible way to traverse every node in the DOM tree and inspect it to determine whether it should be kept as-is or modified.

So I wrote the following method as a simple generator in a class extending DOMDocument.

class HTMLFixer extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}

This way, doing something like foreach($dom->walk($dom) as $node) gives me a simple loop to traverse the entire tree. Of course this requires PHP 7 or later because of the yield from syntax, but I'm OK with that.
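As a small usage sketch of the generator, the following collects node names in the order they are visited. The class name WalkDemo is hypothetical and only exists to keep the example self-contained; the walk() body is the same as above.

```php
<?php
// WalkDemo is a hypothetical stand-in for the class shown above
class WalkDemo extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}

$dom = new WalkDemo;
$dom->loadHTML('<p>Hello <em>PHP</em>!</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Depth-first, parents before children
$names = [];
foreach ($dom->walk($dom) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), "\n";
// #document p #text em #text #text
```

One thing to watch for: yield from keeps the keys of the inner generators, so the keys repeat. If you collect the nodes with iterator_to_array(), pass false as the second argument so later nodes don't overwrite earlier ones.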

Second, Removing Tags but Keeping their Text

The tricky part was figuring out how to keep the text and not the tag while making modifications inside the loop. After struggling with a few different approaches, I found the simplest way was to build a list of tags to be removed from inside the loop and then remove them later, using DOMNode::insertBefore() to move their child nodes up the tree first. That way removing those nodes later has no side effects.

So I added another generalized stripTags method to this child class of DOMDocument.

public function stripTags(DOMNode $node) {
    $change = $remove = [];
    
    /* Walk the entire tree to build a list of things that need to be removed */
    foreach($this->walk($node) as $n) {
        if ($n instanceof DOMText || $n instanceof DOMDocument) {
            continue;
        }
        $this->stripAttributes($n); // strips all node attributes not allowed
        $this->forceAttributes($n); // forces any required attributes
        if (!in_array($n->nodeName, $this->allowedTags, true)) {
            // track the disallowed node for removal
            $remove[] = $n;
            // we take all of its child nodes for modification later
            foreach($n->childNodes as $child) {
                $change[] = [$child, $n];
            }
        }
    }
    
    /* Go through the list of changes first so we don't break the
       referential integrity of the tree */
    foreach($change as list($a, $b)) {
        $b->parentNode->insertBefore($a, $b);
    }

    /* Now we can safely remove the old nodes */
    foreach($remove as $a) {
        if ($a->parentNode) {
            $a->parentNode->removeChild($a);
        }
    }
}

The trick here is that we use insertBefore() on the child nodes (e.g. text nodes) of the disallowed tags to move them up to the parent tag. Because insertBefore() moves a node that is already in the tree rather than copying it, doing this while still walking could easily break the tree. This confused me a lot at first, but looking at the way the method works, it makes sense. Deferring the move makes sure we don't break the parentNode reference when, for example, a deeper node is allowed but its parent is not in the allowed tags list.
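The core move-then-remove step can be sketched in isolation for a single element. The helper name unwrapNode is hypothetical (it is not part of the class above); it just shows how insertBefore() relocates each child in front of the disallowed element before that element is removed:

```php
<?php
// Hypothetical helper: replace an element with its own children
function unwrapNode(DOMNode $node): void {
    $parent = $node->parentNode;
    // insertBefore() moves the child, so firstChild advances on each pass
    while ($node->firstChild) {
        $parent->insertBefore($node->firstChild, $node);
    }
    // The element is now empty and can be dropped safely
    $parent->removeChild($node);
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

unwrapNode($dom->getElementsByTagName('strong')->item(0));

$out = trim($dom->saveHTML());
echo $out, "\n";
// <p>Hello there <em>PHP</em>!</p>
```

This is the result the earlier nodeValue attempt could not produce: the <em> stays in place between the two text nodes.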

Complete Solution

Here's the complete solution I came up with to solve this problem more generally. I'm including it in my answer since I struggled to find coverage elsewhere of many of the edge cases involved in doing this with DOMDocument. It lets you specify which tags to allow; all other tags are removed. It also lets you specify which attributes are allowed; all other attributes are removed, and certain attributes can be forced onto certain tags.

class HTMLFixer extends DOMDocument {
    protected static $defaultAllowedTags = [
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'pre',
        'code',
        'blockquote',
        'q',
        'strong',
        'em',
        'del',
        'img',
        'a',
        'table',
        'thead',
        'tbody',
        'tfoot',
        'tr',
        'th',
        'td',
        'ul',
        'ol',
        'li',
    ];
    protected static $defaultAllowedAttributes = [
        'a'   => ['href'],
        'img' => ['src'],
        'pre' => ['class'],
    ];
    protected static $defaultForceAttributes = [
        'a' => ['target' => '_blank'],
    ];

    protected $allowedTags       = [];
    protected $allowedAttributes = [];
    protected $forceAttributes   = [];

    public function __construct($version = null, $encoding = null, $allowedTags = [],
                                $allowedAttributes = [], $forceAttributes = []) {
        $this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
        $this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
        $this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
        // Fall back to the parent's defaults; passing null is deprecated on PHP 8.1+
        parent::__construct($version ?? '1.0', $encoding ?? '');
    }

    public function setAllowedTags(Array $tags) {
        $this->allowedTags = $tags;
    }

    public function setAllowedAttributes(Array $attributes) {
        $this->allowedAttributes = $attributes;
    }

    public function setForceAttributes(Array $attributes) {
        $this->forceAttributes = $attributes;
    }

    public function getAllowedTags() {
        return $this->allowedTags;
    }

    public function getAllowedAttributes() {
        return $this->allowedAttributes;
    }

    public function getForceAttributes() {
        return $this->forceAttributes;
    }

    #[\ReturnTypeWillChange] // avoids the PHP 8.1+ return-type deprecation; a plain comment on PHP 7
    public function saveHTML(?DOMNode $node = null) {
        if (!$node) {
            $node = $this;
        }
        $this->stripTags($node);
        return parent::saveHTML($node);
    }

    protected function stripTags(DOMNode $node) {
        $change = $remove = [];
        foreach($this->walk($node) as $n) {
            if ($n instanceof DOMText || $n instanceof DOMDocument) {
                continue;
            }
            $this->stripAttributes($n);
            $this->forceAttributes($n);
            if (!in_array($n->nodeName, $this->allowedTags, true)) {
                $remove[] = $n;
                foreach($n->childNodes as $child) {
                    $change[] = [$child, $n];
                }
            }
        }
        foreach($change as list($a, $b)) {
            $b->parentNode->insertBefore($a, $b);
        }
        foreach($remove as $a) {
            if ($a->parentNode) {
                $a->parentNode->removeChild($a);
            }
        }
    }

    protected function stripAttributes(DOMNode $node) {
        // Nodes such as comments have no attributes, so there is nothing to strip
        if (!$node->hasAttributes()) {
            return;
        }
        $attributes = $node->attributes;
        // Iterate backwards because removing an attribute shrinks the live map
        for ($i = $attributes->length - 1; $i >= 0; $i--) {
            $attr = $attributes->item($i);
            if (!isset($this->allowedAttributes[$node->nodeName]) ||
                !in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    protected function forceAttributes(DOMNode $node) {
        if (isset($this->forceAttributes[$node->nodeName])) {
            foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
                $node->setAttribute($attribute, $value);
            }
        }
    }

    protected function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}
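Note that the class above filters attributes by name only. Since the input is untrusted, you would probably also want to constrain attribute values, for example to reject javascript: URLs in href. The following is a rough sketch of one way to do that, not a complete sanitizer; the function name isSafeUrl and the list of allowed schemes are assumptions made for illustration.

```php
<?php
// Hypothetical helper: allow relative URLs and a short list of schemes
function isSafeUrl(string $url, array $schemes = ['http', 'https', 'mailto']): bool {
    // Browsers ignore whitespace and control characters in a scheme,
    // so strip them before inspecting it
    $clean = preg_replace('/[\x00-\x20\x7f]+/', '', $url);
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        return in_array(strtolower($m[1]), $schemes, true);
    }
    return true; // no scheme at all: treat as a relative URL
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p><a href="javascript:alert(1)">bad</a> <a href="https://example.com/">good</a></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Drop href from any link whose URL fails the check
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href') && !isSafeUrl($a->getAttribute('href'))) {
        $a->removeAttribute('href');
    }
}

$out = $dom->saveHTML();
echo $out;
```

A check like this could be called from stripAttributes() for the href and src attributes, so that names and values are filtered in the same pass.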

So if we have the following HTML

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>

And we only want to allow <p> and <em>.

$html = <<<'HTML'
    <div id="content">
      Some text...
      <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
    </div>
HTML;

$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom);

We'd get something like this...

Some text...
      <p>Hello P<em>H</em>P!</p>

Since you can also limit this to a specific subtree in the DOM, the solution could be generalized even more.

You can use strip_tags() like this:

$html = '<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>';
// The second parameter lists the HTML tags to retain
$updatedHTML = strip_tags($html, "<p><h1><h2><h3><ul><ol><li>");

You can get more information here: http://php.net/manual/en/function.strip-tags.php
