How to strip unwanted HTML tags from user input but keep the text inside the tags in PHP using DOMDocument

Question

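I have around 2 million HTML pages stored in S3 that contain various HTML. I'm trying to extract only the content from those stored pages, but I want to retain the HTML structure under certain constraints. This HTML is all user-supplied input and should be considered unsafe. So, for display purposes, I want to retain only some of the HTML tags, with constraints on attributes and attribute values, while still keeping all of the properly encoded text content, even the text inside disallowed tags.

For example, I'd like to allow only specific tags such as <p>, <h1>, <h2>, <h3>, <ul>, <ol>, <li>, etc. But I also want to keep whatever text is found between disallowed tags and maintain its structure. I would also like to be able to restrict the attributes in each tag, or force certain attributes to be applied to specific tags.

For example, in the following HTML...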

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>

I'd like the result to be...

  Some text...
  <p>Hello PHP!</p>

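That is, the unwanted <div> and <span> tags are stripped, the unwanted attributes of all tags are stripped, and the text inside <div> and <span> is still kept.

Simply using strip_tags() won't work here. So I tried doing the following with DOMDocument.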

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach($dom->childNodes as $node) {
    if ($node->nodeName != "p") { // only allow paragraph tags
        $text = $node->nodeValue;
        $node->parentNode->nodeValue .= $text;
        $node->parentNode->removeChild($node);
    }
}

echo $dom->saveHTML();

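This works for simple cases where there are no nested tags, but it obviously fails when the HTML is complex.

I can't simply call this function recursively on each of the node's child nodes, because if I delete the node I lose all of its further nested children. Even if I defer the node deletion until after the recursion, the order of text insertion becomes tricky: I try to go deep and return all valid nodes, then start concatenating the values of the invalid child nodes together, and the result is really messy.

For example, let's say I want to allow <p> and <em> in the following HTML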

<p>Hello <strong>there <em>PHP</em>!</strong></p>

But I don't want to allow <strong>. If the <strong> has a nested <em>, my approach gets really confusing, because I'd get something like...

<p>Hello there !<em>PHP</em></p>

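Which is obviously wrong. I realized that taking the entire nodeValue is a bad way of doing this. So instead I started digging into other ways to go through the entire tree one node at a time. I'm just finding it very difficult to generalize this solution so that it works sanely every time.

Update

A solution using strip_tags(), or the answer provided here, isn't helpful for my use case, because the former does not allow me to control the attributes and the latter removes any tag that has attributes. I don't want to remove every tag that has an attribute. I want to explicitly allow certain tags but still have extensible control over which attributes can be kept or modified in the HTML.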
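To make the attribute limitation concrete, here is a minimal sketch of what strip_tags() does with an allowed tag that carries attributes. The input string and its onclick attribute are made up for illustration and are not taken from the stored pages.

```php
<?php
// Illustrative input; the onclick attribute is a made-up example of unsafe markup.
$input = '<div><p class="someclass" onclick="evil()">Hello <span>PHP</span>!</p></div>';

// Only <p> is allowed. <div> and <span> are removed and their text is kept,
// but the attributes on the surviving <p> are left untouched.
$output = strip_tags($input, '<p>');

echo $output, PHP_EOL;
// prints: <p class="someclass" onclick="evil()">Hello PHP!</p>
```

The allowed tag is passed through verbatim, which is why strip_tags() alone cannot enforce an attribute allowlist.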

Answer 1

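It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.

First: walking the DOM tree

To get to a working solution, I found I needed a sensible way to traverse every node in the DOM tree and inspect it, in order to determine whether it should be kept as-is or modified.

So I wrote the following method as a simple generator on a class that extends DOMDocument.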

class HTMLFixer extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}

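This way, doing something like foreach ($dom->walk($dom) as $node) gives me a simple loop that traverses the entire tree. Of course, this is a PHP 7 only solution because of the yield from syntax, but I'm OK with that.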
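As a rough illustration of the traversal order, the sketch below uses a hypothetical standalone function, walkNodes(), that mirrors the walk() method above. The function name and the sample markup are assumptions made for this example only.

```php
<?php
// Hypothetical standalone version of the walk() generator, for illustration only.
function walkNodes(DOMNode $node): Generator {
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            // Delegate to the child's generator: depth-first, document order.
            yield from walkNodes($child);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <em>PHP</em>!</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Collect node names in the order the generator yields them.
$names = [];
foreach (walkNodes($dom) as $node) {
    $names[] = $node->nodeName;
}

echo implode(' ', $names), PHP_EOL;
// prints: #document p #text em #text #text
```

Note that collecting the generator with iterator_to_array() would need its second argument set to false, because yield from preserves the keys of the inner generators and they would otherwise overwrite each other.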

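Second: removing tags but keeping their text

The tricky part was figuring out how to keep the text, but not the tag, while making modifications inside the loop. After struggling with a few different approaches, I found the simplest way was to build a list of the tags to be removed from inside the loop, and then afterwards use DOMNode::insertBefore() to move their child nodes up the tree. That way, removing those nodes later has no side effects.

So I added another generalized stripTags method to this subclass of DOMDocument.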

public function stripTags(DOMNode $node) {
    $change = $remove = [];
    
    /* Walk the entire tree to build a list of things that need to be removed */
    foreach($this->walk($node) as $n) {
        if ($n instanceof DOMText || $n instanceof DOMDocument) {
            continue;
        }
        $this->stripAttributes($n); // strips all node attributes not allowed
        $this->forceAttributes($n); // forces any required attributes
        if (!in_array($n->nodeName, $this->allowedTags, true)) {
            // track the disallowed node for removal
            $remove[] = $n;
            // we take all of its child nodes for modification later
            foreach($n->childNodes as $child) {
                $change[] = [$child, $n];
            }
        }
    }
    
    /* Go through the list of changes first so we don't break the
       referential integrity of the tree */
    foreach($change as list($a, $b)) {
        $b->parentNode->insertBefore($a, $b);
    }

    /* Now we can safely remove the old nodes */
    foreach($remove as $a) {
        if ($a->parentNode) {
            $a->parentNode->removeChild($a);
        }
    }
}

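The trick here is that, because we use insertBefore on the child nodes of the disallowed tags (text nodes, for example) to move them up to the parent tag, we could easily break the tree while we are still walking it. This confused me a lot at first, but looking at the way the method works, it makes sense. Deferring the move of the nodes makes sure we don't break the parentNode references when, for example, a deeper node is allowed but its parent is not in the allowed tags list.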
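The move-then-remove idea can also be seen in isolation on a single element. The following is a minimal sketch, assuming a hypothetical helper named unwrapElement() that is not part of the class in this answer; it is applied to the <strong> example from the question.

```php
<?php
// Hypothetical helper: replaces an element with its own children, in order.
function unwrapElement(DOMElement $element): void {
    $parent = $element->parentNode;
    if ($parent === null) {
        return;
    }
    // insertBefore() moves an existing node rather than copying it,
    // so firstChild points at the next remaining child on each pass.
    while ($element->firstChild !== null) {
        $parent->insertBefore($element->firstChild, $element);
    }
    $parent->removeChild($element);
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Snapshot the matches first: the node list is live and changes as the tree is edited.
$targets = iterator_to_array($dom->getElementsByTagName('strong'));
foreach ($targets as $strong) {
    unwrapElement($strong);
}

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
// prints: <p>Hello there <em>PHP</em>!</p>
```

Because the children are moved as whole nodes rather than flattened through nodeValue, the nested <em> keeps its position between the surrounding text.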

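Complete solution

Here's the complete solution I came up with to solve this problem more generally. I'm including it in my answer because I struggled to find coverage elsewhere of many of the edge cases involved in doing this with DOMDocument. It lets you specify which tags to allow, and all other tags are removed. It also lets you specify which attributes are allowed, with all other attributes removed, and can even force certain attributes onto certain tags.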

class HTMLFixer extends DOMDocument {
    protected static $defaultAllowedTags = [
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'pre',
        'code',
        'blockquote',
        'q',
        'strong',
        'em',
        'del',
        'img',
        'a',
        'table',
        'thead',
        'tbody',
        'tfoot',
        'tr',
        'th',
        'td',
        'ul',
        'ol',
        'li',
    ];
    protected static $defaultAllowedAttributes = [
        'a'   => ['href'],
        'img' => ['src'],
        'pre' => ['class'],
    ];
    protected static $defaultForceAttributes = [
        'a' => ['target' => '_blank'],
    ];

    protected $allowedTags       = [];
    protected $allowedAttributes = [];
    protected $forceAttributes   = [];

    public function __construct($version = null, $encoding = null, $allowedTags = [],
                                $allowedAttributes = [], $forceAttributes = []) {
        $this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
        $this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
        $this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
        parent::__construct($version, $encoding);
    }

    public function setAllowedTags(Array $tags) {
        $this->allowedTags = $tags;
    }

    public function setAllowedAttributes(Array $attributes) {
        $this->allowedAttributes = $attributes;
    }

    public function setForceAttributes(Array $attributes) {
        $this->forceAttributes = $attributes;
    }

    public function getAllowedTags() {
        return $this->allowedTags;
    }

    public function getAllowedAttributes() {
        return $this->allowedAttributes;
    }

    public function getForceAttributes() {
        return $this->forceAttributes;
    }

    public function saveHTML(DOMNode $node = null) {
        if (!$node) {
            $node = $this;
        }
        $this->stripTags($node);
        return parent::saveHTML($node);
    }

    protected function stripTags(DOMNode $node) {
        $change = $remove = [];
        foreach($this->walk($node) as $n) {
            if ($n instanceof DOMText || $n instanceof DOMDocument) {
                continue;
            }
            $this->stripAttributes($n);
            $this->forceAttributes($n);
            if (!in_array($n->nodeName, $this->allowedTags, true)) {
                $remove[] = $n;
                foreach($n->childNodes as $child) {
                    $change[] = [$child, $n];
                }
            }
        }
        foreach($change as list($a, $b)) {
            $b->parentNode->insertBefore($a, $b);
        }
        foreach($remove as $a) {
            if ($a->parentNode) {
                $a->parentNode->removeChild($a);
            }
        }
    }

    protected function stripAttributes(DOMNode $node) {
        // Nodes such as comments have no attribute map, so skip them
        if (!$node->hasAttributes()) {
            return;
        }
        $attributes = $node->attributes;
        $len = $attributes->length;
        // Iterate backwards because removing an attribute shifts the indexes
        for ($i = $len - 1; $i >= 0; $i--) {
            $attr = $attributes->item($i);
            if (!isset($this->allowedAttributes[$node->nodeName]) ||
                !in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    protected function forceAttributes(DOMNode $node) {
        if (isset($this->forceAttributes[$node->nodeName])) {
            foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
                $node->setAttribute($attribute, $value);
            }
        }
    }

    protected function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}

So if we have the following HTML

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>

And we only want to allow <p> and <em>.

$html = <<<'HTML'
    <div id="content">
      Some text...
      <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
    </div>
HTML;

$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom);

We'd get something like this...

Some text...
      <p>Hello P<em>H</em>P!</p>

Since you can also limit this to a specific subtree within the DOM, the solution can be generalized even further.

Answer 2

You can use strip_tags() like this:

$html = '<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>';
// The second parameter lists the HTML tags that should be retained.
$updatedHTML = strip_tags($html, "<p><h1><h2><h3><ul><ol><li>");

You can get more information here: http://php.net/manual/en/function.strip-tags.php

Answer 1: score 2, accepted, 2016-09-13 11:45:34
Answer 2: score 0, 2016-09-13 10:45:42