Specifying UTF-8 encoding for PHP's DOMDocument without using a meta tag

Question

I have the following HTML, which contains Chinese text inside a script tag and some HTML code inside code tags.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
        <pre>
            <code>&lt;img src="cat.jpg"/></code>
        </pre>
        <p>The code for new line is <code>&lt;br/></code> in HTML.</p>
    </body>
</html>
EOD;

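I am parsing this code with PHP's DOMDocument. After saveHTML(), the Chinese characters are somehow converted into strange characters. The only solution I have found is to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> to the <head> tag.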

Is there any other way to specify UTF-8 encoding without adding this meta tag?

Here is everything I have tried (none of it works):

// Default way. Chinese characters got encoded
$doc = new DOMDocument();
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Passed UTF-8 as parameter. Chinese characters got encoded
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Set encoding. Chinese characters got encoded
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Using mb_convert_encoding. Chinese characters got encoded
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Use html_entity_decode to decode. But this also decodes the entities inside the code tags
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo html_entity_decode($doc->saveHTML()) . PHP_EOL . PHP_EOL;

Answer 1

If you want to unconditionally override the encoding with UTF-8, you can do so by prepending a UTF-8 BOM to the document:

$doc = new DOMDocument();
$doc->loadHTML(str_starts_with($html, "\xEF\xBB\xBF")
    ? $html : ("\xEF\xBB\xBF" . $html));

The conditional expression is necessary because the library emits a warning if a double BOM appears at the start.
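The BOM approach above can be wrapped in a small helper. This is only a sketch: the function name loadHtmlForcingUtf8 is made up for illustration and is not part of DOMDocument. It uses strncmp() instead of str_starts_with(), because str_starts_with() only exists from PHP 8.0 onwards.

```php
<?php
// Hypothetical helper (not a DOMDocument API): parse HTML as UTF-8
// by making sure the input starts with exactly one UTF-8 BOM.
function loadHtmlForcingUtf8(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";

    // Prepend the BOM only if it is missing, to avoid a double BOM.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    return $doc;
}

$doc = loadHtmlForcingUtf8(
    '<!DOCTYPE html><html><head><script>const str = "訂閱最新指南";</script></head><body></body></html>'
);

// The DOM stores strings as UTF-8, so the script text can be read back directly.
$scriptText = $doc->getElementsByTagName('script')->item(0)->textContent;
```

Reading textContent only confirms that the parser decoded the input as UTF-8; how the characters appear after serialization still depends on how saveHTML() is called.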

If you only want UTF-8 as the default encoding instead of latin1, there is no clean way to do it. You can, however, use the following dirty trick:

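$doc = new DOMDocument();
$doc->loadHTML($html);
if ($doc->encoding === null) {
    // No encoding was detected: re-parse with a declaration that forces UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    // Find the processing instruction left in the tree and remove it.
    $node = $doc->firstChild;
    while ($node !== null && !($node instanceof DOMProcessingInstruction)) {
        $node = $node->nextSibling;
    }
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }
}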

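The above has the unfortunate side effect that parsing time is effectively doubled when the document lacks an encoding declaration. (Note also that the HTML specification does not prescribe looking at <?xml?> processing instructions to detect the character encoding, which means this workaround relies on behavior that runs contrary to the specification.)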

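To make sure characters are not mangled when serializing back to markup, use $doc->saveHTML($doc) instead of $doc->saveHTML(). This always produces UTF-8 text, even if the document contains a declaration specifying a different encoding. To get the document in another encoding, you have to convert it afterwards, for example with mb_convert_encoding($doc->saveHTML($doc), $doc->xmlEncoding, 'utf-8') (which should convert to the original encoding, although even then the result may contradict a <meta> element in the actual DOM tree).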
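The serialization step described above might be sketched as follows. The helper name saveHtmlInDeclaredEncoding is hypothetical, and the sketch assumes the mbstring extension is loaded and recognizes whatever encoding name the document reports. It reads $doc->encoding rather than $doc->xmlEncoding, on the assumption that both report the same value for a document loaded with loadHTML().

```php
<?php
// Hypothetical helper (not a DOMDocument API): serialize the whole document
// as UTF-8, then convert to the encoding the document declares, if any.
function saveHtmlInDeclaredEncoding(DOMDocument $doc): string
{
    // Passing the document itself as the node argument, as suggested above.
    $utf8 = $doc->saveHTML($doc);
    if ($utf8 === false) {
        throw new RuntimeException('Serialization failed');
    }

    $target = $doc->encoding;
    if ($target === null || strcasecmp($target, 'UTF-8') === 0) {
        // Nothing to convert: keep the UTF-8 text as it is.
        return $utf8;
    }

    // Assumption: mbstring knows the encoding name reported by the parser.
    return mb_convert_encoding($utf8, $target, 'UTF-8');
}

$doc = new DOMDocument();
$doc->loadHTML("\xEF\xBB\xBF" . '<!DOCTYPE html><html><body><p>訂閱最新指南</p></body></html>');
$out = saveHtmlInDeclaredEncoding($doc);
```

As the answer notes, the converted output can still disagree with a <meta> element inside the tree; this sketch does not try to reconcile the two.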

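Given the number of workarounds needed to use DOMDocument with anything approaching reliability, I strongly recommend switching to another parser. Preferably in another programming language as well.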

Answer 2

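I would simply add the meta tag as you suggested, since I don't know what is holding you back from doing so; all I know is that this works fine for me: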

$meta='<meta content="text/html; charset=utf-8" http-equiv="Content-Type">';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->encoding = 'UTF-8';
$doc->loadHTML($meta.$html); /* DOMDocument will put the meta at the right place */

echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

Or, to strip the meta tag from the output again:

echo str_replace($meta,'',$doc->saveHTML()) . PHP_EOL . PHP_EOL;

Answer 1
0 2021-03-15 09:55:07

Answer 2
-1 2021-03-21 21:43:53
