Specifying UTF-8 encoding for PHP's DOMDocument without using a meta tag

Question

I have the following HTML, which contains a Chinese string inside a script tag and some HTML code inside code tags.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
        <pre>
            <code>&lt;img src="cat.jpg"/></code>
        </pre>
        <p>The code for new line is <code>&lt;br/></code> in HTML.</p>
    </body>
</html>
EOD;

I'm parsing this code with PHP's DOMDocument. After saveHTML(), the Chinese characters come out garbled. The only solution I have found is to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> to the <head> tag.

Is there any other way to specify UTF-8 encoding without adding this meta tag?

Here is everything I've tried (none of it works):

// Default way. Chinese characters got encoded
$doc = new DOMDocument();
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Passed UTF-8 as parameter. Chinese characters got encoded
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Set encoding. Chinese characters got encoded
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Using mb_convert_encoding. Chinese characters got encoded
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Use html_entity_decode to decode. But this also decodes the entities inside the code tags
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo html_entity_decode($doc->saveHTML()) . PHP_EOL . PHP_EOL;

Answer 1

If you want to unconditionally override the encoding to UTF-8, you can do it by prepending the UTF-8 BOM to the input:

$doc = new DOMDocument();
$doc->loadHTML(str_starts_with($html, "\xEF\xBB\xBF")
    ? $html : ("\xEF\xBB\xBF" . $html));

The conditional expression is necessary because the library emits warnings if a double BOM is present at the beginning.
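For reuse, the BOM check can be wrapped in a small helper. The following is only a sketch: ensureUtf8Bom() is a hypothetical name, not part of DOMDocument, and the sample markup is made up for illustration. It uses strncmp() instead of str_starts_with() so that it does not require PHP 8.

```php
<?php

/**
 * Return $html with exactly one leading UTF-8 BOM.
 * Hypothetical helper for illustration; not a DOMDocument API.
 */
function ensureUtf8Bom(string $html): string
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() compares at most the first 3 bytes, so an empty or short
    // string simply counts as "no BOM present" and gets one prepended.
    return strncmp($html, $bom, 3) === 0 ? $html : $bom . $html;
}

// Illustrative input with no encoding declaration at all.
$html = '<!DOCTYPE html><html><head><script>const str = "訂閱最新指南";</script></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML(ensureUtf8Bom($html));

// Serialise by passing the document node to saveHTML().
echo $doc->saveHTML($doc), PHP_EOL;
```

Calling the helper twice on the same string is harmless, because the second call finds the BOM already in place and returns the input unchanged.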

If you merely want to have UTF-8 as the default encoding instead of latin1, there is no clean way to do that. You can use the following dirty hack, though:

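$doc = new DOMDocument();
$doc->loadHTML($html);
if ($doc->encoding === null) {
    // No encoding was detected: parse again with an XML-style encoding hint prepended.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    // Find the processing instruction created by the hint and remove it from the tree.
    $node = $doc->firstChild;
    while ($node !== null && !($node instanceof DOMProcessingInstruction)) {
        $node = $node->nextSibling;
    }
    // Guard against the case where no processing instruction node was found.
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }
}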

The above has the unfortunate side effect that when the encoding declaration is missing from the file, the parse time is effectively doubled. (Also note that the HTML specification does not prescribe looking at <?xml?> processing instructions to detect the character encoding, meaning this workaround relies on functionality contrary to the specification.)

To make sure characters are not mangled during serialisation back to markup, use $doc->saveHTML($doc) instead of $doc->saveHTML(). This will always result in UTF-8 text, even if the document contains a declaration specifying a different encoding. To obtain the document in another encoding, you will have to convert it afterwards, for example by doing mb_convert_encoding($doc->saveHTML($doc), $doc->xmlEncoding, 'utf-8') (which should convert to the original encoding, although even this may still contradict a <meta> element found in the actual DOM tree).
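The serialise-then-convert step described above could be packaged roughly as follows. This is a sketch under assumptions: serializeAs() is a hypothetical helper name, the fallback to UTF-8 when $doc->xmlEncoding is null is my own choice, and UTF-16LE is used as the target only because it makes the conversion easy to observe.

```php
<?php

/**
 * Serialise $doc with saveHTML($doc), then convert the result to $target.
 * Hypothetical helper for illustration; not a DOMDocument API.
 */
function serializeAs(DOMDocument $doc, ?string $target = null): string
{
    // Passing the document node is the serialisation recommended above.
    $utf8 = $doc->saveHTML($doc);

    // Assumption: fall back to the declared encoding, then to UTF-8.
    $target = $target ?? $doc->xmlEncoding ?? 'UTF-8';
    if (strcasecmp($target, 'UTF-8') === 0) {
        return $utf8;
    }

    // Note: mb_convert_encoding() rejects encoding names it does not know,
    // so a declared encoding unknown to mbstring needs separate handling.
    return mb_convert_encoding($utf8, $target, 'UTF-8');
}

// Illustrative input, prefixed with a UTF-8 BOM as described earlier.
$doc = new DOMDocument();
$doc->loadHTML("\xEF\xBB\xBF" . '<!DOCTYPE html><html><head><title>t</title></head><body><p>訂閱最新指南</p></body></html>');

$utf8  = serializeAs($doc, 'UTF-8');
$utf16 = serializeAs($doc, 'UTF-16LE');
```

As the answer points out, the converted output can still disagree with a <meta> element inside the tree, so the helper does not make the declared and actual encodings consistent by itself.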

Given the number of workarounds necessary to use DOMDocument with anything approaching reliability, I'd strongly suggest switching to another parser. Preferably, to another programming language too.

Answer 2

I would simply add the meta tag as you've suggested. I don't know what is stopping you from doing that, but this worked fine for me:

$meta='<meta content="text/html; charset=utf-8" http-equiv="Content-Type">';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->encoding = 'UTF-8';
$doc->loadHTML($meta.$html); /* DOMDocument will put the meta at the right place */

echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

or

echo str_replace($meta,'',$doc->saveHTML()) . PHP_EOL . PHP_EOL;

2 answers

Answer 1
0 2021-03-15 09:55:07

Answer 2
-1 2021-03-21 21:43:53
