
Specify UTF-8 encoding to PHP's DOMDocument without meta tag

I have the following HTML, which contains a Chinese phrase inside a script tag and some escaped HTML inside code tags.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
        <pre>
            <code>&lt;img src="cat.jpg"/></code>
        </pre>
        <p>The code for new line is <code>&lt;br/></code> in HTML.</p>
    </body>
</html>
EOD;

I'm parsing this code with PHP's DOMDocument. After saveHTML(), the Chinese characters come out garbled. The only solution I have found is to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> to the <head> tag.

Is there any other way to specify UTF-8 encoding without adding this meta tag?

Here is everything I've tried (none of it works):

// Default way. Chinese characters got encoded
$doc = new DOMDocument();
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Passed UTF-8 as parameter. Chinese characters got encoded
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Set encoding. Chinese characters got encoded
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Using mb_convert_encoding. Chinese characters got encoded
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Use html_entity_decode to decode. But this also decodes the entities inside the code tag
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo html_entity_decode($doc->saveHTML()) . PHP_EOL . PHP_EOL;

If you want to unconditionally override the encoding to UTF-8, you can do it by prepending the UTF-8 BOM to the file:

$doc = new DOMDocument();
$doc->loadHTML(str_starts_with($html, "\xEF\xBB\xBF")
    ? $html : ("\xEF\xBB\xBF" . $html));

The conditional expression is necessary because the library emits warnings if a double BOM is present at the beginning.
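As a minimal sketch of the BOM approach, the check can be wrapped in a small helper. The name loadHtmlAsUtf8 is hypothetical (it is not part of DOMDocument), and strncmp() is used instead of str_starts_with() only so that the sketch also runs on PHP 7. The comparison at the end checks the text through the DOM API, which always returns UTF-8 strings, so it does not depend on how the document is serialised.

```php
<?php
// Hypothetical helper, not part of DOMDocument: parse an HTML string as UTF-8
// by making sure it starts with exactly one UTF-8 byte order mark.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";
    // Only prepend the BOM when the input does not already start with one.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc;
}

// Sample input for illustration (this source file is assumed to be saved as UTF-8).
$doc = loadHtmlAsUtf8('<!DOCTYPE html><html><body><p>訂閱最新指南</p></body></html>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === '訂閱最新指南');
```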

If you merely want to have UTF-8 as the default encoding instead of latin1, there is no clean way to do that. You can use the following dirty hack, though:

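$doc = new DOMDocument();
$doc->loadHTML($html);
// No encoding was detected, so parse again with a fake XML declaration that forces UTF-8.
if ($doc->encoding === null) {
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    // Find the processing instruction the parser created from the fake declaration.
    $node = $doc->firstChild;
    while ($node !== null && !($node instanceof DOMProcessingInstruction)) {
        $node = $node->nextSibling;
    }
    // Remove it so that it does not show up in the serialised output.
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }
}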

The above has the unfortunate side effect that when the encoding declaration is missing from the file, the parse time is effectively doubled. (Also note that the HTML specification does not prescribe looking at <?xml?> processing instructions to detect the character encoding, meaning this workaround relies on functionality contrary to the specification.)

To make sure characters are not mangled during serialisation back to markup, use $doc->saveHTML($doc) instead of $doc->saveHTML(). This will always result in UTF-8 text, even if the document contains a declaration specifying a different encoding. To obtain the document in another encoding, you will have to convert it afterwards, for example with mb_convert_encoding($doc->saveHTML($doc), $doc->xmlEncoding, 'utf-8'). This should convert to the original encoding, although even this may still contradict a <meta> element found in the actual DOM tree.
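A short sketch of that serialisation step, under a few assumptions: the input is forced to UTF-8 with the BOM prefix described earlier, the sample markup is made up for illustration, and UTF-16LE is an arbitrary target encoding chosen only to show the conversion.

```php
<?php
// Sample input, prefixed with a UTF-8 BOM so that it is parsed as UTF-8.
$html = "\xEF\xBB\xBF" . '<!DOCTYPE html><html><body><p>訂閱最新指南</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Passing a node (here the document itself) makes saveHTML() return UTF-8 text.
$utf8Markup = $doc->saveHTML($doc);

// Convert afterwards if another encoding is needed (UTF-16LE is only an example).
$targetEncoding = 'UTF-16LE';
$converted = mb_convert_encoding($utf8Markup, $targetEncoding, 'UTF-8');
```

If $doc->xmlEncoding is used as the target instead, it seems worth checking that it is not null first, since a document without a detected encoding has nothing to convert back to.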

Given the number of workarounds necessary to use DOMDocument with anything approaching reliability, I'd strongly suggest switching to another parser. Preferably, to another programming language too.

I would simply add the meta tag as you suggested; I don't know what is stopping you from doing that. This worked fine for me:

$meta='<meta content="text/html; charset=utf-8" http-equiv="Content-Type">';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->encoding = 'UTF-8';
$doc->loadHTML($meta.$html); /* DOMDocument will put the meta at the right place */

echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

or, to strip the meta tag from the output again:

echo str_replace($meta,'',$doc->saveHTML()) . PHP_EOL . PHP_EOL;
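As an alternative to the string replacement, the temporary meta element can be removed through the DOM after parsing. The following is a rough sketch, not a tested recipe: the function name loadHtmlWithTemporaryMeta is hypothetical, and it assumes the first meta element with http-equiv="Content-Type" is the one that was prepended. Because libxml may report recoverable warnings when markup appears before the doctype, the sketch collects parser errors internally instead of emitting them.

```php
<?php
// Hypothetical helper: force UTF-8 with a temporary <meta>, then drop it again.
function loadHtmlWithTemporaryMeta(string $html): DOMDocument
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // Collect parser warnings internally and restore the previous setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before removing anything from the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('meta')) as $node) {
        if (strcasecmp($node->getAttribute('http-equiv'), 'Content-Type') === 0) {
            $node->parentNode->removeChild($node);
            break; // assumption: the first match is the prepended element
        }
    }
    return $doc;
}

$doc = loadHtmlWithTemporaryMeta('<!DOCTYPE html><html><body><p>訂閱最新指南</p></body></html>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

Removing the node rather than editing the serialised string avoids depending on the exact attribute order and quoting that saveHTML() produces.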
