DOMDocument and special characters

Question

This is my code:

$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();

This is the output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&Atilde;&uml;&Atilde;&nbsp;&Atilde;&copy;&Atilde;&not;&Atilde;&sup2;&Atilde;&sup1;</p></body></html>

This is the output I want:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èàéìòù</p></body></html>

I have tried...

$oDom = new DomDocument('4.0', 'UTF-8');

or 1.0 and other things, but nothing worked.

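One more thing... is there a way to get the same HTML back untouched? For example, given the input <p>hello!</p>, I want the same output <p>hello!</p>, using DOMDocument only to parse the DOM and make a few replacements inside the tags.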

Answer 1

Solution:

$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!

$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!

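The saveHTML() method works differently when a node is specified. You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually. The other important thing is utf8_decode(). In my case, none of the properties or other methods of the DOMDocument class produced the expected result.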
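As a side note, utf8_decode() is deprecated as of PHP 8.2. A minimal sketch of the same approach, assuming the input is UTF-8 and the mbstring extension is available, could use mb_convert_encoding() for the UTF-8 to ISO-8859-1 step instead. The sample string and the $sText variable are illustrative and not part of the original answer.

```php
<?php
// Illustrative input, assumed to be UTF-8 encoded.
$sString = '<p>èàéìòù</p>';

$oDom = new DOMDocument();

// Same conversion utf8_decode() performs: UTF-8 -> ISO-8859-1.
// With no charset declared, loadHTML() reads these bytes as Latin-1.
$oDom->loadHTML(mb_convert_encoding($sString, 'ISO-8859-1', 'UTF-8'));

// Strings read back from the DOM are UTF-8 again.
$sText = $oDom->getElementsByTagName('p')->item(0)->textContent;

// Serialise starting from the root element, as in the answer above.
$sHtml = $oDom->saveHTML($oDom->documentElement);
echo $sHtml, "\n";
```

Checking $sText rather than the serialised markup is a convenient way to confirm the characters were decoded correctly, because the exact escaping in saveHTML() output can vary.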

Answer 2

Try setting the encoding after loading the HTML.

$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->encoding = 'utf-8';
echo $dom->saveHTML();

Another way

Answer 3

$dom = new DomDocument();
$str = htmlentities($str);
$dom->loadHTML(utf8_decode($str));
$dom->encoding = 'utf-8';
.
.
.
$str = $dom->saveHTML();
$str = html_entity_decode($str);

The code above worked for me.

Answer 4

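According to the user comments on the php.net manual page, this problem seems to be known. The solutions suggested there include putting

<meta http-equiv="content-type" content="text/html; charset=utf-8">

in the document before you put any strings with non-ASCII characters in.

Another hack suggests putting

<?xml encoding="UTF-8">

as the first text in the document and then removing it at the end.

Nasty stuff. Smells like a bug to me.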
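The second workaround could be sketched roughly as follows. This is a minimal sketch rather than the exact code from the manual comments: the helper name loadHtmlAsUtf8() and the sample input are made up for illustration, and the result depends on how the underlying libxml2 parser treats the prefix, which may differ between versions.

```php
<?php
// Hypothetical helper, not part of DOMDocument or the original answer.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The prefix hints to the parser that the bytes are UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop the prefix again if it was kept as a processing instruction.
    // Iterate over a copy, since childNodes is a live list.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }

    return $dom;
}

$dom = loadHtmlAsUtf8('<p>èàéìòù</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```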

Answer 5

This way:

/**
 * @param string $text
 * @return DOMDocument
 */
private function buildDocument($text)
{
    $dom = new DOMDocument();

    libxml_use_internal_errors(true);
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_use_internal_errors(false);

    return $dom;
}

Answer 6

I don't know why the accepted answer didn't work for my problem, but this one did.

Reference: https://www.php.net/manual/en/class.domdocument.php

<?php

// Bail out early if the content is empty, to avoid the warning
if ( empty( $content ) ) {
    return false;
}

// Convert all non-ASCII characters to HTML entities (input is treated as UTF-8)
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

// Create a new document
$doc = new DOMDocument('1.0', 'utf-8');

// Keep libxml parse errors from being printed
libxml_use_internal_errors(true);

// Load the content without adding the enclosing html/body tags or the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Do whatever you want with the document now

?>
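Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. One possible substitute, sketched here on the assumption that the input is UTF-8, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric character reference before parsing. The sample content and variable names are illustrative only.

```php
<?php
// Illustrative UTF-8 input.
$content = '<p>èàéìòù</p>';

// Conversion map: [first code point, last code point, offset, mask].
// This map covers every code point above the ASCII range.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($content, $map, 'UTF-8');

$doc = new DOMDocument();

// Keep libxml parse errors from being printed.
libxml_use_internal_errors(true);
$doc->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// The parser resolves the references, so the DOM holds the real characters.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the string handed to loadHTML() is pure ASCII at that point, the parser's encoding guess no longer matters.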

Answer 7

What worked for me:

$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

Credit: https://davidwalsh.name/domdocument-utf8-problem

Answer 8

It looks like you just need to set substituteEntities when you create the DOMDocument object.

Answer 9

None of the above worked for me, but this did the job:

$fileContent = file_get_contents('my_file.html');
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($fileContent, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->encoding = 'utf-8';
$html = $dom->saveHTML();
$html = html_entity_decode($html, ENT_COMPAT, 'UTF-8');
echo $html;

9 answers

Answer 1: 57, accepted, 2011-07-08 06:11:19

Answer 2: 7, 2011-07-04 15:32:57

Answer 3: 6, 2020-02-28 07:34:23

Answer 4: 5

Answer 5: 4, 2018-10-31 12:00:43

Answer 6: 4, 2019-10-09 03:38:48

Answer 7: 0, 2020-03-20 07:11:44

Answer 8: 0, 2011-07-04 15:15:18

Answer 9: 0, 2021-04-22 08:40:45