PHP DOMDocument loadHTML UTF-8 encoding correctly with HTML5 doctype
I am using PHP's DOMDocument class with an HTML5 document. But when I do, some UTF-8 characters are "changed": characters such as the non-breaking space, ’ and é come out garbled.
Here is my code.
$parsedUrl = 'http://www.futursparents.com/';
$curl = curl_init();
@curl_setopt_array($curl, [
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_TIMEOUT => 60,
    CURLOPT_CONNECTTIMEOUT => 30,
    CURLOPT_FOLLOWLOCATION => TRUE,
    CURLOPT_MAXREDIRS => 5,
    CURLOPT_AUTOREFERER => FALSE,
    CURLOPT_HEADER => TRUE, // FALSE
    CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_CERTINFO => TRUE,
    CURLOPT_LOW_SPEED_LIMIT => 200,
    CURLOPT_LOW_SPEED_TIME => 50,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_PROXYTYPE => CURLPROXY_HTTP,
    CURLOPT_ENCODING => 'gzip,deflate',
    CURLOPT_URL => $parsedUrl,
]);
$response = curl_exec($curl);
$info = curl_getinfo($curl);
$error = curl_error($curl);
$headers = trim(substr($response, 0, curl_getinfo($curl, CURLINFO_HEADER_SIZE)));
$content = substr($response, curl_getinfo($curl, CURLINFO_HEADER_SIZE));
curl_close($curl);
libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTML($content);
print_r($domDoc->encoding); // It's OK => UTF-8
// Non-breaking spaces, ’, é, etc. come out garbled
print_r($domDoc->saveHTML());
The page seems to use an HTML5 doctype with a meta element like <meta charset="utf-8">.
If I prepend the charset meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, the output seems to be OK.
$domDoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content);
// Non-breaking spaces, ’, é, etc. are no longer garbled
print_r($domDoc->saveHTML());
Do you think this is the right solution?
I found out why.
The DOM extension is built on libxml2, whose HTML parser was made for HTML 4. If the document has an HTML5 doctype and a meta element like <meta charset="utf-8">, the HTML5-style charset declaration is not picked up: the HTML code gets interpreted as ISO-8859-something and non-ASCII characters get converted into HTML entities.
However, the HTML4-like version will work:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
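To check how a given PHP/libxml2 build behaves, the two declarations can be compared side by side. This is only a minimal sketch: firstParagraphText() and the sample markup are made up for illustration, and the result for the HTML5-style declaration is deliberately not stated because it may differ between libxml2 versions.

```php
<?php
// Hypothetical helper (not part of the original code): parse $html with
// DOMDocument and return the text of the first <p>, or null if there is none.
function firstParagraphText(string $html): ?string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $p = $doc->getElementsByTagName('p')->item(0);
    return $p === null ? null : $p->textContent;
}

// "caf\u{E9}" is "café" encoded as UTF-8 (sample text, an assumption).
$expected = "caf\u{E9}";

// HTML5-style declaration only.
$html5 = '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
       . '<body><p>' . $expected . '</p></body></html>';

// Same document with the HTML4-style declaration prepended, as in the question.
$html4Style = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
            . $html5;

// true means the text survived parsing unchanged.
var_dump(firstParagraphText($html5) === $expected);
var_dump(firstParagraphText($html4Style) === $expected);
```

If the first check reports false while the second reports true, the build shows the behavior described above.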
Reference: UTF-8 with PHP DOMDocument loadHTML?