
PHP DomXPath encoding issue after xpath

If I call echo $doc->saveHTML(); the characters display correctly, but once I use XPath to extract the element, the problem comes back.

I can't seem to display the characters properly. How do I convert them correctly? I'm getting:

婢跺繐顒滈拺鍙ョ瀵偓鐞涱偊鈧繑妲戦挅鍕綍婢舵牕顨� 闂€鍌溾敄缂侊綀濮虫稉濠呫€� 娑擃叀顣荤純鎴犵綍閺冭泛鐨绘總鍏呯瑐鐞涳綀鏉藉▎

instead of proper Chinese. My page's head sets the charset to gbk:

<head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="gbk"/></head>

My PHP code:

$html = file_get_contents('http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.aG3Kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=');
$doc = new DOMDocument();

// Based on Article http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258
$searchPage = mb_convert_encoding($html,"HTML-ENTITIES","GBK");
$doc->loadHTML($searchPage);
// echo $doc->saveHTML(); 

$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");

foreach ($elements as $e) {
   //echo $e->nodeValue;
   echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
}

You have the to_encoding and from_encoding parameters the wrong way around in your last call to mb_convert_encoding. The content returned from the XPath query is encoded as UTF-8, but you presumably want the output encoded as gbk (given that you've set the meta charset to "gbk").

So the final loop should be:

foreach ($elements as $e) {
  echo mb_convert_encoding($e->nodeValue,"gbk","utf-8");
}

The to_encoding is "gbk", and the from_encoding is "utf-8".
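
To make the argument order concrete, here is a minimal sketch that round-trips a two-character string between UTF-8 and GBK. The sample string and variable names are my own for illustration, and it assumes the mbstring extension is loaded and PHP 7 or later (for the \u{...} escapes).

```php
<?php
// Signature: mb_convert_encoding($string, $to_encoding, $from_encoding)

// U+4E2D U+6587, written with escapes so the example does not depend
// on the encoding this file happens to be saved in.
$utf8 = "\u{4E2D}\u{6587}";

// UTF-8 -> GBK: target encoding first, source encoding second.
$gbk = mb_convert_encoding($utf8, 'GBK', 'UTF-8');

echo bin2hex($utf8), "\n"; // e4b8ade69687 (three bytes per character)
echo bin2hex($gbk), "\n";  // d6d0cec4 (two bytes per character)

// GBK -> UTF-8 reverses the conversion.
$roundTrip = mb_convert_encoding($gbk, 'UTF-8', 'GBK');
var_dump($roundTrip === $utf8); // bool(true)
```

Swapping the last two arguments in the first call would treat UTF-8 bytes as if they were GBK, which is the kind of mix-up that produces garbage like the output in the question.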

That said, the answer given by AgreeOrNot should work too, if you are happy with the page being encoded as UTF-8.


As for how the encoding process works: internally DOMDocument uses UTF-8, which is why the results you get from your XPath queries are UTF-8, and why you need to convert them to gbk with mb_convert_encoding if that is the character set you need.
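
A small sketch of that behaviour, using a made-up HTML snippet rather than the real page: the input is pure ASCII (the Chinese characters are written as numeric entities), yet the value that comes back from the XPath query is UTF-8 bytes, which then have to be converted if the output page is gbk.

```php
<?php
$doc = new DOMDocument();

// Hypothetical stand-in for the real page. The two Chinese characters are
// numeric entities, so loadHTML has no source encoding to guess.
$doc->loadHTML(
    '<html><body><div id="detail"><div><h3>&#20013;&#25991;</h3></div></div></body></html>'
);

$xpath = new DOMXPath($doc);
$h3 = $xpath->query("//*[@id='detail']/div[1]/h3")->item(0);

// nodeValue comes back as UTF-8, whatever the input looked like.
$value = $h3->nodeValue;
echo bin2hex($value), "\n"; // e4b8ade69687

// For a page served as gbk, convert on output (to_encoding, then from_encoding).
$forGbkPage = mb_convert_encoding($value, 'GBK', 'UTF-8');
echo bin2hex($forGbkPage), "\n"; // d6d0cec4
```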

When you call loadHTML, it attempts to detect the source encoding and then convert the input from that encoding to UTF-8. Unfortunately the detection algorithm doesn't always work very well.

For example, although your example page has set the charset meta tag, that meta tag is not recognised by loadHTML, so it defaults to assuming the source encoding is Latin1. It would have worked if you had used an http-equiv meta tag specifying the Content-Type.

<meta http-equiv="Content-Type" content="text/html; charset=gbk" />

The alternative is to avoid the problem altogether by converting all non-ASCII characters to HTML entities (as you have done). That way it doesn't matter whether loadHTML detects the character encoding correctly, because there won't be any characters that need converting.
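
Here is a rough sketch of the entity approach end to end. Note that the "HTML-ENTITIES" target of mb_convert_encoding is deprecated as of PHP 8.2, so this version goes through mb_encode_numericentity instead; that is a substitution on my part, not what the question's code does. The helper name gbkHtmlToEntities and the sample markup are hypothetical, and the GBK input is built locally rather than downloaded.

```php
<?php
// Hypothetical helper: turn GBK-encoded HTML into pure ASCII by replacing
// every non-ASCII character with a numeric entity.
function gbkHtmlToEntities(string $gbkHtml): string
{
    $utf8 = mb_convert_encoding($gbkHtml, 'UTF-8', 'GBK');
    // Map: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// Stand-in for file_get_contents() on a gbk page.
$gbkHtml = mb_convert_encoding(
    "<html><body><div id=\"detail\"><div><h3>\u{4E2D}\u{6587}</h3></div></div></body></html>",
    'GBK',
    'UTF-8'
);

$ascii = gbkHtmlToEntities($gbkHtml);

$doc = new DOMDocument();
$doc->loadHTML($ascii); // nothing left for the encoding detection to get wrong

$xpath = new DOMXPath($doc);
foreach ($xpath->query("//*[@id='detail']/div[1]/h3") as $e) {
    echo $e->nodeValue, "\n"; // UTF-8 text
}
```

The markup characters themselves (<, >, quotes) are below U+0080, so they pass through untouched and the document structure is preserved.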

Since you've already converted the document to HTML entities, you don't need to convert the encoding when you print the result. So:

echo $e->nodeValue;
// echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");

The reason you didn't get the correct output is that you put <meta charset="gbk"/> in your HTML when it should be <meta charset="utf-8"/>.


 