
Encoding issues with GBK pages and DOMXPath

I'm trying to fetch the link below with cURL; the page is encoded in GBK. I want to extract the product title and image, but when I echo the document to test whether it works, the Chinese characters don't come through. I need to extract them with DOMXPath and display the same characters on my website, not garbled ones. How does this actually work?

$ch = curl_init("http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.aG3Kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);


$doc = new DOMDocument();
$searchPage = mb_convert_encoding($content, 'utf-8', "auto");
$doc->loadHTML($searchPage);
echo $doc->saveHTML(); 

Check whether mbstring.language in php.ini is set to GBK, or specify the source encoding explicitly:

$searchPage = mb_convert_encoding($content, 'utf-8', "gb18030");

I had the same problem, and this solution worked for me:

  $str = file_get_contents($url);
  $str = mb_convert_encoding($str,'utf-8', "gb18030");
  $str = str_replace('<head>', '<head><meta HTTP-EQUIV=Content-Type content="text/html;charset=utf-8">', $str);
  $dom = new DOMDocument('1.0');
  @$dom->loadHTML($str);

DOMDocument reads the encoding declaration in the HTML, so put it immediately after the opening <head> tag.
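
Neither answer shows the DOMXPath step the question asks about, so here is a minimal sketch of the whole pipeline. It assumes the mbstring and dom extensions are loaded. The sample markup, the "main-pic" id and the example.com URL are made up for illustration and stand in for the downloaded page; the real page will need different XPath expressions.

```php
<?php
// Stand-in for the downloaded page: a small document encoded as GB18030
// (a superset of GBK). The markup, id and URL here are hypothetical.
$utf8Sample = '<html><head><title>中文商品标题</title></head>'
            . '<body><img id="main-pic" src="http://example.com/a.jpg"></body></html>';
$content = mb_convert_encoding($utf8Sample, 'GB18030', 'UTF-8');

// 1. Convert from the known source encoding instead of relying on "auto".
$html = mb_convert_encoding($content, 'UTF-8', 'GB18030');

// 2. Declare UTF-8 right after <head> so the parser does not guess the encoding.
$html = str_replace(
    '<head>',
    '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
    $html
);

// 3. Parse, collecting markup warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// 4. Query with DOMXPath; string(...) returns the first match as a UTF-8 string.
$xpath = new DOMXPath($doc);
$title = trim($xpath->evaluate('string(//title)'));
$image = $xpath->evaluate('string(//img[@id="main-pic"]/@src)');

echo $title, "\n", $image, "\n";
```

When echoing these values on your own site, the page you serve also has to be declared as UTF-8 (for example via a Content-Type header with charset=utf-8), otherwise the browser may still show garbled characters.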
