
Curl: get UTF-8 data from site with incorrect charset

I scrape some sites that occasionally have UTF-8 characters in the title, but that don't specify UTF-8 as the charset (qq.com is an example). When I look at the website in my browser, the data I want to copy (i.e. the title) looks correct (Japanese or Chinese, I'm not sure which). I can copy the title and paste it into the terminal and it looks exactly the same. I can even write it to the DB, and when I retrieve it from the DB it still looks the same, and correct.

However, when I use cURL, the data that gets printed is wrong. Whether I run cURL from the command line or through PHP, the output is clearly incorrect when printed to the terminal, and it remains that way when I store it in the DB (remember: the terminal can display these characters properly). I've tried all eligible combinations of the following:

  • Setting CURLOPT_BINARYTRANSFER to true
  • mb_convert_encoding($html, 'UTF-8')
  • utf8_encode($html)
  • utf8_decode($html)

None of these display the characters as expected. This is very frustrating, since I can get the right characters so easily just by visiting the site, but not via cURL. I've read a lot of suggestions such as this one: How to get web-page-title with CURL in PHP from web-sites of different CHARSET?

The solution in general seems to be "convert the data to UTF-8." To be honest, I don't actually know what that means. Don't the above functions convert the data to UTF-8? Why isn't it already UTF-8? What is it, and why does it display properly in some circumstances, but not for cURL?
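To illustrate what "convert the data to UTF-8" means: the raw bytes cURL returns are valid text, just in a different encoding (GB2312 for qq.com), and they must be reinterpreted from that encoding into UTF-8. Crucially, `mb_convert_encoding($html, 'UTF-8')` with no third argument doesn't know the source encoding, so it can't do that reinterpretation correctly. A minimal sketch (the byte values are real GB2312/UTF-8 encodings of the character 中):

```php
<?php
// The bytes 0xD6 0xD0 are the character 中 in GB2312/GBK —
// the kind of bytes cURL hands back from qq.com.
$gb = "\xD6\xD0";

// Wrong: with no source encoding, mb_convert_encoding assumes the
// bytes are already in PHP's internal encoding (usually UTF-8),
// so the result is mojibake, not a conversion.
// $bad = mb_convert_encoding($gb, 'UTF-8');

// Right: name the encoding the bytes actually use.
$utf8 = mb_convert_encoding($gb, 'UTF-8', 'GB2312');

var_dump($utf8 === "\xE4\xB8\xAD"); // the UTF-8 bytes for 中
```

The browser gets this right because it reads the charset from the HTTP `Content-Type` header (or the page's `<meta>` tag) and decodes accordingly; cURL just hands you the raw bytes and leaves decoding to you.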

Have you tried:

$html = iconv("gb2312","utf-8",$html);

The gb2312 value was taken from the Content-Type header that qq.com sends.
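Rather than hard-coding gb2312, you can read the charset from the response's `Content-Type` header at fetch time and fall back to a default when it's missing. A sketch of that approach (the function names are my own, not a standard API; the GB2312 fallback matches what qq.com sends):

```php
<?php
// Extract the charset from a Content-Type value,
// e.g. "text/html; charset=gb2312" -> "gb2312".
function charset_from_content_type($contentType): ?string {
    if (is_string($contentType) &&
        preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Fetch a page and convert the body to UTF-8 using the
// declared charset, or $fallback if none is declared.
function fetch_as_utf8(string $url, string $fallback = 'GB2312'): string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    $charset = charset_from_content_type(
        curl_getinfo($ch, CURLINFO_CONTENT_TYPE)
    ) ?? $fallback;
    curl_close($ch);
    // //TRANSLIT approximates characters that have no UTF-8 mapping
    // rather than failing; this is a no-op for pages already in UTF-8.
    return iconv($charset, 'UTF-8//TRANSLIT', $html);
}
```

Note that some pages declare the charset only in a `<meta>` tag rather than the HTTP header, so a production scraper would also want to check the HTML itself.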
