
Curl: get UTF-8 data from site with incorrect charset

I scrape some sites that occasionally have UTF-8 characters in the title, but that don't specify UTF-8 as the charset (qq.com is an example). When I look at the website in my browser, the data I want to copy (i.e. the title) looks correct (Japanese or Chinese, I'm not sure which). I can copy the title and paste it into the terminal and it looks exactly the same. I can even write it to the DB, and when I retrieve it from the DB it still looks the same, and correct.

However, when I use cURL, the data that gets printed is wrong. Whether I run cURL from the command line or through PHP, the output is clearly incorrect when printed to the terminal, and it remains that way when I store it in the DB (remember: the terminal can display these characters properly). I've tried all eligible combinations of the following:

  • Setting CURLOPT_BINARYTRANSFER to true
  • mb_convert_encoding($html, 'UTF-8')
  • utf8_encode($html)
  • utf8_decode($html)

None of these display the characters as expected. This is very frustrating, since I can get the right characters so easily just by visiting the site, but not via cURL. I've read a lot of suggestions such as this one: How to get web-page-title with CURL in PHP from web-sites of different CHARSET?

The solution in general seems to be "convert the data to UTF-8." To be honest, I don't actually know what that means. Don't the above functions convert the data to UTF-8? Why isn't it already UTF-8? What is it, and why does it display properly in some circumstances, but not for cURL?
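To illustrate what "convert the data to UTF-8" means: the raw bytes cURL returns are valid text, just in a different encoding (GB2312 for qq.com), and they must be reinterpreted from that encoding into UTF-8. Crucially, `mb_convert_encoding($html, 'UTF-8')` with no third argument doesn't know the source encoding, so it can't do that reinterpretation correctly. A minimal sketch (the byte values are real GB2312/UTF-8 encodings of the character 中):

```php
<?php
// The bytes 0xD6 0xD0 are the character 中 in GB2312/GBK —
// the kind of bytes cURL hands back from qq.com.
$gb = "\xD6\xD0";

// Wrong: with no source encoding, mb_convert_encoding assumes the
// bytes are already in PHP's internal encoding (usually UTF-8),
// so the result is mojibake, not a conversion.
// $bad = mb_convert_encoding($gb, 'UTF-8');

// Right: name the encoding the bytes actually use.
$utf8 = mb_convert_encoding($gb, 'UTF-8', 'GB2312');

var_dump($utf8 === "\xE4\xB8\xAD"); // the UTF-8 bytes for 中
```

The browser gets this right because it reads the charset from the HTTP `Content-Type` header (or the page's `<meta>` tag) and decodes accordingly; cURL just hands you the raw bytes and leaves decoding to you.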

Have you tried:

$html = iconv("gb2312","utf-8",$html);

The gb2312 value was taken from the Content-Type header that qq.com sends.
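Rather than hard-coding gb2312, you can read the charset from the response's `Content-Type` header at fetch time and fall back to a default when it's missing. A sketch of that approach (the function names are my own, not a standard API; the GB2312 fallback matches what qq.com sends):

```php
<?php
// Extract the charset from a Content-Type value,
// e.g. "text/html; charset=gb2312" -> "gb2312".
function charset_from_content_type($contentType): ?string {
    if (is_string($contentType) &&
        preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Fetch a page and convert the body to UTF-8 using the
// declared charset, or $fallback if none is declared.
function fetch_as_utf8(string $url, string $fallback = 'GB2312'): string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    $charset = charset_from_content_type(
        curl_getinfo($ch, CURLINFO_CONTENT_TYPE)
    ) ?? $fallback;
    curl_close($ch);
    // //TRANSLIT approximates characters that have no UTF-8 mapping
    // rather than failing; this is a no-op for pages already in UTF-8.
    return iconv($charset, 'UTF-8//TRANSLIT', $html);
}
```

Note that some pages declare the charset only in a `<meta>` tag rather than the HTTP header, so a production scraper would also want to check the HTML itself.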
