Curl: get UTF-8 data from site with incorrect charset

I scrape some sites that occasionally have UTF-8 characters in the title but that don't specify UTF-8 as the charset (qq.com is an example). When I look at the website in my browser, the data I want to copy (i.e. the title) looks correct (Japanese or Chinese, I'm not sure which). I can copy the title and paste it into the terminal and it looks exactly the same. I can even write it to the DB, and when I retrieve it from the DB it still looks the same, and correct.

However, when I use cURL, the data that gets printed is wrong. I can run cURL from the command line or use PHP; either way, the output is clearly incorrect when printed to the terminal, and it stays that way when I store it in the DB (remember: the terminal can display these characters properly). I've tried all applicable combinations of the following:

  • Setting CURLOPT_BINARYTRANSFER to true
  • mb_convert_encoding($html, 'UTF-8')
  • utf8_encode($html)
  • utf8_decode($html)

None of these displays the characters as expected. This is very frustrating, since I can get the right characters so easily just by visiting the site, but cURL can't. I've read a lot of suggestions such as this one: How to get web-page-title with CURL in PHP from web-sites of different CHARSET?

The solution in general seems to be "convert the data to UTF-8." To be honest, I don't actually know what that means. Don't the above functions convert the data to UTF-8? Why isn't it already UTF-8? What is it, and why does it display properly in some circumstances, but not with cURL?

Have you tried:

$html = iconv("gb2312","utf-8",$html);

The gb2312 charset name was taken from the qq.com headers.
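As a rough sketch of why this works: cURL returns the raw bytes the server sent, and a GB2312 page encodes the same characters with different bytes than UTF-8 does, so the bytes have to be converted from the declared charset before they are printed or stored. The example below is illustrative only. The helper names charsetFromContentType and toUtf8 are hypothetical, and the hard-coded byte string stands in for a body that would normally come from curl_exec(). In a real request the header value could be read with curl_getinfo($ch, CURLINFO_CONTENT_TYPE).

```php
<?php
// Hypothetical helper: extract the charset from a Content-Type header value.
// Falls back to $default when the header is missing or declares no charset.
function charsetFromContentType(?string $contentType, string $default = 'UTF-8'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: convert a response body to UTF-8 from the given charset.
// Returns the body unchanged if it is already UTF-8 or if iconv() fails
// (iconv() also raises a notice when it meets an invalid byte sequence).
function toUtf8(string $body, string $charset): string
{
    if ($charset === 'UTF-8') {
        return $body;
    }
    $converted = iconv($charset, 'UTF-8', $body);
    return $converted === false ? $body : $converted;
}

// The two characters "中文" as GB2312 bytes; stands in for a cURL response body.
$gbBody  = "\xD6\xD0\xCE\xC4";
$charset = charsetFromContentType('text/html; charset=GB2312');
$utf8    = toUtf8($gbBody, $charset);

echo bin2hex($gbBody), "\n"; // d6d0cec4     (2 bytes per character in GB2312)
echo bin2hex($utf8), "\n";   // e4b8ade69687 (3 bytes per character in UTF-8)
```

Note that utf8_encode() assumes its input is ISO-8859-1, so applying it to GB2312 bytes produces valid UTF-8 for the wrong characters, which is consistent with the attempts listed in the question not helping.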
