UTF-8 to ISO-8859-1 encoding problem

I'm trying to preview the latest post from an RSS feed on another website. The feed is UTF-8 encoded, whilst the website is ISO-8859-1 encoded. When displaying the title, I'm using:

 $post_title = 'Blogging – does it pay the bills?';

 echo mb_convert_encoding($post_title, 'iso-8859-1','utf-8');

 // returns: Blogging ? does it pay the bills?
 // expected: Blogging - does it pay the bills?

Note that the hyphen I'm expecting isn't a normal minus sign but some big-ass uber dash. Well, a few pixels longer anyway. :) Not sure how else to describe it, as my keyboard can't produce that character...

mb_convert_encoding does convert the byte sequences, but any character that has no equivalent in the target character set is replaced with a substitute character (? by default) rather than approximated. For an approximation you need iconv with //TRANSLIT, whose result depends on the iconv implementation and locale.

mb_internal_encoding( 'UTF-8' );
ini_set( 'default_charset', 'ISO-8859-1' );

$post_title = 'Blogging — does it pay the bills?'; // I used the actual m-dash here to best mimic your scenario

echo iconv( 'UTF-8', 'ISO-8859-1//TRANSLIT', $post_title );

Or, as others have said, just convert out-of-range characters to HTML entities.

I suspect you mean an Em Dash (—). ISO-8859-1 doesn't include this character, so you aren't going to have much luck converting it to that encoding.

You could use htmlentities(), but I'd suggest moving off ISO-8859-1 to UTF-8 for publication.
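A minimal sketch of the htmlentities() route, assuming the title arrives as valid UTF-8 and the dash is an en dash (the `\u{2013}` escape needs PHP 7 or later and keeps the source file itself pure ASCII). Characters that have a named HTML 4.01 entity come out as ASCII, which is also valid ISO-8859-1:

```php
<?php
// Assumption: the feed title is valid UTF-8; the dash is written as an escape
// so the example does not depend on the editor's encoding.
$post_title = "Blogging \u{2013} does it pay the bills?";

// Tell htmlentities() the input is UTF-8; the en dash becomes a named entity.
$safe_title = htmlentities($post_title, ENT_QUOTES, 'UTF-8');

echo $safe_title, "\n"; // Blogging &ndash; does it pay the bills?
```

Characters without a named entity are left as they are, so this only covers the common typographic cases.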

I suppose the following:

  • Your file is actually encoded with UTF-8
  • Your editor interprets the file with Windows-1252

The reason for that is that your dash character is represented by –. That's exactly what you get when you interpret the UTF-8 byte sequence of EN DASH (U+2013), 0xE28093, as Windows-1252 (0xE2 = â, 0x80 = €, 0x93 = “). So you first need to fix your editor encoding.

And the reason for the ? in your output is that ISO 8859-1 doesn't contain that dash character (nor the EM DASH, U+2014).
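One way to check which character hides behind the garbled sequence is to reverse the misinterpretation. This is only a sketch: it assumes the three characters shown in the question's title (â, €, “) are what the editor displayed, and it needs PHP 7.2 or later for mb_ord().

```php
<?php
// The three characters from the question's title, written as escapes.
$garbled = "\u{00E2}\u{20AC}\u{201C}";

// Encode them as Windows-1252 to recover the raw bytes of the original file.
$bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
$hex = bin2hex($bytes);

// Read those bytes as UTF-8 to find the character they really encode.
$code_point = sprintf('U+%04X', mb_ord($bytes, 'UTF-8'));

// ISO-8859-1 has no slot for it, so mbstring substitutes its default "?".
$latin1 = mb_convert_encoding($bytes, 'ISO-8859-1', 'UTF-8');

echo $hex, ' ', $code_point, ' ', $latin1, "\n"; // e28093 U+2013 ?
```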

It's probably an em dash (U+2014), and what you're trying to do isn't converting the encoding, because the hyphen is a different character. In other words, you want to search for such characters and replace them manually.
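The manual replacement could look roughly like this. The function name to_latin1() and the particular replacement table are illustrative assumptions, not something from the thread; extend the table with whatever characters the feed actually contains.

```php
<?php
// Hypothetical helper: swap common typographic characters for ASCII
// look-alikes, then convert what is left to ISO-8859-1.
function to_latin1(string $utf8): string
{
    $replacements = [
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2026}" => '...', // ellipsis
    ];

    return mb_convert_encoding(strtr($utf8, $replacements), 'ISO-8859-1', 'UTF-8');
}

echo to_latin1("Blogging \u{2013} does it pay the bills?"), "\n";
// Blogging - does it pay the bills?
```

Characters that ISO-8859-1 does contain, such as é, are converted normally; anything else not in the table still ends up as ?.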

Better yet, just switch the website to UTF-8. It is byte-for-byte identical to Latin-1 in the ASCII range and is more appropriate for a website in 2009.
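On the PHP side, the switch might be sketched like this. It assumes the script runs before any output is sent and that the page templates and any database connection are moved to UTF-8 as well, which is not shown here.

```php
<?php
// Assumption: nothing has been echoed yet, so the header can still be set.
ini_set('default_charset', 'UTF-8');
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// The feed title can now be passed through unchanged, apart from HTML escaping.
$post_title = "Blogging \u{2013} does it pay the bills?";
$escaped_title = htmlspecialchars($post_title, ENT_QUOTES, 'UTF-8');

echo $escaped_title, "\n";
```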
