List of known troublesome characters that cause PHP to fail to detect the proper character encoding before converting to UTF-8, resulting in lost data

PHP isn't always correct, but what I write always has to be. In this case, an email subject contains an en dash character. This thread is about identifying oddball characters that, when they appear alone (say, among otherwise purely ASCII text), cause PHP to detect the wrong character encoding. I've already found one static example, though my goal here is to create a definitive thread containing something as close to drop-in code as we can manage.

Here is my starting string from the subject header of an email:

<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>

Typically the next step is to do the following:

$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo

Past that point I'd typically do the following:

$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!

I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:

<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!

//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);

//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}

//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>

Now either I've still programmed this incorrectly and there is something in PHP I'm missing (I tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding as I've done on line 12. In the latter case a bug report needs to be created, or more likely updated if one specific to this issue already exists.

So technically the question is:

How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?

If no such proper answer exists, valid answers include characters that, when they appear among otherwise purely ASCII text, cause PHP to fail to detect the correct character encoding, resulting in an incorrect UTF-8 encoded string. Presuming this issue is fixed in the future and the fix can be validated against all of the oddball characters listed in the other relevant answers, a proper answer can then be accepted.

You are blaming PHP for something that PHP could not possibly solve:

  • $s1 is an ASCII string, just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
  • $s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes plus the label that was provided in the input.
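
The two points above can be checked directly. The following is a minimal sketch, assuming the mbstring extension is loaded; it uses a hand-rolled Q decoding (quoted_printable_decode plus an underscore replacement) as a stand-in for imap_mime_header_decode, so that it runs without the IMAP extension. The variable names are illustrative only.

```php
<?php
// The header exactly as received: every byte is below 0x80.
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
$isAscii = mb_check_encoding($s1, 'ASCII');

// Split the encoded word into its parts: =?charset?scheme?payload?=
[, $charset, $scheme, $payload] = explode('?', $s1);

// Illustration only: in Q encoding "_" means space and "=XX" is a hex byte.
$rawBytes = quoted_printable_decode(str_replace('_', ' ', $payload));

var_dump($isAscii);                // bool(true): the wire format is plain ASCII
echo $charset, "\n";               // ISO-8859-1: the label the sender chose
echo bin2hex($rawBytes[6]), "\n";  // 96: the raw byte hiding behind "=96"
```

Nothing in the decoded bytes records what 0x96 was meant to be; the only hint is the label, and the label comes from the sender.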

Your actual problem is that the information you were sent was wrong: the system that sent it to you made the common mistake of mislabelling Windows-1252 as ISO-8859-1.

The difference between the two encodings is that bytes from 0x80 to 0x9F are control characters in ISO 8859 and are (mostly) assigned to printable characters in Windows-1252. Note that there is no way for any system to automatically tell you which interpretation was intended: either way, there is simply a byte in memory containing 0x96. However, it is overwhelmingly more likely that any such bytes are intended to be Windows-1252 characters and not the very rarely used extra control characters from ISO 8859, so a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
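
To make the difference concrete, here is a small sketch (assuming the mbstring extension is loaded) that converts the same single byte under each label. The variable names are illustrative only.

```php
<?php
$byte = "\x96";

// Same byte, two labels, two different characters after conversion.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c296   = U+0096, an invisible C1 control character
echo bin2hex($asCp1252), "\n"; // e28093 = U+2013, the en dash
```

Both conversions succeed, which is why the en dash appears to go missing rather than raising an error: under the ISO-8859-1 label it becomes a control character that most clients do not render.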

That makes the solution really very simple:

// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

// Decode the string into its labelled encoding, and string of bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;

// If it claims to be ISO-8859-1, assume it's lying
if ( $input_encoding === 'ISO-8859-1' ) {
    $input_encoding = 'Windows-1252';
}

// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
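
If the same override is needed in several places, it could be wrapped in a small function. This is only a sketch: labelled_bytes_to_utf8 is a hypothetical helper name, not part of PHP, and the case-insensitive comparison is an added assumption on the grounds that senders do not always write charset labels in upper case. It takes the charset and text values that imap_mime_header_decode returns for each element, and assumes the mbstring extension is loaded.

```php
<?php
/**
 * Hypothetical helper: convert labelled bytes to UTF-8, treating an
 * ISO-8859-1 label as Windows-1252.
 */
function labelled_bytes_to_utf8(string $bytes, string $charset): string
{
    // Assumption: compare labels case-insensitively ("iso-8859-1" also occurs).
    if (strcasecmp(trim($charset), 'ISO-8859-1') === 0) {
        $charset = 'Windows-1252';
    }

    return mb_convert_encoding($bytes, 'UTF-8', $charset);
}

// Usage with the bytes and label from the example header:
$utf8 = labelled_bytes_to_utf8("orkut \x96 convite enviado por Lais Piccirillo", 'iso-8859-1');
echo $utf8, "\n"; // orkut – convite enviado por Lais Piccirillo
```

A header can contain more than one encoded word, so real code would loop over every element returned by imap_mime_header_decode and concatenate the converted pieces rather than reading only index 0.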
