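List of known troublesome characters that cause PHP to fail to detect the correct character encoding before converting to UTF-8, resulting in lost data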

Question

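PHP does not always get things right, and what I write always has to be correct. In this case, an email subject contains an en dash character. This thread is about identifying oddball characters that PHP detects incorrectly when they appear alone (say, among otherwise purely ASCII text). I have already pinned down one static example, though my goal here is a definitive thread containing something as close to drop-in code as we can possibly create.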

Here is the string I start with, taken from the subject header of an email:

<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>

Typically, the next step is to do the following:

$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo

Typically, past that point I would do the following:

$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!

I had already received a static answer to an earlier static question. Eventually, I was able to put this working set of code together:

<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!

//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);

//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}

//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>

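Now, either I am still programming this incorrectly and there is something in PHP I am missing (I have tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode), or, at least for the moment, PHP forces us to detect specific characters and manually override the encoding, as I have done on line 12. In the latter case, a bug report needs to be created, or updated if one specific to this issue already exists.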

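So technically the question is:

How do we properly detect all character encodings so that no characters are lost when converting strings to UTF-8?

If no such answer exists, valid answers include characters that, among otherwise purely ASCII text, cause PHP to fail to detect the correct character encoding and so produce an incorrectly encoded UTF-8 string. Presuming this issue is fixed in the future, and the fix can be validated against all of the odd characters listed in the other relevant answers, a proper answer can then be accepted.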

Answer 1

You are blaming PHP for a problem that PHP could not possibly solve:

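$s1 is an ASCII string; just as the string "smiley face emoji" is ASCII, even though the emoji character it describes is not.
$s2 is decoded based on the information you were sent. More precisely, it is decoded into a raw sequence of bytes plus the label provided in the input.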
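
To see why mb_detect_encoding reports ASCII for $s1, it can help to separate the transport string from the bytes it describes. The following is a hand-rolled sketch for illustration only, not a replacement for imap_mime_header_decode: it assumes a single Q-encoded word, and the variable names ($label, $method, $payload, $rawBytes, $oddByte) are made up for this example.

```php
<?php
// The header exactly as received: every byte is in the ASCII range.
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
$isAscii = mb_check_encoding($s1, 'ASCII');

// Split the encoded word into its parts: "=", charset label, method, payload, "=".
[, $label, $method, $payload] = explode('?', $s1);

// Q-encoding is quoted-printable with "_" standing in for a space.
$rawBytes = quoted_printable_decode(str_replace('_', ' ', $payload));

// The byte right after "orkut " is the one that causes the trouble.
$oddByte = bin2hex($rawBytes[6]);

var_dump($isAscii);                              // bool(true)
var_dump($label);                                // string(10) "ISO-8859-1"
var_dump($oddByte);                              // string(2) "96"
var_dump(mb_check_encoding($rawBytes, 'UTF-8')); // bool(false)
```

The decoded bytes are neither ASCII nor valid UTF-8, so the only information about their meaning is the label that arrived with them.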

Your actual problem is that the information you were sent is wrong: the system that sent it to you made the common mistake of mislabelling Windows-1252 as ISO-8859-1.

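The difference between the two encodings is that the bytes from 0x80 to 0x9F are control characters in ISO 8859, and are (mostly) assigned to printable characters in Windows-1252. Note that no system can automatically tell you which interpretation was intended; either way, there is simply a byte in memory containing 0x96. However, any such byte is far more likely to be a Windows-1252 character than one of the rarely used extra control characters in ISO 8859, so a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.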
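
As a rough illustration of that difference, the sketch below converts a handful of bytes from the 0x80 to 0x9F range under both labels and prints the resulting code points. The helper name codePointFor and the choice of sample bytes are assumptions made for this example; the sample deliberately skips the few bytes that Windows-1252 leaves unassigned.

```php
<?php
// Return the Unicode code point (as "U+XXXX") that a single byte maps to
// when it is interpreted under the given source encoding.
function codePointFor(int $byte, string $encoding): string
{
    $utf8 = mb_convert_encoding(chr($byte), 'UTF-8', $encoding);
    return sprintf('U+%04X', mb_ord($utf8, 'UTF-8'));
}

// A few bytes in the 0x80-0x9F range that are printable in Windows-1252.
$samples = [0x80, 0x85, 0x91, 0x92, 0x93, 0x94, 0x96, 0x97, 0x99];

foreach ($samples as $byte) {
    printf(
        "0x%02X  ISO-8859-1: %s  Windows-1252: %s (%s)\n",
        $byte,
        codePointFor($byte, 'ISO-8859-1'),
        codePointFor($byte, 'Windows-1252'),
        mb_convert_encoding(chr($byte), 'UTF-8', 'Windows-1252')
    );
}
```

For the byte in the question, 0x96, the ISO-8859-1 column shows the control character U+0096 and the Windows-1252 column shows U+2013, the en dash.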

That makes the solution quite simple:

// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

// Decode the string into its labelled encoding and its string of raw bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;

// If it claims to be ISO-8859-1 (in any letter case), assume it's lying
if ( strcasecmp($input_encoding, 'ISO-8859-1') === 0 ) {
    $input_encoding = 'Windows-1252';
}

// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
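
The snippet above reads only the first decoded element, while a real subject line can decode into several. A minimal sketch of one way to extend it follows. The helper name decodedPartsToUtf8 is hypothetical, and treating the "default" label that imap_mime_header_decode reports for unencoded segments as Windows-1252 is an assumption of this example, not something the function does for you. The parts are built by hand here so the sketch runs without the imap extension.

```php
<?php
// Hypothetical helper: $parts has the shape returned by
// imap_mime_header_decode(), a list of objects with ->charset and ->text.
function decodedPartsToUtf8(array $parts): string
{
    $out = '';
    foreach ($parts as $part) {
        $charset = strtoupper($part->charset);
        // Assumption: unlabelled ("default"), US-ASCII and ISO-8859-1
        // segments are all read as Windows-1252, which agrees with ASCII
        // for every byte below 0x80.
        if (in_array($charset, ['DEFAULT', 'US-ASCII', 'ISO-8859-1'], true)) {
            $charset = 'Windows-1252';
        }
        $out .= mb_convert_encoding($part->text, 'UTF-8', $charset);
    }
    return $out;
}

// Hand-built stand-in for the result of imap_mime_header_decode().
$parts = [
    (object) ['charset' => 'default',    'text' => 'Fwd: '],
    (object) ['charset' => 'iso-8859-1', 'text' => "orkut \x96 convite"],
];

echo decodedPartsToUtf8($parts), "\n";
```

A label that mbstring does not recognise would still make mb_convert_encoding fail, so production code would need its own fallback for that case.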
