List of known troublesome characters that cause PHP to fail to detect the proper character encoding before converting to UTF-8, resulting in lost data

Question

PHP isn't always correct, but what I write always has to be. In this case, an email subject contains an en dash character. This thread is about detecting oddball characters that, when they appear alone (say, among otherwise purely ASCII text), are incorrectly detected by PHP. I've already determined one static example, though my goal here is to create a definitive thread containing as close to drop-in code as we can possibly create.

Here is my starting string from the subject header of an email:

<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>

Typically the next step is to do the following:

$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo

Typically past that point I'd do the following:

$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!

Now I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:

<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!

//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);

//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}

//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>

Now either I've still programmed this incorrectly and there is something in PHP I'm missing (I've tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding as I've done on line 12. In the latter case, a bug report needs to be either created or, if one specific to this issue already exists, updated.

So technically the question is:

How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?

If no such proper answer exists, valid answers include characters that, when they appear among otherwise purely ASCII text, result in PHP failing to detect the correct character encoding and therefore producing an incorrect UTF-8 encoded string. If this issue is fixed in the future and the fix can be validated against all of the oddball characters listed in the other relevant answers, then a proper answer can be accepted.

Answer 1

You are blaming PHP for something that PHP could not possibly solve:

$s1 is an ASCII string; just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
$s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes, and a label which was provided in the input.
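
To make those two points concrete, here is a minimal sketch that checks them without the imap extension. The regular expression is an assumption for illustration: it only handles a single Q-encoded word like the one in the question and is not a general RFC 2047 parser.

```php
<?php
$header = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

// The header as received contains nothing but ASCII characters.
var_dump(mb_check_encoding($header, 'ASCII')); // bool(true)

// Split the encoded word into its charset label and its Q-encoded payload.
preg_match('/^=\?([^?]+)\?Q\?(.*)\?=$/i', $header, $m);
$label = $m[1];

// In Q-encoding an underscore stands for a space; the rest is quoted-printable.
$bytes = quoted_printable_decode(str_replace('_', ' ', $m[2]));

// The decoded payload is raw bytes: "=96" has become the single byte 0x96,
// which is not valid UTF-8 on its own.
var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false)
```

The label ($label) is the only information about the encoding that travels with the bytes; nothing in the bytes themselves says which single-byte code page was intended.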

Your actual problem is that the information you were sent was wrong - the system that sent it to you has made the common mistake of mislabelling Windows-1252 as ISO-8859-1.

The difference between the two encodings is that bytes from 0x80 to 0x9F are control characters in ISO 8859 and (mostly) assigned to printable characters in Windows-1252. Note that there is no way for any system to automatically tell you which interpretation was intended - either way, there is simply a byte in memory containing 0x96. However, it is overwhelmingly more likely that any such bytes are intended to be Windows-1252 characters and not the very rarely used extra control characters from ISO 8859, so a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
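
As a small illustration of that difference, the sketch below converts the same byte under both labels, assuming the mbstring extension is loaded. The variable names are only for this example.

```php
<?php
// The same byte, interpreted under the two labels.
$byte = "\x96";

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$as1252   = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c296   (U+0096, an invisible C1 control character)
echo bin2hex($as1252), "\n";   // e28093 (U+2013, EN DASH)
```

Both conversions succeed, which is why the en dash seems to vanish silently: the ISO-8859-1 reading produces a valid but invisible control character rather than an error.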

That makes the solution really very simple:

// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

// Decode the string into its labelled encoding, and string of bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;

// If it claims to be ISO-8859-1, assume it's lying
if ( $input_encoding === 'ISO-8859-1' ) {
    $input_encoding = 'Windows-1252';
}

// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
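
The snippet above only looks at the first element and compares the label case-sensitively. As a hedged extension, a header may decode into several elements, and labels may arrive in any letter case. The helper below, mime_parts_to_utf8, is a hypothetical name and not part of PHP. It assumes each element has the charset and text properties that imap_mime_header_decode returns, that unencoded segments carry the charset "default", and that every other label is one mbstring recognises; unknown labels are not handled in this sketch.

```php
<?php
// Hypothetical helper: joins decoded MIME header parts into one UTF-8 string.
// Each part is expected to expose ->charset and ->text, in the shape that
// imap_mime_header_decode() returns.
function mime_parts_to_utf8(array $parts): string
{
    $out = '';
    foreach ($parts as $part) {
        $charset = strtoupper($part->charset);
        if ($charset === 'DEFAULT' || $charset === 'US-ASCII') {
            // Unencoded segment: treat it as plain ASCII.
            $charset = 'ASCII';
        } elseif ($charset === 'ISO-8859-1') {
            // Same assumption as above: the label is probably lying.
            $charset = 'Windows-1252';
        }
        $out .= mb_convert_encoding($part->text, 'UTF-8', $charset);
    }
    return $out;
}

// Hand-built parts standing in for the result of imap_mime_header_decode().
$parts = [
    (object) ['charset' => 'default',    'text' => 'Re: '],
    (object) ['charset' => 'iso-8859-1', 'text' => "orkut \x96 convite"],
];
echo mime_parts_to_utf8($parts), "\n";
```

With the imap extension available, the call would look like mime_parts_to_utf8(imap_mime_header_decode($input)).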
