Understanding character encoding in PHP

Question

I am struggling to understand character encoding in PHP.

Consider the following script:

$string = "\xe2\x82\xac";

var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));

mb_internal_encoding("UTF-8");

var_dump($string);
var_dump($utf8string);

I have a string, actually the € character, represented with its unicode code points. Up to PHP 5.5 the default internal encoding is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I used to define the string.

Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).

If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.

What am I missing?

Answer 1

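The script you show contains only ASCII characters (the bytes are written as escape sequences), so the encoding of the script itself makes no difference. mb_internal_encoding() only sets the default encoding that the mbstring functions assume; it plays a part in converting your data on output only when mbstring's output handler is enabled. This question will tell you more about how it works; it will also tell you it's better not to use it.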

The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "unicode code point". The code point is U+20AC, which fits in 16 bits like every character of the Basic Multilingual Plane.
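To make the difference between the UTF-8 bytes and the code point concrete, here is a minimal sketch that decodes the three bytes by hand. The variable names $bytes and $codePoint are only illustrative, and the bit masks assume the input is a well-formed three-byte UTF-8 sequence:

```php
<?php
// The three bytes from the question: the UTF-8 encoding of the Euro sign.
$string = "\xe2\x82\xac";

// unpack('C*') returns a 1-indexed array of unsigned byte values.
$bytes = unpack('C*', $string);

// A three-byte UTF-8 sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx.
// Keep only the payload bits (x) of each byte and concatenate them.
$codePoint = (($bytes[1] & 0x0F) << 12)
           | (($bytes[2] & 0x3F) << 6)
           |  ($bytes[3] & 0x3F);

printf("bytes: %s\n", bin2hex($string));     // bytes: e282ac
printf("code point: U+%04X\n", $codePoint);  // code point: U+20AC
```

The same code point is stored as three bytes in UTF-8 and would be stored as different bytes in another encoding, which is why the two notions should not be mixed up.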

Does this clear up the behavior you see?

Answer 2

You started with a string that is the UTF-8 representation of the Euro symbol. If you run echo($string), all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
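A minimal sketch of the header part, assuming the script is served over HTTP and nothing has been sent to the client yet (the headers_sent() guard is my addition, so that the example also runs quietly from the command line):

```php
<?php
$string = "\xe2\x82\xac"; // UTF-8 bytes of the Euro sign

// Tell the browser to decode the response body as UTF-8.
// Headers can only be set before any output has been sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// PHP writes the three bytes unchanged; the browser does the decoding.
echo $string;
```

PHP does not alter the bytes here. Only the declared charset decides whether the browser shows one Euro sign or three unrelated characters.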

Then you make a wrong move. You call mb_convert_encoding() with only two arguments. This makes PHP use the current internal encoding of the mbstring extension for the third argument ($from_encoding). Why does this matter?

For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is utf-8 and the call to mb_convert_encoding() is a no-op.

But for previous versions of PHP, the default value returned by mb_internal_encoding() is iso-8859-1, which doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes each of them using the rules of utf-8. The outcome is obviously wrong.
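You can reproduce both behaviours on any PHP version by passing the source encoding explicitly instead of relying on the default. This is a small sketch that assumes the mbstring extension is loaded; $misread and $unchanged are just illustrative names:

```php
<?php
$string = "\xe2\x82\xac"; // UTF-8 bytes of the Euro sign

// Wrong source encoding: each byte is treated as one ISO-8859-1 character
// and re-encoded as UTF-8, so the three bytes become six.
$misread = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

// Matching source encoding: the string is already valid UTF-8,
// so the conversion leaves it unchanged.
$unchanged = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

echo bin2hex($misread), "\n";   // c3a2c282c2ac
echo bin2hex($unchanged), "\n"; // e282ac
```

Passing the third argument explicitly keeps the result independent of the internal encoding default, which is what differs between PHP versions.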

Btw, if you initialize $string with '€' in a source file saved as UTF-8, you get the same output on all PHP versions (even on PHP 4, iirc).

solution1
2 2016-04-19 20:27:29

solution2
1 2016-04-19 20:42:11
