Converting from HTML entities to UTF-8

I have a problem converting some encoded strings to UTF-8.

I have a list of strings which, according to the documentation, are Unicode strings encoded using numeric HTML entities. Some of them are:

$str = 'W&#xC3;&#x96;GER'; // seems to be WÖGER
$str = 'J&#xC3;&#xBC;rgen'; // seems to be Jürgen
$str = 'PO&#xC3;&#x9F;NITZ'; // seems to be POßNITZ
$str = 'SCHL&#xC3;&#x84;GER'; // seems to be SCHLÄGER

I want to decode them and convert them to UTF-8.

I tried both mb_convert_encoding() with the HTML-ENTITIES parameter and html_entity_decode(). Unexpectedly, my best result was with:

html_entity_decode($str, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

and that decoded Jürgen successfully. However, I have had no luck decoding the other strings in this list. I looked at the ISO-8859-1 encoding table, and the HTML codes for umlauts there differ from what I have in my list.

My question is: am I missing some obvious decoding step or is there something wrong with the source strings?

Update (2016-06-27): The original strings were indeed incorrectly encoded. They are the result of reading UTF-8 bytes in a Latin-1 context and then encoding each 1-byte character as a hex entity, so the German umlaut ü became Ã¼ and was encoded as two separate characters (&#xC3;&#xBC;). The accepted answer decodes them straight into UTF-8 successfully.
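To make the update concrete, here is a minimal sketch of how such strings could have been produced. The function name encodeBytesAsEntities is hypothetical, and the uppercase hex formatting is an assumption; the point is only that every non-ASCII byte of the UTF-8 string ends up as its own entity.

```php
<?php
// Hypothetical helper: reproduces the faulty encoding described in the update.
function encodeBytesAsEntities(string $utf8): string
{
    $out = '';
    // str_split() walks the string byte by byte, not character by character.
    foreach (str_split($utf8) as $byte) {
        $code = ord($byte);
        // ASCII bytes pass through; every other byte becomes its own hex entity.
        $out .= $code < 0x80 ? $byte : sprintf('&#x%02X;', $code);
    }
    return $out;
}

// "\u{FC}" is ü, stored by PHP as the two UTF-8 bytes C3 BC.
$broken = encodeBytesAsEntities("J\u{FC}rgen");
echo $broken, "\n"; // J&#xC3;&#xBC;rgen
```

Undoing the damage therefore means turning each entity back into a single byte, not into a character.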

My understanding, though I might be wrong, is that Unicode characters should be represented by their code point rather than by encoding the individual UTF-8 bytes, which is what you have. So Ö would be better represented as &#xD6; or, in the named form, &Ouml;.
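The difference between a code point and its UTF-8 bytes can be checked directly. This is only an illustrative sketch; it assumes the mbstring extension is available and PHP 7.2 or later for mb_ord().

```php
<?php
$char = "\u{D6}"; // Ö, stored as UTF-8

// The code point is a single number: U+00D6.
$codePoint = mb_ord($char, 'UTF-8');

// The UTF-8 encoding of that code point is two bytes: C3 96.
$bytes = strtoupper(bin2hex($char));

printf("code point: U+%04X\n", $codePoint);   // code point: U+00D6
printf("UTF-8 bytes: %s\n", $bytes);          // UTF-8 bytes: C396
printf("entity: &#x%X;\n", $codePoint);       // entity: &#xD6;
```

A numeric entity is meant to carry the first number, while the strings in the question carry the two bytes as if each were a code point of its own.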

The ENT_XML1 flag to html_entity_decode does seem to make this work, though I'm not entirely sure what it does under the hood. If you want something more explicit:

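// Turn each two-digit hex entity back into the raw byte it stands for;
// the resulting byte sequence is the UTF-8 string.
$utf8 = preg_replace_callback('/&#x([A-Fa-f0-9]{2});/', function ($m) {
    return chr(hexdec($m[1]));
}, $str);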
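As a rough check, both approaches can be run side by side. This is a sketch under assumptions: decodeByteEntities is a hypothetical wrapper around the callback above, the sample inputs are reconstructed from the description in the question's update (each UTF-8 byte written as a hex entity), and the html_entity_decode() call assumes ISO-8859-1 as the output charset so that each entity yields exactly one byte. A plausible reason ENT_XML1 helps is that HTML 4.01 mode rejects numeric references in the 0x80 to 0x9F range while XML 1.0 permits them, but the source does not confirm this.

```php
<?php
// Hypothetical wrapper around the callback above (arrow function needs PHP 7.4+).
function decodeByteEntities(string $str): string
{
    return preg_replace_callback(
        '/&#x([A-Fa-f0-9]{2});/',
        static fn (array $m): string => chr(hexdec($m[1])), // one entity -> one raw byte
        $str
    );
}

// Assumed inputs: the four names with each UTF-8 byte written as a hex entity.
$samples = [
    'W&#xC3;&#x96;GER',
    'J&#xC3;&#xBC;rgen',
    'PO&#xC3;&#x9F;NITZ',
    'SCHL&#xC3;&#x84;GER',
];

foreach ($samples as $sample) {
    $viaRegex = decodeByteEntities($sample);
    // Built-in alternative: ISO-8859-1 output gives one byte per entity.
    $viaXml1 = html_entity_decode($sample, ENT_QUOTES | ENT_XML1, 'ISO-8859-1');

    printf(
        "%s | valid UTF-8: %s | ENT_XML1 agrees: %s\n",
        $viaRegex,
        mb_check_encoding($viaRegex, 'UTF-8') ? 'yes' : 'no',
        $viaRegex === $viaXml1 ? 'yes' : 'no'
    );
}
```

The mb_check_encoding() call is worth keeping in real code: if some input was not built from UTF-8 bytes in the first place, the decoded result will not be valid UTF-8 and should be handled separately.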
