
PHP: detecting the possible output encoding for a UTF-8 character

I am trying to convert a PHP string from UTF-8 to a required encoding (ISO-8859-2). The problem is that the UTF-8 string contains characters that do not exist in ISO-8859-2; they were converted to UTF-8 from windows-1251 (although they look exactly as if they were native to ISO-8859-2). Those characters come out as "?" in the output.

If I convert the same string to windows-1251 instead, those characters appear correctly, but then the characters native to ISO-8859-2 (like "ä", "ö", etc.) are the ones that go missing.

I get the strings from a MySQL database and need to convert them to a non-Unicode charset and store them in an SQLite database file, because the program that will use them does not support Unicode.

So my question is: is there a way to find out which non-Unicode encodings can represent a given UTF-8 character? I am currently iterating over the whole UTF-8 string and trying to convert each character one by one, but the windows-1251 characters are still missing.

The code looks like this:


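$string = "various charset input";

// str_split_unicode() is the helper from the PHP manual's str_split page;
// it splits a UTF-8 string into an array of characters.
$str = str_split_unicode($string, 1);

$handler = "";

foreach ($str as $value):
    // iconv() returns false when it cannot convert the character.
    // Writing "$x = iconv(...) or '%no%'" never assigns the fallback,
    // because "=" binds tighter than "or", so test for false explicitly.
    $currentChar = @iconv("utf-8", "iso-8859-2", $value);

    if ($currentChar === false):
        $currentChar = @iconv("utf-8", "windows-1251", $value);
    endif;

    if ($currentChar !== false):
        $handler .= $currentChar;
    else:
        // Neither charset can represent the character: keep the UTF-8 bytes.
        $handler .= $value;
    endif;

endforeach;

$string = $handler;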

But the question marks are still there.
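
One way to approach the per-character check, sketched under the assumption that the mbstring extension is available: convert the character to each candidate charset and back, and keep the charsets for which the round trip returns the original character. The function name `charsets_for_char` and the candidate list are illustrative and not part of the code above.

```php
<?php
/**
 * Return the candidate charsets that can represent one UTF-8 character.
 * Hypothetical helper: a character counts as representable when converting
 * it to the charset and back to UTF-8 yields the same character.
 */
function charsets_for_char(string $char, array $candidates): array
{
    $supported = [];
    foreach ($candidates as $charset) {
        // mbstring replaces unrepresentable characters with a substitute
        // character, so the round trip no longer matches the input.
        $bytes = mb_convert_encoding($char, $charset, 'UTF-8');
        if (mb_convert_encoding($bytes, 'UTF-8', $charset) === $char) {
            $supported[] = $charset;
        }
    }
    return $supported;
}

$candidates = ['ISO-8859-2', 'Windows-1251'];
print_r(charsets_for_char('ö', $candidates)); // ISO-8859-2 only
print_r(charsets_for_char('и', $candidates)); // Windows-1251 only
```

A plain ASCII character such as "a" is reported for both charsets, so the result is a list of possibilities rather than a single answer.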

UPDATE

Thanks CertaiN, I edited the function you provided (it may have become less readable though) so it converts the character back to an appropriate encoding.

FUNCTION



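    function utf8_to_multicharset($str, $encoding, $htmSupportedOutput = "iso-8859-15") {

        // Split the UTF-8 string into characters and convert a copy to the main target encoding.
        $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
        $out = $utf8;
        mb_convert_variables($encoding, 'UTF-8', $out);

        // Accept the fallback encodings as an array or as a comma-separated string.
        is_array($htmSupportedOutput) or $htmSupportedOutput = explode(",", $htmSupportedOutput);

        // The table and the flags are separate arguments, not one OR-ed value.
        $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);

        foreach ($out as $i => &$char) {

            if ($char === '?' && $utf8[$i] !== '?') {
                // Not representable in $encoding: fall back to a numeric HTML entity.
                $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
            }
            elseif (isset($table[$char])) {
                $char = $table[$char];
            }

            // Decode the entity into the first fallback encoding that can represent it.
            // ENT_NOQUOTES (0) matches the null that was passed here before.
            foreach ($htmSupportedOutput as $o):
                $char = html_entity_decode($char, ENT_NOQUOTES, $o);
            endforeach;
        }
        unset($char); // break the reference left behind by the foreach

        return implode('', $out);
    }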

It now takes a list of fallback encodings and converts each character that does not fit the main encoding to the first listed encoding that supports it:

Example

PHP usage:


    <?php
       $string = "vatiöus charset иnput";
       $result = utf8_to_multicharset($string,"iso-8859-2","cp1252,cp1251,koi8r");
    ?>
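
The fallback works because `html_entity_decode()` leaves a numeric entity untouched when the code point cannot be represented in the requested single-byte charset. The sketch below isolates that step; the helper name `decode_entity_first_fit` is hypothetical and only meant to illustrate the idea.

```php
<?php
/**
 * Decode one HTML entity into the first charset that can represent it.
 * Returns [charset, bytes], or null when no listed charset fits.
 * Hypothetical helper for illustration only.
 */
function decode_entity_first_fit(string $entity, array $charsets): ?array
{
    foreach ($charsets as $charset) {
        $decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML401, $charset);
        // An unchanged string means the charset has no byte for this code point.
        if ($decoded !== $entity) {
            return [$charset, $decoded];
        }
    }
    return null;
}

// "&#1080;" is the Cyrillic letter "и" (U+0438).
$fit = decode_entity_first_fit('&#1080;', ['cp1252', 'cp1251', 'koi8r']);
var_dump($fit[0]);          // string(6) "cp1251"
var_dump(bin2hex($fit[1])); // string(2) "e8"
```

Note that the resulting string mixes bytes from several charsets, so the consuming program has to know which charset applies to which part.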

Do you need HTML Entity Encoding for them?

Function

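function utf8_to_escaped_another($str, $encoding) {
    // Split into UTF-8 characters and convert a copy to the target encoding.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table and the flags are separate arguments, not one OR-ed value.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // mbstring substituted '?': emit a numeric HTML entity instead.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }
    }
    unset($char); // break the reference left behind by the foreach
    return implode('', $out);
}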
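
The `'HTML-ENTITIES'` pseudo-encoding is deprecated as of PHP 8.2. A minimal sketch of the same idea using `mb_encode_numericentity()` instead is shown here; the function name `escape_with_numeric_entities` is hypothetical, and the sketch assumes mbstring's default substitute character (`?`), just like the function above.

```php
<?php
/**
 * Convert a UTF-8 string to $encoding, replacing characters the target
 * charset cannot represent with decimal numeric HTML entities.
 * Hypothetical variant for illustration; assumes the default mbstring
 * substitute character '?'.
 */
function escape_with_numeric_entities(string $str, string $encoding): string
{
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    // Convert every code point from U+0080 upwards to a numeric entity.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $out = '';
    foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $converted = mb_convert_encoding($char, $encoding, 'UTF-8');
        if ($converted === '?' && $char !== '?') {
            // Unrepresentable in the target charset: emit e.g. "&#12362;".
            $out .= mb_encode_numericentity($char, $convmap, 'UTF-8');
        } else {
            // Representable: escape HTML special characters, keep the rest.
            $out .= $table[$converted] ?? $converted;
        }
    }
    return $out;
}

echo escape_with_numeric_entities('Japanese: おはよう', 'ISO-8859-2');
// Japanese: &#12362;&#12399;&#12424;&#12358;
```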

Example

PHP Source Code

<?php

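function utf8_to_escaped_another($str, $encoding) {
    // Split into UTF-8 characters and convert a copy to the target encoding.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table and the flags are separate arguments, not one OR-ed value.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // mbstring substituted '?': emit a numeric HTML entity instead.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }
    }
    unset($char); // break the reference left behind by the foreach
    return implode('', $out);
}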

header('Content-Type: text/html; charset=ISO-8859-2');

$text = <<<EOD
English: Good Morning
Arabic: صباح الخير
Japanese: おはよう
EOD;

echo '<pre>';
echo utf8_to_escaped_another($text, 'ISO-8859-2');
echo '</pre>';

HTML View

English: Good Morning
Arabic: صباح الخير
Japanese: おはよう

HTML Source Code

<pre>English: Good Morning
Arabic: &#1589;&#1576;&#1575;&#1581; &#1575;&#1604;&#1582;&#1610;&#1585;
Japanese: &#12362;&#12399;&#12424;&#12358;</pre>
