PHP: converting cp1252 / windows-1252 to UTF-8

Question

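I'm attempting to convert our database from latin1 to UTF-8. Unfortunately I can't do one massive switchover, because the application needs to stay online and we have 700GB of database to convert.

So I'm trying to use a small MySQL hack that converts the tables to UTF-8 but not the data. I'd like to read, convert and replace the data in real time (a JIT conversion, if you like).

Our PHP application currently uses all the defaults, so it connects to MySQL with the latin1 character set and throws UTF-8 data at it, which gets stored as if it were latin1. When you view the data using latin1, the UTF-8 characters appear as expected. When you view the data using UTF-8, things go wrong.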
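To make the symptom concrete, here is a minimal sketch of how the garbled text in the test strings below comes about. It assumes the mbstring extension is available and that the latin1 column behaves like Windows-1252 for these bytes; the variable names are only illustrative.

```php
<?php
// The snowman U+2603 is three bytes in UTF-8.
$snowmanUtf8 = "\xE2\x98\x83";

// Reinterpret each of those bytes as a Windows-1252 character and
// re-encode the result as UTF-8. This mimics reading the stored bytes
// through a connection that believes the column holds single-byte text.
$mojibake = mb_convert_encoding($snowmanUtf8, 'UTF-8', 'Windows-1252');

// One character has become three: U+00E2, U+02DC, U+0192.
echo mb_strlen($mojibake, 'UTF-8'), PHP_EOL; // prints 3
echo $mojibake, PHP_EOL;
```

The three resulting characters are the ones that follow "Card" in the first test string.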

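So I've proposed forcing the MySQL character set to UTF-8 and then doing a just-in-time conversion of the data when necessary. Now, seeing that cp1252 / windows-1252 is a subset of UTF-8, it isn't that straightforward (as far as I know) to detect cp1252 / windows-1252 encoding.

I've written the following code, which attempts to detect cp1252 / windows-1252 encoding and convert as necessary. It should also detect properly encoded UTF-8 characters and do nothing.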

$a = 'Cardâ˜ƒ'; //cp1252 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    // This case runs
    $c = mb_strcut($c, 1);
    var_dump($c);
}

$a = 'COD☃'; //proper UTF8 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    // This case runs
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    $c = mb_strcut($c, 1);
    var_dump($c);
}

The output of running this code is:

string 'Card☃' (length=7)
string 'COD☃' (length=6)

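I know that running this on every string that comes out of the database will have a performance impact, which is still to be measured, but if I can do a JIT conversion before switching everything over completely, it will be worth it to me.

Does anyone have any pointers on how to optimize this?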

Answer 1

First, Windows-1252 is not a subset of UTF-8. You could say that ASCII is a subset of UTF-8, but that is usually more of an ideological argument.
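A quick way to see this, as a minimal sketch assuming the mbstring extension is available: the single byte 0x80 is the euro sign in Windows-1252, but on its own it is not a valid UTF-8 sequence, whereas plain ASCII is valid in both.

```php
<?php
// 0x80 is the euro sign in Windows-1252.
$cp1252Euro = "\x80";

// The same byte alone is not well-formed UTF-8.
$isValidUtf8 = mb_check_encoding($cp1252Euro, 'UTF-8');

// Converted properly, the euro sign takes three bytes in UTF-8.
$utf8Euro = mb_convert_encoding($cp1252Euro, 'UTF-8', 'Windows-1252');

// ASCII-only text is byte-for-byte identical in both encodings.
$asciiIsValidUtf8 = mb_check_encoding('Card', 'UTF-8');

var_dump($isValidUtf8);       // bool(false)
echo bin2hex($utf8Euro), PHP_EOL; // prints e282ac
var_dump($asciiIsValidUtf8);  // bool(true)
```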

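Second, it is not possible to handle a string that mixes CP1252 and UTF-8 "characters" (strictly speaking, for CP1252 the unit is a byte and for Unicode it is a code point). Either you read it as CP1252 and treat every Unicode character as a run of single bytes, or you read it as UTF-8 and drop any invalid byte sequences (or get random characters where CP1252 bytes happen to match a Unicode code point). With $c = mb_strcut($c, 1); you are not removing the test character. You are removing the question mark that mb_convert_encoding produced because it could not convert that Unicode character to a CP1252 character.

Third, you should never convert a string and then try to determine its encoding. After conversion, your second test string is ?COD?. There is no reason to check whether it contains a Unicode character, because you have just converted it to CP1252. It cannot contain Unicode characters. As the programmer, you have to know what the output is.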
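Both observations can be checked directly. The sketch below replays the two conversions from the question and assumes the mbstring extension; the substitute character is set to a question mark explicitly (it is normally the default) so the result does not depend on local configuration.

```php
<?php
// Make the replacement for unmappable characters explicit: '?'.
mb_substitute_character(0x3F);

$snowman = "\u{2603}"; // the known UTF-8 test character

// First test string from the question: "Card" followed by U+00E2, U+02DC, U+0192.
$mojibake = "Card\u{00E2}\u{02DC}\u{0192}";
$first = mb_convert_encoding($snowman . $mojibake, 'Windows-1252', 'UTF-8');

// The leading snowman has no CP1252 equivalent, so it became '?'.
// The three trailing characters became the bytes E2 98 83, which only
// look like a snowman if the result is then read as UTF-8 again.
echo bin2hex($first), PHP_EOL; // prints 3f43617264e29883

// Second test string from the question: both snowmen become '?'.
$second = mb_convert_encoding($snowman . 'COD' . $snowman, 'Windows-1252', 'UTF-8');
echo $second, PHP_EOL; // prints ?COD?
```

So the byte that mb_strcut($c, 1) strips in the question is the question mark, not the test character.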

The only solution is to check whether the string is CP1252, convert the problematic characters to placeholders, and then convert the string to Unicode:

function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */ 

        /*$replace = array(
            128 => "&#x20AC;",      // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",              // UNDEFINED
            130 => "&#x201A;",      // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "&#x0192;",      // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => "&#x201E;",      // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "&#x2026;",      // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "&#x2020;",      // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "&#x2021;",      // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "&#x02C6;",      // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "&#x2030;",      // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "&#x0160;",      // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "&#x2039;",      // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "&#x0152;",      // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",              // UNDEFINED
            142 => "&#x017D;",      // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON 
            143 => "",              // UNDEFINED
            144 => "",              // UNDEFINED
            145 => "&#x2018;",      // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK 
            146 => "&#x2019;",      // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "&#x201C;",      // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "&#x201D;",      // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "&#x2022;",      // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "&#x2013;",      // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "&#x2014;",      // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "&#x02DC;",      // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "&#x2122;",      // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "&#x0161;",      // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => "&#x203A;",      // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "&#x0153;",      // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",              // UNDEFINED
            158 => "&#x017e;",      // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "&#x0178;",      // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );*/

        if (count($replace) != 0) {
            $find = array();
            foreach (array_keys($replace) as $key) {
                $find[] = chr($key);
            }
            $input = str_replace($find, array_values($replace), $input);
        }
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        if (count($replace) != 0) {
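            // Decode the placeholder entities into UTF-8 explicitly, so the result
            // does not depend on the PHP version's default flags or charset.
            $input = html_entity_decode($input, ENT_COMPAT | ENT_HTML401, 'UTF-8');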
        }
    }
    return $input;
}

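The trick is that you have to check for both ISO-8859-1 and CP1252, because they are so similar. I found this question after playing with this function for a few hours, and only this answer could save me. If you find this function helpful, please go and +1 that answer.

Basically, this function replaces all those bad CP1252 bytes with HTML entities that represent the corresponding Unicode characters. Then we convert the string from ISO-8859-1 / CP1252 to UTF-8, and none of our new Unicode characters get mangled, because at that point they are plain ASCII characters. Finally, we decode the HTML entities and end up with a 100% Unicode string.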
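For the just-in-time case in the question, one possible guard is to leave a string alone when it is already well-formed UTF-8 and only otherwise treat it as Windows-1252. This is a minimal sketch, not part of the function above: ensure_utf8 is a hypothetical name, it assumes the mbstring extension, and it is a heuristic, since some CP1252 byte sequences are also valid UTF-8. It also does not repair text that has already been double-encoded.

```php
<?php
/**
 * Hypothetical helper: return $value unchanged when it is already valid
 * UTF-8, otherwise assume it is Windows-1252 and convert it.
 */
function ensure_utf8(string $value): string
{
    // Empty strings and well-formed UTF-8 (including pure ASCII) pass through.
    if ($value === '' || mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    // Anything else is assumed (not proven) to be Windows-1252.
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// Already UTF-8: returned byte for byte.
$kept = ensure_utf8("COD\xE2\x98\x83");

// CP1252 bytes: 0xE9 is e-acute, 0x93 and 0x94 are curly double quotes.
$cafe   = ensure_utf8("caf\xE9");
$quoted = ensure_utf8("\x93hi\x94");

echo bin2hex($cafe), PHP_EOL; // prints 636166c3a9
```

Checking validity before converting avoids the problem of deciding the encoding after the string has already been changed.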

Solution 1
15 2014-04-23 16:22:07
