How can I easily detect UTF-8 encoding in a string?

Question

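I have a string filled with data from another program, and that data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant, https://stackoverflow.com/questions/..., but the comments there say the solution does not detect UTF-8 100% reliably. That matters because if I encode a string that already contains UTF-8 data to UTF-8 again, I will write the wrong text to the database.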

So can I use this UTF-8 detection?

bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

If the string is not detected as UTF-8, I use this code to convert it to UTF-8:

     string text;
     if(!is_utf8(EscReason.c_str()))
     {
        int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            text.length(), 0, 0);
        std::wstring utf16_str(size, '\0');

        MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            text.length(), &utf16_str[0], size);

        int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            utf16_str.length(), 0, 0, 0, 0);

        std::string utf8_str(utf8_size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

        text = utf8_str;
     }

Or is the code above not done correctly? Also, I am doing this on Windows 7. What about Ubuntu? Does this variant work there?

Answer 1

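Comparing whole byte values is not the correct way to detect UTF-8. You have to analyze the actual bit patterns of each byte. UTF-8 uses a very distinct bit pattern that no other encoding uses. Try something more like this: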

bool is_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            bytes += 1;
        }
    }

    return true;
}

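Now, this does not take into account illegal UTF-8 sequences, such as overlong encodings, UTF-16 surrogates, and code points above U+10FFFF. If you want to make sure the UTF-8 is both valid and correct, you need something more like this: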

bool is_valid_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    unsigned int cp;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            cp = (*bytes & 0x7F);
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            cp = (*bytes & 0x1F);
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            cp = (*bytes & 0x0F);
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            cp = (*bytes & 0x07);
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*bytes & 0x3F);
            bytes += 1;
        }

        if ((cp > 0x10FFFF) ||
            ((cp >= 0xD800) && (cp <= 0xDFFF)) ||
            ((cp <= 0x007F) && (num != 1)) ||
            ((cp >= 0x0080) && (cp <= 0x07FF) && (num != 2)) ||
            ((cp >= 0x0800) && (cp <= 0xFFFF) && (num != 3)) ||
            ((cp >= 0x10000) && (cp <= 0x1FFFFF) && (num != 4)))
            return false;
    }

    return true;
}
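
Both functions above rely on a terminating NUL, so they cannot check buffers that contain embedded zero bytes. As a minimal sketch of the same checks on a buffer of known length (the name `is_valid_utf8_n` and its signature are assumptions for illustration, not part of the answer):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical length-aware variant of is_valid_utf8 (illustrative name).
// Applies the same rules: correct lead/continuation bit patterns, no overlong
// encodings, no UTF-16 surrogates, nothing above U+10FFFF.
bool is_valid_utf8_n(const char* data, std::size_t len)
{
    if (!data)
        return len == 0;

    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0, 0x0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < len)
    {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::uint32_t cp;
        std::size_t num;

        if ((lead & 0x80) == 0x00)      { cp = lead;        num = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; num = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; num = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; num = 4; }
        else
            return false;               // stray continuation byte or 0xF8..0xFF

        if (num > len - i)
            return false;               // sequence is cut off by the end of the buffer

        for (std::size_t k = 1; k < num; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;           // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[num] ||                 // overlong encoding
            cp > 0x10FFFF ||                    // beyond the Unicode range
            (cp >= 0xD800 && cp <= 0xDFFF))     // UTF-16 surrogate
            return false;

        i += num;
    }
    return true;
}
```

With this sketch, `is_valid_utf8_n("\xC3\xA9", 2)` is accepted, while the overlong pair `"\xC0\x80"` and the surrogate `"\xED\xA0\x80"` are rejected.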

Answer 2

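You probably don't understand UTF-8 and the alternatives. A byte has only 256 possible values. That is not a lot, considering the number of characters. As a result, many byte sequences are both valid UTF-8 strings and valid strings in other encodings.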

In fact, every ASCII string is, by design, a valid UTF-8 string with essentially the same meaning. Your code would return true for is_utf8("Hello").

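Even many other non-UTF-8, non-ASCII strings share their byte sequence with a valid UTF-8 string. And there is no way to convert a non-UTF-8 string to UTF-8 without knowing exactly which non-UTF-8 encoding it is in. Even Latin-1 and Latin-2 are already quite different. CP_ACP is even worse than Latin-1: CP_ACP is not even the same everywhere.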
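
To make the ambiguity concrete: the two bytes 0xC3 0xA9 are valid UTF-8 for "é", but they are also a valid Latin-1 string, "Ã©". The sketch below uses a hypothetical helper, `latin1_to_utf8` (a name chosen for illustration), which relies on the fact that Latin-1 byte values equal the code points U+0000 to U+00FF. It shows what happens when bytes that were already UTF-8 get converted again:

```cpp
#include <string>

// Hypothetical helper (illustrative name): converts ISO-8859-1 (Latin-1)
// bytes to UTF-8. Each Latin-1 byte value is the code point itself, so bytes
// below 0x80 are copied and all others become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in)
    {
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // 110000xx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // 10xxxxxx
        }
    }
    return out;
}
```

Converting the single Latin-1 byte `"\xE9"` gives `"\xC3\xA9"`, as intended. Converting `"\xC3\xA9"` a second time gives `"\xC3\x83\xC2\xA9"`, which displays as "Ã©". A validity check cannot tell which of the two interpretations the sender meant.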

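Your text must go into the database as UTF-8. So if it is not UTF-8 yet, it must be converted, and you must know the exact source encoding. There is no magic way around that.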

On Linux, iconv is the usual way to convert between two encodings.
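
As a rough sketch of what that can look like from C++, the wrapper below calls the POSIX `iconv_open`, `iconv` and `iconv_close` functions. The wrapper name `convert_encoding`, the buffer size and the error handling are assumptions for illustration. The encoding names that are accepted depend on the iconv implementation installed, and some platforms need `-liconv` at link time.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around POSIX iconv(3); name and error handling are
// illustrative. The caller must supply the real source encoding.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);   // note: destination encoding comes first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char buffer[256];
    char* in_ptr = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t in_left = input.size();

    while (in_left > 0)
    {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;           // save before any other call can change it

        output.append(buffer, sizeof(buffer) - out_left);

        // E2BIG only means the temporary buffer filled up; loop and continue.
        if (rc == (std::size_t)-1 && err != E2BIG)
        {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Stateful target encodings would also need a final flush call here;
    // UTF-8 is stateless, so this sketch skips it.
    iconv_close(cd);
    return output;
}
```

For example, `convert_encoding(text, "ISO-8859-1", "UTF-8")` converts Latin-1 input, assuming the installed iconv knows both names.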


Answer 1: 10, 2015-02-04 00:51:56

Answer 2: 6, accepted, 2015-02-02 09:29:22
