C++ iterate or split UTF-8 string into array of symbols?

I'm searching for a way to iterate over a UTF-8 string, or split it into an array of UTF-8 symbols, that is independent of both platform and third-party libraries.

Please post a code snippet.

Solved: C++ iterate or split UTF-8 string into array of symbols?

If I understand correctly, it sounds like you want to find the start of each UTF-8 character. If so, parsing them is fairly straightforward (interpreting them is a different matter). The number of octets in each character is well-defined by the RFC (RFC 3629):

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
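
One consequence of the table is that every continuation octet has the form 10xxxxxx, so any octet that does not match that pattern begins a new character. A minimal sketch of locating character starts this way, assuming the input is already well-formed UTF-8 (`utf8_starts` is a hypothetical helper name, not part of any library):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offset at which each character begins.
// Assumes s is well-formed UTF-8; no validation is performed.
std::vector<std::size_t> utf8_starts(const std::string& s) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80)   // not 10xxxxxx, so this octet begins a character
            starts.push_back(i);
    }
    return starts;
}
```

The size of the returned vector is the number of characters, and the difference between two neighbouring offsets is the octet count of the character between them.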

For example, if lb holds the first octet of a UTF-8 character, I think the following would determine the number of octets involved.

unsigned char lb;

if (( lb & 0x80 ) == 0 )          // lead bit is zero, must be a single ascii
   printf( "1 octet\n" );
else if (( lb & 0xE0 ) == 0xC0 )  // 110x xxxx
   printf( "2 octets\n" );
else if (( lb & 0xF0 ) == 0xE0 ) // 1110 xxxx
   printf( "3 octets\n" );
else if (( lb & 0xF8 ) == 0xF0 ) // 1111 0xxx
   printf( "4 octets\n" );
else
   printf( "Unrecognized lead byte (%02x)\n", lb );

Ultimately, though, you are going to be much better off using an existing library, as suggested in another post. The above code categorizes characters by their octet count, but it doesn't help you "do" anything with them once that is finished.
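
For the narrower task in the question, splitting a string into symbols, the lead-byte test can be turned into a loop. This is only a sketch under stated assumptions: `utf8_length` and `split_utf8` are hypothetical names, the input is a `std::string` of UTF-8 octets, and validation is limited to lead and continuation bit patterns (overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: octet count implied by lead byte lb, or 0 if invalid.
std::size_t utf8_length(unsigned char lb) {
    if ((lb & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lb & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lb & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lb & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                           // continuation byte or invalid lead byte
}

// Hypothetical helper: splits s into one std::string per UTF-8 character.
// Throws std::invalid_argument on a bad lead byte, a truncated sequence,
// or a malformed continuation byte.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> symbols;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t n = utf8_length(static_cast<unsigned char>(s[i]));
        if (n == 0 || n > s.size() - i)
            throw std::invalid_argument("invalid or truncated UTF-8 sequence");
        for (std::size_t k = 1; k < n; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                throw std::invalid_argument("malformed continuation byte");
        symbols.push_back(s.substr(i, n));  // copy the n octets of this character
        i += n;
    }
    return symbols;
}
```

Each element of the result is itself a valid UTF-8 string holding exactly one character, so it can be printed or compared directly. Note that a "symbol" here means one code point; combining sequences that render as a single glyph are still split apart.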

Solved using the tiny platform-independent UTF8 CPP library:

    // Requires UTF8-CPP ("utf8.h") plus <cstdint> and <string>; text is a std::string.
    const char* str_i = text.c_str();       // iterator into the utf-8 string
    const char* end = str_i + text.size();  // one past the last octet

    while ( str_i < end )
    {
        // decode the next code point and advance str_i past it
        // (utf8::next throws on invalid input)
        uint32_t code = utf8::next(str_i, end);

        unsigned char symbol[5] = {0};  // up to 4 octets plus a terminating NUL
        utf8::append(code, symbol);     // re-encode the code point into symbol

        // ... do something with symbol
    }

UTF8 CPP is exactly what you want.

Try the ICU library.

Off the cuff:

#include <string>
#include <vector>

// Decodes the UTF-8 string s and appends its code points to u.
// Returns the number of bytes converted. On success this equals s.length().
// On error it is the index of the byte where decoding failed.
// Remember to check the success flag: a sequence truncated at the end of
// the string also returns s.length().
// Note: overlong encodings and surrogates are not rejected.
int convert(std::vector<int>& u, const std::string& s, bool& success) {
    success = false;
    int cp = 0;      // code point being assembled
    int runlen = 0;  // continuation bytes still expected
    for (std::string::const_iterator it = s.begin(), end = s.end(); it != end; ++it) {
        int ch = static_cast<unsigned char>(*it);
        if (runlen > 0) {
            // inside a multi-byte sequence, only 10xxxxxx is allowed
            if ((ch & 0xc0) != 0x80) return static_cast<int>(it - s.begin());
            cp = (cp << 6) + (ch & 0x3f);
            if (--runlen == 0) {
                u.push_back(cp);
                cp = 0;
            }
        }
        else if (ch < 0x80)  { u.push_back(ch); }              // 0xxxxxxx
        else if (ch >= 0xf8) return static_cast<int>(it - s.begin()); // invalid lead byte
        else if (ch >= 0xf0) { cp = ch & 0x07; runlen = 3; }   // 11110xxx
        else if (ch >= 0xe0) { cp = ch & 0x0f; runlen = 2; }   // 1110xxxx
        else if (ch >= 0xc0) { cp = ch & 0x1f; runlen = 1; }   // 110xxxxx
        else return static_cast<int>(it - s.begin());          // stray continuation byte
    }
    success = runlen == 0; // verify we ended between code points
    return static_cast<int>(s.length());
}
