How to "decode" a UTF-8 character?

Question

Suppose I want to write a function that compares two Unicode characters. How should I do that? I have read a few articles (like this one) but still don't get it. Take € as input. It lies in the range 0x0800 to 0xFFFF, so UTF-8 encodes it in 3 bytes. How do I decode it? Do I use bitwise operations to extract the 3 bytes from a wchar_t and store them in 3 chars? An example in C would be great.

Here is my C code that tries to "decode" it, but it obviously prints the wrong values:

#include <stdio.h>
#include <wchar.h>

// support for UTF8 which encodes up to 4 bytes only
struct Bytes
{
    char v1;
    char v2;
    char v3;
    char v4;
};

void printbin(unsigned n);
int length(wchar_t c);
void print(struct Bytes *b);

int main(void)
{
    struct Bytes bytes = { 0 };
    wchar_t c = '€';
    int len = length(c);

    //c = 11100010 10000010 10101100
    bytes.v1 = (c >> 24) << 4; // get first byte and remove leading "1110"
    bytes.v2 = (c >> 16) << 5; // skip over first byte and get 000010 from 10000010
    bytes.v3 = (c >> 8)  << 5; // skip over first two bytes and get 101100 from 10101100
    print(&bytes);

    return 0;
}

void print(struct Bytes *b)
{
    int v1 = (int) (b->v1);
    int v2 = (int)(b->v2);
    int v3 = (int)(b->v3);
    int v4 = (int)(b->v4);

    printf("v1 = %d\n", v1);
    printf("v2 = %d\n", v2);
    printf("v3 = %d\n", v3);
    printf("v4 = %d\n", v4);
}

int length(wchar_t c)
{
    if (c >= 0 && c <= 0x007F)
        return 1;
    if (c >= 0x0080 && c <= 0x07FF)
        return 2;
    if (c >= 0x0800 && c <= 0xFFFF)
        return 3;
    if (c >= 0x10000 && c <= 0x1FFFFF)
        return 4;
    if (c >= 0x200000 && c <= 0x3FFFFFF)
        return 5;
    if (c >= 0x4000000 && c <= 0x7FFFFFFF)
        return 6;

    return -1;
}

void printbin(unsigned n)
{
    if (!n)
        return;

    printbin(n >> 1);
    printf("%c", (n & 1) ? '1' : '0');
}

Answer 1

Comparing UTF-8 encoded characters directly is not at all easy, so it is best not to try. Either:

Convert them both to a wide format (a 32-bit integer) and compare them arithmetically. See wstring_convert or your favorite vendor-specific function; or
Convert them into one-character strings and use a function that compares UTF-8 encoded strings. There is no standard way to do this in C++, but it is the preferred method in other languages such as Ruby and PHP.
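
For the first option, here is a minimal sketch in C of turning one UTF-8 sequence into a 32-bit code point that can then be compared arithmetically. The name utf8_decode and its signature are hypothetical, not part of any standard library, and the sketch assumes the input is a plain byte buffer whose length is known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with n bytes
 * available. On success, stores the code point in *out and returns the number
 * of bytes consumed (1 to 4). Returns 0 if the sequence is truncated,
 * malformed, overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    /* The lead byte gives the sequence length and the first payload bits. */
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (len > n)
        return 0;

    /* Each continuation byte looks like 10xxxxxx and adds 6 payload bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min_cp[len])               return 0; /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)   return 0; /* UTF-16 surrogate */
    if (cp > 0x10FFFF)                  return 0; /* beyond Unicode */

    *out = cp;
    return len;
}
```

For €, the bytes E2 82 AC give 0010, 000010 and 101100 as payload bits, which concatenate to 0x20AC. Two characters decoded this way can be compared with the ordinary integer operators.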

Just to make it clear, the hard part is taking raw bits/bytes/characters encoded as UTF-8 and comparing them. This is because the comparison has to take the encoding into account to know whether to compare 8 bits, 16 bits, or more. If you can somehow turn the raw data into a null-terminated string, then the comparison is trivially easy using the regular string functions. This string may be more than one byte/octet in length, but it will represent a single character/code point.
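
As a rough illustration of the null-terminated string approach, the sketch below copies the first character of a UTF-8 string into a small buffer so that it can be handed to strcmp. The helpers utf8_seq_len and utf8_first_char are hypothetical names, and the sketch assumes the input is already valid UTF-8 (continuation bytes are not checked beyond detecting an early terminator).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: byte length of a UTF-8 sequence, judged from its lead
 * byte alone. Returns 0 for a byte that cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper: copy the first character of the NUL-terminated UTF-8
 * string src into out, which must have room for 4 bytes plus the terminator.
 * Returns the number of bytes copied, or 0 (with out set to "") if src is
 * empty, starts with an invalid lead byte, or ends in mid-sequence. */
size_t utf8_first_char(const char *src, char out[5])
{
    size_t len, i;

    assert(src != NULL && out != NULL);
    len = utf8_seq_len((unsigned char)src[0]);
    for (i = 0; i < len; i++) {
        if (src[i] == '\0') {   /* string ended before the sequence did */
            out[0] = '\0';
            return 0;
        }
        out[i] = src[i];
    }
    out[len] = '\0';
    return len;
}
```

Once both characters are in such buffers, strcmp(a, b) == 0 tests equality. Because strcmp compares bytes as unsigned char and UTF-8 preserves code point order byte by byte, the sign of the result also follows code point order.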

Windows is something of a special case. Its wide characters are 16 bits (a short int). Historically this meant UCS-2, but it has since been redefined as UTF-16. This means that all valid characters in the Basic Multilingual Plane (BMP) can be compared directly, since each occupies a single short int, but characters outside the BMP cannot. I am not aware of any simple way to deal with 32-bit wide characters (represented as a plain int) outside the BMP on Windows.
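
One possible workaround, sketched here with plain arithmetic rather than any Windows API, is to combine a UTF-16 surrogate pair into a single 32-bit code point before comparing. The function utf16_decode is a hypothetical name, and the sketch assumes the code units are available as an array of uint16_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from n UTF-16 code units at u.
 * Returns the number of units consumed (1 for a BMP character, 2 for a
 * surrogate pair), or 0 if n is 0 or the surrogate is unpaired. */
size_t utf16_decode(const uint16_t *u, size_t n, uint32_t *out)
{
    assert(u != NULL && out != NULL);
    if (n == 0)
        return 0;

    /* Anything outside D800..DFFF is a complete BMP character on its own. */
    if (u[0] < 0xD800 || u[0] > 0xDFFF) {
        *out = u[0];
        return 1;
    }

    /* A high surrogate (D800..DBFF) must be followed by a low one (DC00..DFFF).
     * Each carries 10 bits of the value above 0x10000. */
    if (u[0] <= 0xDBFF && n >= 2 && u[1] >= 0xDC00 && u[1] <= 0xDFFF) {
        *out = 0x10000
             + (((uint32_t)(u[0] - 0xD800) << 10) | (uint32_t)(u[1] - 0xDC00));
        return 2;
    }

    return 0; /* lone or misordered surrogate */
}
```

With this, the pair D83D DE00 becomes 0x1F600, and the resulting values can be compared like any other integers.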
