How to decode/encode a UTF-8 char in C++ without wchar_t

As the title states, I am trying to decode/encode UTF-8 characters to a char, but I want to do it without using wchar_t or the like. I want to do the legwork myself, so that I know I understand it, which I obviously don't, or it would be working. I've spent about a week toying with it and am just not making progress.

I have tried several approaches, yet they always seem to produce incorrect results. My latest attempt:

    ifstream ifs(FILENAME);
    if (!ifs) {
        cerr << "Open: " << FILENAME << "\n";
        exit(1);
    }

    char in;

    while (ifs >> std::noskipws >> in) {
        int sz = 1;
        if ((in & 0xc0) == 0xc0) // 0xc0 = 0b11000000
        {
            sz++;
            if ((in & 0xE0) == 0xE0) // 0xE0 = 0b11100000
            {
                sz++;
                if ((in & 0xF0) == 0xF0) // 0xF0 = 0b11110000
                    sz++;
            }
        }
        cout << sz << endl;

        unsigned int a = in;
        for (int i = 1; i < sz; i++) {
            ifs >> in;
            a += in;
        }

Why does this code not work? I simply do not understand.

EDIT: Copy+paste spaghetti... there were two different variable names.

It appears that you're testing the wrong value. Your loop reads into a variable named in, but you are testing against some value named c.

When you read the additional characters, you're also going about it wrong. You're using some value named length instead of, presumably, sz. And you're adding the characters to an integer (which is not necessarily 32 bits, by the way) instead of shifting and combining them with bitwise OR.

Those are weird mistakes. Perhaps you didn't paste your real code into the question, or you actually have these values lying around in the scope of your function.
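
To make the difference between adding and shift-and-OR concrete, here is a small sketch. The helper names combine_by_add and combine_by_shift are made up for illustration and take already-unsigned bytes; they are not part of the question's code.

```cpp
#include <cstdint>

// Hypothetical helpers, for illustration only.

// Roughly what the question's loop does: sums the raw bytes,
// so the position of each byte is lost.
std::uint32_t combine_by_add(const unsigned char* bytes, int n) {
    std::uint32_t a = 0;
    for (int i = 0; i < n; ++i) a += bytes[i];
    return a;
}

// Shift the accumulated value left by 8 bits, then OR in the next byte,
// so every byte keeps its own position in the result.
std::uint32_t combine_by_shift(const unsigned char* bytes, int n) {
    std::uint32_t a = 0;
    for (int i = 0; i < n; ++i) a = (a << 8) | bytes[i];
    return a;
}
```

With the two bytes 0xC3 0xA9, addition gives 0x16C, and it gives the same result for 0xA9 0xC3, so different sequences collide. Shifting gives 0xC3A9 and 0xA9C3 respectively.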

I would also suggest rearranging your branching, which is a bit hard to follow. The rule, according to your code, is:

mask     |   sz
---------+-------
0xxxxxxx | 1
10xxxxxx | 1
110xxxxx | 2
1110xxxx | 3
1111xxxx | 4

You could define a simple table to select a size based on the upper 4 bits.

int sizes[16];
std::fill( sizes, sizes+16, 1 );
sizes[0xc] = 2;
sizes[0xd] = 2;
sizes[0xe] = 3;
sizes[0xf] = 4;
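
If it helps, the same lookup can be wrapped in a function so the table is built once and the indexing is in one place. This is only a sketch; the name utf8_sequence_length is an assumption, and it keeps the rule from the table above, including treating a continuation byte (10xxxxxx) as size 1.

```cpp
#include <array>

// Hypothetical wrapper around the 16-entry table described above.
// Index = upper 4 bits of the lead byte; value = sequence length in bytes.
int utf8_sequence_length(unsigned char lead) {
    static constexpr std::array<int, 16> sizes = {
        1, 1, 1, 1, 1, 1, 1, 1,   // 0xxxxxxx: single byte
        1, 1, 1, 1,               // 10xxxxxx: continuation byte, treated as 1 here
        2, 2,                     // 110xxxxx
        3,                        // 1110xxxx
        4                         // 1111xxxx
    };
    return sizes[lead >> 4];
}
```

Taking an unsigned char parameter means the caller's char is converted before the shift, so a negative char cannot produce a negative index.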

In your loop, let's fix the c and length things, use the size table to avoid silly branching, use istream::get instead of the stream input operator (>>), and combine the characters into a single value in a more normal way.

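for( char c; ifs.get(c); )
{
    // Select correct character size (bytes)
    int sz = sizes[static_cast<unsigned char>(c) >> 4];

    // Construct character (convert via unsigned char so that a
    // negative char is not sign-extended)
    char32_t val = static_cast<unsigned char>(c);
    while( --sz > 0 && ifs.get(c) )
    {
        val = (val << 8) | (static_cast<char32_t>(c) & 0xff);
    }

    // Output character value in hex, unless error.
    // Needs <iomanip>; the cast makes the value print as a number.
    if( ifs )
    {
        std::cout << std::hex << std::setfill('0') << std::setw(8)
                  << static_cast<unsigned long>(val) << std::endl;
    }
}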

Now, this last part concatenates the bytes in big-endian order. I don't know if this is correct, as I haven't read the standard. But it's much more correct than just adding the values together. It also uses a datatype that is guaranteed to be at least 32 bits wide, unlike the unsigned int you used.
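
On the open question of whether plain concatenation is correct: in UTF-8 the marker bits (the 110/1110/11110 prefix of the lead byte and the 10 prefix of each continuation byte) are not part of the code point, so only the remaining payload bits are joined, 6 bits per continuation byte. Below is a minimal sketch of that step. It works on a std::string rather than a stream, the name decode_utf8 and its signature are assumptions for illustration, and it does not reject overlong encodings or surrogate values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes one UTF-8 sequence starting at s[pos].
// On success stores the code point in `out`, advances `pos` past the
// sequence, and returns true. Sketch only: no overlong/surrogate checks.
bool decode_utf8(const std::string& s, std::size_t& pos, std::uint32_t& out) {
    if (pos >= s.size()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);

    int extra;           // number of continuation bytes expected
    std::uint32_t cp;    // payload bits collected so far
    if      (lead < 0x80)           { extra = 0; cp = lead; }         // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }  // 11110xxx
    else return false;   // stray continuation byte or invalid lead byte

    // Not enough bytes left for the whole sequence.
    if (s.size() - pos <= static_cast<std::size_t>(extra)) return false;

    for (int i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);           // append 6 payload bits
    }
    pos += static_cast<std::size_t>(extra) + 1;
    out = cp;
    return true;
}
```

For example, the three bytes E2 82 AC decode to U+20AC this way, whereas concatenating the raw bytes gives 0xE282AC.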
