
How to decode/encode a UTF-8 char in C++ without wchar_t

As the title states, I am attempting to decode/encode UTF-8 characters to a char, but I want to do it without using wchar_t or the like. I want to do the leg work myself. This way I know that I understand it, which I obviously don't or it would be working. I've spent about a week toying with it and am just not making progress.

I have tried several ways, yet always seem to produce incorrect results. My latest attempt:

ifstream ifs(FILENAME);
if (!ifs) {
    cerr << "Open: " << FILENAME << "\n";
    exit(1);
}

char in;

while (ifs >> std::noskipws >> in) {
    int sz = 1;
    if ((in & 0xc0) == 0xc0)        // 0xc0 = 0b11000000
    {
        sz++;
        if ((in & 0xE0) == 0xE0)    // 0xE0 = 0b11100000
        {
            sz++;
            if ((in & 0xF0) == 0xF0) // 0xF0 = 0b11110000
                sz++;
        }
    }
    cout << sz << endl;

    unsigned int a = in;
    for (int i = 1; i < sz; i++) {
        ifs >> in;
        a += in;
    }
}

Why does this code not work? I simply do not understand.

EDIT: Copy+Paste spaghetti...two different var names

It appears that you're testing the wrong value. Your loop is reading into the variable in, but you are testing against some value named c.

When you read in additional characters, you're also going about it wrong. You're using some value length instead of, presumably, sz. And you're adding characters to an integer (which is not necessarily 32 bits, by the way) instead of shifting and combining with bitwise OR.

Those are weird mistakes. Perhaps you didn't paste your real code in the question, or you actually have these values lying around in scope of your function.

I would also suggest rearranging your branching, which is a bit obtuse. The rule, according to your code, is:

mask     |   sz
---------+-------
0xxxxxxx | 1
10xxxxxx | 1
110xxxxx | 2
1110xxxx | 3
1111xxxx | 4

You could define a simple table to select a size based on the upper 4 bits.

int sizes[16];
std::fill( sizes, sizes+16, 1 );
sizes[0xc] = 2;
sizes[0xd] = 2;
sizes[0xe] = 3;
sizes[0xf] = 4;

In your loop, let's fix the c and length things, use the size table to avoid silly branching, use istream::get instead of the stream input operator (>>), and combine the characters into a single value in a more normal way.

for( char c; ifs.get(c); )
{
    // Select correct character size (bytes)
    int sz = sizes[static_cast<unsigned char>(c) >> 4];

    // Construct character
    char32_t val = static_cast<unsigned char>(c); // avoid sign-extension of lead bytes >= 0x80
    while( --sz > 0 && ifs.get(c) )
    {
        val = (val << 8) | (static_cast<char32_t>(c) & 0xff);
    }

    // Output character value in hex, unless error.
    if( ifs )
    {
        std::cout << std::hex << std::setfill('0') << std::setw(8) << val << std::endl;
    }
}

Now, this last part concatenates the bytes in big-endian order. I don't know if this is correct, as I haven't read the standard. But it's much more correct than just adding values together. It also uses a guaranteed 32-bit datatype, unlike the unsigned int you used.
