
Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. What I'm currently trying to do is encode an input string from the user and then convert it to hex. I can do it properly as long as the string contains no Vietnamese characters, e.g. when my inputString is "Hello". But when I input a string such as "Tôi", I don't know how to handle it.

    enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };

    switch (encoding) // encoding holds one of the Encodings values
                      // (the original switched on the type name, which does not compile)
    {
    case USASCII:
        ASCIIToHex(inputString, &ascii);        // "hello" -> 48656C6C6F
        return new ByteField(ascii.c_str());
    case ISO88591:
        ASCIIToHex(inputString, &ascii);        // "hello" -> 48656C6C6F
                                                // "tôi"   -> 54F469
        return new ByteField(ascii.c_str());
    case UTF8:
        ASCIIToHex(inputString, &ascii);        // "hello" -> 48656C6C6F
                                                // "tôi"   -> 54C3B469
        return new ByteField(ascii.c_str());
    case UTF16BE:
        ToUTF16(inputString, &ascii, encoding); // "hello" -> 00480065006C006C006F
                                                // "tôi"   -> 005400F40069
        return new ByteField(ascii.c_str());
    case UTF16:
        ToUTF16(inputString, &ascii, encoding); // "hello" -> FEFF00480065006C006C006F
                                                // "tôi"   -> FEFF005400F40069
        return new ByteField(ascii.c_str());
    case UTF16LE:
        ToUTF16(inputString, &ascii, encoding); // "hello" -> 480065006C006C006F00
                                                // "tôi"   -> 5400F4006900
        return new ByteField(ascii.c_str());
    }

// Appends the hex representation of each byte of s to *result.
// Note: it builds a binary string first and then converts that to hex,
// which is roundabout, and bytes below 0x10 lose their leading zero
// (e.g. 0x0A comes out as "A" instead of "0A").
void StringUtilLib::ASCIIToHex(std::string s, std::string * result)
{
    int n = s.length();
    for (int i = 0; i < n; i++)
    {
        unsigned char c = s[i]; // unsigned, so bytes >= 0x80 stay positive
        long val = long(c);
        std::string bin = "";
        while (val > 0)         // collect the bits, least significant first
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        std::reverse(bin.begin(), bin.end()); // needs <algorithm>
        result->append(ConvertBinToHex(bin));
    }
}

// Appends the hex form of the UTF-16 encoding of s to *result.
// Declared void here: the original signature promised a std::string but
// never returned one. Note that padding each byte with "00" only works
// for code points below U+0100.
void ToUTF16(std::string s, std::string * result, int encodings) {
    int n = s.length();
    if (encodings == UTF16) {
        result->append("FEFF"); // byte order mark
    }
    for (int i = 0; i < n; i++)
    {
        int val = (unsigned char)s[i]; // cast first so bytes >= 0x80 stay positive
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        std::reverse(bin.begin(), bin.end());
        if (encodings == UTF16 || encodings == UTF16BE) {
            result->append("00" + ConvertBinToHex(bin)); // big-endian: high byte first
        }
        if (encodings == UTF16LE) {
            result->append(ConvertBinToHex(bin) + "00"); // little-endian: low byte first
        }
    }
}

// Converts a string of '0'/'1' digits to an uppercase hex string.
// The detour through atoll (needs <cstdlib>) treats the binary string as
// a decimal number and re-extracts its digits; this only works because the
// input here never exceeds 8 binary digits. Values below 0x10 come out as
// a single hex digit (no leading zero), and "0" yields an empty string.
std::string ConvertBinToHex(std::string str) {
    long long temp = atoll(str.c_str()); // e.g. "1001000" -> 1001000
    int dec_value = 0;
    int base = 1;
    int i = 0;
    while (temp) {                       // binary digits -> decimal value
        int last_digit = temp % 10;
        temp = temp / 10;
        dec_value += last_digit * base;
        base = base * 2;
    }
    char hexaDeciNum[10];
    while (dec_value != 0)               // decimal value -> hex digits, least significant first
    {
        int rem = dec_value % 16;
        hexaDeciNum[i++] = (rem < 10) ? rem + '0' : rem - 10 + 'A';
        dec_value = dec_value / 16;
    }
    str.clear();
    for (int j = i - 1; j >= 0; j--) {   // reverse into the result string
        str = str + hexaDeciNum[j];
    }
    return str;
}

The question is completely unclear. To encode something you need an input, right? So when you say "encoding Vietnamese characters to UTF-8, UTF-16", what is your input string, and what is its encoding before converting to UTF-8/16? How do you input it — from a file or from the console?

And why on earth are you converting to binary and then to hex? You can print the bytes directly in binary or hex; there is no need to go through a binary string. Note that converting to binary like that is fine for testing but very inefficient in production code. I also don't know what you mean by "But what if my letter is 'Á' or 'À', which is a Vietnamese letter — I cannot get the value of it". Please show a minimal reproducible example along with the input/output.
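For example, a direct hex dump of the bytes of a std::string takes only a few lines. A minimal sketch (printHex is an illustrative name, not part of the original code):

    #include <cstdio>
    #include <string>

    // Print each byte of the string as two uppercase hex digits.
    // Casting to unsigned char first avoids sign extension for
    // bytes >= 0x80 (such as the "ô" in "Tôi").
    void printHex(const std::string& s)
    {
        for (unsigned char c : s)
            std::printf("%02X", c);
        std::printf("\n");
    }

    int main()
    {
        printHex("Hello"); // prints 48656C6C6F
    }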


But I think you just want to output the UTF-encoded bytes of a string literal in the source code, like "ÁÀ". In that case it isn't called "encoding a string" but simply "outputting a string".

Both Á and À in Unicode can be represented by precomposed characters (U+00C1 and U+00C0) or by combining characters (A + U+0301 ◌́ / U+0300 ◌̀). You can switch between them by selecting "Unicode dựng sẵn" (precomposed) or "Unicode tổ hợp" (combining) in Unikey. Suppose you have those characters in string-literal form; then std::string str = "ÁÀ" contains the series of bytes that corresponds to those letters in the source file's encoding. So depending on which encoding you save the *.cpp file as (CP1252, CP1258, UTF-8...), the output byte values will differ.

To force UTF-8/16/32 encoding you just need to use the u8, u, and U prefixes respectively, along with the correct type (char8_t, char16_t, char32_t, or std::u8string / std::u16string / std::u32string):

std::u8string  utf8  = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";

Then just use c_str() to get the underlying buffers and print the bytes. In C++14 std::u8string is not available yet (char8_t and std::u8string arrived in C++20), so just save the file as UTF-8 and use std::string. Similarly, to see the encoding of user input, read it into a std::string from std::cin and print its bytes the same way (standard input is a narrow-character stream, so you cannot read a std::u16string from std::cin directly).
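A minimal sketch of that (C++20 for char8_t/std::u8string; dumpBytes is a hypothetical helper, and the source file is assumed to be saved as UTF-8 so the literals hold the intended characters):

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Dump the raw bytes of any std::basic_string, lowest address first.
    template <typename Str>
    void dumpBytes(const Str& s)
    {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(s.c_str());
        for (std::size_t i = 0; i < s.size() * sizeof(typename Str::value_type); ++i)
            std::printf("%02X ", p[i]);
        std::printf("\n");
    }

    int main()
    {
        std::u8string  utf8  = u8"ÁÀ";
        std::u16string utf16 = u"ÁÀ";
        dumpBytes(utf8);  // C3 81 C3 80 (UTF-8)
        dumpBytes(utf16); // C1 00 C0 00 on a little-endian machine
                          // (UTF-16 code units in native byte order)
    }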

Edit:

To convert between UTF encodings, use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16, etc. (std::wstring_convert and the <codecvt> conversion facets are deprecated since C++17, but still widely available).
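For instance, a sketch of a UTF-8 to UTF-16 round trip with std::wstring_convert (subject to the deprecation caveat above):

    #include <codecvt>
    #include <cstdio>
    #include <locale>
    #include <string>

    int main()
    {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

        std::string utf8 = "T\xC3\xB4i";              // the UTF-8 bytes of "Tôi"
        std::u16string utf16 = conv.from_bytes(utf8); // -> 0054 00F4 0069

        for (char16_t u : utf16)
            std::printf("%04X", (unsigned)u);         // prints 005400F40069
        std::printf("\n");

        std::string back = conv.to_bytes(utf16);      // round trip back to UTF-8
    }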

Working with non-Unicode encodings is trickier and needs an external library such as ICU, or OS-dependent APIs.

Limiting to ISO-8859-1 makes it easier but you still need many lookup tables, and there's no way to convert other encodings to ASCII without loss of information
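One direction is genuinely table-free, though: ISO-8859-1 byte values map 1:1 onto code points U+0000..U+00FF, so converting it to UTF-8 needs no lookup at all. A sketch (latin1ToUtf8 is an illustrative name):

    #include <string>

    // ISO-8859-1 bytes are exactly the code points U+0000..U+00FF:
    // bytes < 0x80 pass through unchanged, the rest become a
    // two-byte UTF-8 sequence.
    std::string latin1ToUtf8(const std::string& in)
    {
        std::string out;
        for (unsigned char c : in) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }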

-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
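A short illustration of that sign issue:

    #include <iostream>

    int main()
    {
        char c = '\xC0'; // the byte for À in CP1258 (and in ISO-8859-1)
        std::cout << static_cast<int>(c) << "\n";  // -64 where char is signed
        std::cout << static_cast<int>(static_cast<unsigned char>(c)) << "\n"; // 192, i.e. 0xC0
    }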

If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar, which accepts a code-page parameter (of course you have to use the correct code page). Alternatively you may try a standard function like mbstowcs, but you need to set up your locale correctly before using it.
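A minimal Windows-only sketch of the MultiByteToWideChar route, assuming the input really is code page 1258 (cp1258ToUtf16 is an illustrative name, and error handling is kept to the bare minimum):

    #include <windows.h>
    #include <string>

    // Convert a CP1258 (Vietnamese) byte string to UTF-16.
    // The first call, with a null output buffer, asks for the
    // required length; the second performs the conversion.
    std::wstring cp1258ToUtf16(const std::string& in)
    {
        if (in.empty()) return std::wstring();
        int len = MultiByteToWideChar(1258, 0, in.data(), (int)in.size(), nullptr, 0);
        if (len == 0) return std::wstring(); // conversion failed
        std::wstring out(len, L'\0');
        MultiByteToWideChar(1258, 0, in.data(), (int)in.size(), &out[0], len);
        return out;
    }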

You might find it easier to switch to wide characters throughout your application, and avoid most transcoding.

As a side note, converting an integer to binary only to convert that to hexadecimal is not an easy or efficient way to display a hexadecimal representation of an integer.
