Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. What I'm trying to do is take an input string from the user, encode it in a given encoding, and then convert the result to hex. I can do this properly as long as the string does not contain any Vietnamese characters, for example when inputString is "Hello". But when the input is a string such as "Tôi", I don't know how to do it.

    enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };

    switch (Encodings)
        {
        case USASCII:
            ASCIIToHex(inputString, &ascii); //hello output 48656C6C6F
            return new ByteField(ascii.c_str());
        case ISO88591:
            ASCIIToHex(inputString, &ascii);//hello output 48656C6C6F
                                              //tôi output 54F469
            return new ByteField(ascii.c_str());
        case UTF8:
            ASCIIToHex(inputString, &ascii);//hello output 48656C6C6F
                                           //tôi output 54C3B469
            return new ByteField(ascii.c_str());
        case UTF16BE:
            ToUTF16(inputString, &ascii, Encodings);//hello output 00480065006C006C006F
                                                    //tôi output 005400F40069
            return new ByteField(ascii.c_str());
        case UTF16:
            ToUTF16(inputString, &ascii, Encodings);//hello output FEFF00480065006C006C006F
                                                    //tôi output FEFF005400F40069
            return new ByteField(ascii.c_str());
        case UTF16LE:
            ToUTF16(inputString, &ascii, Encodings);//hello output 480065006C006C006F00
                                                    //tôi output 5400F4006900
            return new ByteField(ascii.c_str());
        }

void StringUtilLib::ASCIIToHex(std::string s, std::string * result)
{
    int n = s.length();
    for (int i = 0; i < n; i++)
    {
        unsigned char c = s[i];
        long val = long(c);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') :
                bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        result->append(ConvertBinToHex(bin));
    }
}

std::string ToUTF16(std::string s, std::string * result, int encodings) {
    int n = s.length();
    if (encodings == UTF16) {
        result->append("FEFF");
    }
    for (int i = 0; i < n; i++)
    {
        int val = int(s[i]);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') :
                bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        if (encodings == UTF16 || encodings == UTF16BE) {
            result->append("00" + ConvertBinToHex(bin));
        }
        if (encodings == UTF16LE) {
            result->append(ConvertBinToHex(bin) + "00");
        }

    }
}

std::string ConvertBinToHex(std::string str) {
    long long temp = atoll(str.c_str());
    int dec_value = 0;
    int base = 1;
    int i = 0;
    while (temp) {
        int last_digit = temp % 10;
        temp = temp / 10;
        dec_value += last_digit * base;
        base = base * 2;
    }
    char hexaDeciNum[10];
    while (dec_value != 0)
    {
        int temp = 0;
        temp = dec_value % 16;
        if (temp < 10)
        {
            hexaDeciNum[i] = temp + 48;
            i++;
        }
        else
        {
            hexaDeciNum[i] = temp + 55;
            i++;
        }
        dec_value = dec_value / 16;
    }
    str.clear();
    for (int j = i - 1; j >= 0; j--) {
        str = str + hexaDeciNum[j];
    }
    return str;
}

The question is completely unclear. To encode something you need an input, right? So when you say "Encoding Vietnamese Character to UTF8, UTF16", what is your input string, and what encoding is it in before you convert it to UTF-8/16? How do you input it: from a file or from the console?

And why on earth are you converting to binary and then to hex? You can print binary or hex directly from the bytes; there is no need to convert from binary to hex. Note that converting to binary like that is fine for testing but vastly inefficient in production code. I also don't know what you mean by "But what if my letter is "Á" or "À" which is a Vietnamese letter I cannot get the value of it". Please show a minimal, reproducible example along with the input and output.
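
As a minimal sketch of printing hex directly from the bytes, the helper below skips the binary step entirely. The name `BytesToHex` is hypothetical and not part of the original post; it assumes the input `std::string` already holds the encoded bytes.

```cpp
#include <string>

// Hypothetical helper (not from the original post): returns the bytes of `s`
// as uppercase hex, always two digits per byte.
std::string BytesToHex(const std::string& s) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(s.size() * 2);
    for (unsigned char c : s) {            // unsigned char keeps bytes >= 0x80 non-negative
        out.push_back(kDigits[c >> 4]);    // high nibble
        out.push_back(kDigits[c & 0x0F]);  // low nibble
    }
    return out;
}
```

Because every byte yields exactly two digits, a byte such as 0x05 comes out as `05`, whereas the binary round trip in the question appears to drop leading zeros.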


But I think you just want to output the UTF-encoded bytes of a string literal in the source code, like "ÁÀ". In that case it isn't called "encoding a string" but just "outputting a string".

Both Á and À can be represented in Unicode either by precomposed characters (U+00C1 and U+00C0) or by a base letter plus a combining character (A + U+0301 ◌́ / U+0300 ◌̀). You can switch between the two forms by selecting "Unicode dựng sẵn" (precomposed) or "Unicode tổ hợp" (combining) in Unikey. Suppose you have those characters in string literal form; then std::string str = "ÁÀ" contains a series of bytes that corresponds to those letters in the source file encoding. So depending on which encoding you save the *.cpp file in (CP1252, CP1258, UTF-8...), the output byte values will be different.
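
To make the difference between the two forms concrete, here is a small sketch with the UTF-8 bytes of Á written as explicit escapes, so the result does not depend on how the source file is saved. The names `kPrecomposed`, `kCombining` and `SameBytes` are made up for illustration.

```cpp
#include <string>

// UTF-8 bytes spelled out as escapes, independent of the source file encoding.
const std::string kPrecomposed = "\xC3\x81";   // U+00C1, Á as a single code point
const std::string kCombining   = "A\xCC\x81";  // U+0041 'A' followed by U+0301 combining acute

// Plain byte-for-byte comparison; no Unicode normalization is applied.
bool SameBytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Both strings display as Á, but the first is 2 bytes and the second is 3, so `SameBytes(kPrecomposed, kCombining)` is false. Any hex output will differ in the same way depending on which form the input method produced.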

To force UTF-8/16/32 encoding you just need to use the u8, u and U prefixes respectively, along with the correct type (char8_t, char16_t, char32_t or std::u8string / std::u16string / std::u32string).

std::u8string  utf8  = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";

Then just use c_str() to get the underlying buffer and print the bytes. std::u8string is only available from C++20, so in C++14 just save the file as UTF-8 and use std::string. For user input, note that std::cin is a char stream and has no extraction operator for std::u16string or std::u32string; read into a std::string instead and print its bytes, which will be in whatever encoding the console or input file uses.
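
A rough sketch of printing the bytes of a UTF-16 string in either byte order follows. The helper names (`AppendHexByte`, `AppendUnitHex`, `Utf16ToHex`) and the `withBom` parameter are assumptions made for this example, chosen to mirror the UTF16BE / UTF16LE / UTF16 cases in the question.

```cpp
#include <string>

// Hypothetical helper: appends one byte as two uppercase hex digits.
void AppendHexByte(std::string& out, unsigned byte) {
    static const char kDigits[] = "0123456789ABCDEF";
    out.push_back(kDigits[(byte >> 4) & 0x0F]);
    out.push_back(kDigits[byte & 0x0F]);
}

// Appends one 16-bit code unit, high byte first when bigEndian is true.
void AppendUnitHex(std::string& out, char16_t unit, bool bigEndian) {
    const unsigned hi = (unit >> 8) & 0xFF;
    const unsigned lo = unit & 0xFF;
    AppendHexByte(out, bigEndian ? hi : lo);
    AppendHexByte(out, bigEndian ? lo : hi);
}

// Hex-dumps the UTF-16 code units of `s`. When withBom is true, U+FEFF is
// written first in the same byte order.
std::string Utf16ToHex(const std::u16string& s, bool bigEndian, bool withBom = false) {
    std::string out;
    if (withBom) AppendUnitHex(out, 0xFEFF, bigEndian);
    for (char16_t unit : s) AppendUnitHex(out, unit, bigEndian);
    return out;
}
```

With the literal `u"T\u00F4i"`, `Utf16ToHex(..., true)` gives `005400F40069` and `Utf16ToHex(..., false)` gives `5400F4006900`, matching the outputs the question expects. Characters outside the Basic Multilingual Plane are already stored as surrogate pairs in a `std::u16string`, so they need no special handling here.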

Edit:

To convert between UTF encodings use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16... Note that std::wstring_convert and the <codecvt> facets have been deprecated since C++17.
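
As a sketch of what that could look like for UTF-8 and UTF-16, the two functions below wrap std::wstring_convert with std::codecvt_utf8_utf16. The function names are hypothetical. These facilities are deprecated since C++17, so expect deprecation warnings on newer compilers; treat this as an illustration rather than a recommendation.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrappers around the (deprecated since C++17) standard converters.
// Both throw std::range_error if the input is not valid in its encoding.
std::u16string Utf8ToUtf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

std::string Utf16ToUtf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

For example, the UTF-8 bytes `54 C3 B4 69` ("Tôi") convert to the code units `0054 00F4 0069`, and back again.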

Working with non-Unicode encodings is trickier and needs an external library such as ICU, or OS-dependent APIs.

Limiting yourself to ISO-8859-1 makes it easier, but you still need many lookup tables, and there's no way to convert other encodings to ASCII without loss of information.
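
One direction that needs no table at all is ISO-8859-1 to UTF-8, because each ISO-8859-1 byte value equals the Unicode code point of the same number (U+0000 to U+00FF). A minimal sketch, with the hypothetical name `Latin1ToUtf8`:

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// Each input byte is the code point itself, so only the UTF-8 packing is needed.
std::string Latin1ToUtf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // ASCII: one byte, unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte: 110xxxxx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte: 10xxxxxx
        }
    }
    return out;
}
```

The ISO-8859-1 bytes `54 F4 69` ("Tôi") become `54 C3 B4 69` in UTF-8. Other legacy code pages such as CP1258 do not have this one-to-one property, which is where lookup tables or a library come in.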

-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
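
The cast can be sketched as follows; `ByteValue` is a hypothetical name. À is the byte 0xC0 in CP1258, which reads as -64 through a signed char and as 192 through an unsigned char.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte at index `i` as a value in 0..255,
// regardless of whether plain char is signed on this platform.
int ByteValue(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}
```

In the question's ToUTF16, `int val = int(s[i]);` has no such cast, so on platforms where char is signed any byte at or above 0x80 becomes negative and the `while (val > 0)` loop never runs for it.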

If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar, which accepts a code page parameter (of course you have to pass the correct code page). Alternatively you may try a standard function like mbstowcs, but you need to set up your locale correctly before using it.

You might find it easier to switch to wide characters throughout your application and avoid most transcoding.

As a side note, converting an integer to binary only to convert that to hexadecimal is neither an easy nor an efficient way to display the hexadecimal representation of an integer.
