
c# - How to convert a converted UTF8 string to UTF16?

I'm trying to convert a string that was already converted from UTF-8 into UTF-16. The text comes from a file I read, and it arrives looking like the variable str below.

For example, the input would be the string "NÃ£o Ã© possÃ­vel equipar" and the result I need is "Não é possível equipar".

static void Main(string[] args)
{
    test3();
    Console.ReadKey();
}

static void test3()
{
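    // "NÃ£o Ã© possÃ­vel equipar", written with escapes so the source file's encoding cannot alter it
    string str = "N\u00C3\u00A3o \u00C3\u00A9 poss\u00C3\u00ADvel equipar";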
    string strUTF16 = Utf8ToUtf16(str);

    Console.WriteLine(str);
    Console.WriteLine(strUTF16);
}

static string Utf8ToUtf16(string utf8String)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf8String);
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    return Encoding.Unicode.GetString(unicodeBytes);
}

I really don't know how to solve this. Any tips?

If you want to read a file, then read the file, and specify the encoding of that file when you do. File.ReadAllText looks for a byte order mark and otherwise assumes UTF-8, so reading a UTF-8 file doesn't strictly require the encoding to be specified. If you want to save that text to a file with a specific encoding, specify that encoding when saving the file.

var text = File.ReadAllText(filePath, Encoding.UTF8);

File.WriteAllText(filePath, text, Encoding.Unicode);

That will effectively convert a file from UTF-8 encoding to UTF-16. A more verbose version, which unlike File.WriteAllText does not write a byte order mark at the start of the file, would be:

var data = File.ReadAllBytes(filePath);
var text = Encoding.UTF8.GetString(data);

data = Encoding.Unicode.GetBytes(text);
File.WriteAllBytes(filePath, data);
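As a quick check of the file-based approach, here is a minimal sketch that writes a small UTF-8 file, re-encodes it, and inspects the result. The class name FileEncodingDemo, the helper ConvertUtf8FileToUtf16, and the use of temporary files are assumptions made for illustration only.

```csharp
using System;
using System.IO;
using System.Text;

static class FileEncodingDemo
{
    // Hypothetical helper: re-encodes a UTF-8 text file as UTF-16 LE
    // and returns the raw bytes of the new file.
    public static byte[] ConvertUtf8FileToUtf16(string sourcePath, string targetPath)
    {
        string text = File.ReadAllText(sourcePath, Encoding.UTF8);
        File.WriteAllText(targetPath, text, Encoding.Unicode);
        return File.ReadAllBytes(targetPath);
    }

    static void Main()
    {
        string source = Path.GetTempFileName();
        string target = Path.GetTempFileName();
        try
        {
            // "possível", written without a BOM so the file holds only the UTF-8 text.
            File.WriteAllText(source, "poss\u00EDvel", new UTF8Encoding(false));

            byte[] utf16 = ConvertUtf8FileToUtf16(source, target);

            // File.WriteAllText emits the UTF-16 LE byte order mark first.
            Console.WriteLine(BitConverter.ToString(utf16, 0, 2)); // FF-FE

            // Reading the file back as UTF-16 yields the original text.
            Console.WriteLine(File.ReadAllText(target, Encoding.Unicode) == "poss\u00EDvel"); // True
        }
        finally
        {
            File.Delete(source);
            File.Delete(target);
        }
    }
}
```

The in-memory string never changes during this process; only the bytes on disk differ between the two files.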

Your Utf8ToUtf16() function is effectively a no-op. You are taking an arbitrary UTF-16 string as input (a C# string is always UTF-16), encoding it into UTF-8 bytes, then decoding those bytes as UTF-8 back into UTF-16. So you end up with the same string value you started with. You may as well have written the following; the result would be the same:

static string Utf8ToUtf16(string utf8String)
{
    return utf8String;
}

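That being said, NÃ£o Ã© possÃ­vel equipar is what you get when the UTF-8 encoded form of Não é possível equipar is misinterpreted as Latin-1 (ISO-8859-1), Windows-125x, or a similar single-byte encoding, instead of being properly interpreted as UTF-8 to begin with. (The í turns into Ã followed by an invisible soft hyphen, U+00AD, which is why it can look like a bare Ã.)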
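To see how this kind of garbling arises, the following sketch encodes the correct text as UTF-8 and then deliberately decodes the bytes as ISO-8859-1. The class and method names (MojibakeDemo, MisdecodeAsLatin1) are made up for this example; non-ASCII characters are printed as escapes so the output does not depend on the console's own encoding.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: encodes text as UTF-8, then decodes those bytes
    // with the wrong encoding (ISO-8859-1), reproducing the garbled form.
    public static string MisdecodeAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    static void Main()
    {
        // "Não é possível equipar", written with escapes.
        string garbled = MisdecodeAsLatin1("N\u00E3o \u00E9 poss\u00EDvel equipar");

        // Each accented letter has become two characters:
        // U+00C3 followed by the second byte of its UTF-8 form.
        foreach (char c in garbled)
            Console.Write(c > 127 ? $"\\u{(int)c:X4}" : c.ToString());
        Console.WriteLine();
        // N\u00C3\u00A3o \u00C3\u00A9 poss\u00C3\u00ADvel equipar
    }
}
```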

If you have a C# string that contains such UTF-8 bytes which were up-scaled as-is to UTF-16 (why???), then you need to down-scale those characters as-is back into 8-bit bytes, and then you can decode those bytes as UTF-8, e.g.:

static void test3()
{
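    // The mis-decoded input "NÃ£o Ã© possÃ­vel equipar", written with escapes
    string str = "N\u00C3\u00A3o \u00C3\u00A9 poss\u00C3\u00ADvel equipar";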
    string strUTF16 = Utf8ToUtf16(str);

    Console.WriteLine(str);
    Console.WriteLine(strUTF16);
}

static string Utf8ToUtf16(string utf8String)
{
    byte[] utf8Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(utf8String); // or: GetEncoding(28591)
    return Encoding.UTF8.GetString(utf8Bytes);
}
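Note that the Latin-1 round trip above corrupts a string that was already correct, because a lone ã or é is not valid UTF-8 once narrowed to a single byte. A more defensive variant might only apply the repair when the result decodes cleanly. This is a sketch under that assumption, not part of the original answer; MojibakeRepair and TryRepair are hypothetical names, and text that was mis-decoded as Windows-1252 rather than ISO-8859-1 may contain characters above U+00FF and would be left untouched.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Hypothetical helper: returns the repaired text, or the input unchanged
    // if it does not look like UTF-8 bytes that were decoded as ISO-8859-1.
    public static string TryRepair(string text)
    {
        // Characters above U+00FF cannot have come from a single Latin-1 byte.
        foreach (char c in text)
            if (c > 0xFF) return text;

        byte[] bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(text);

        // A strict decoder throws on invalid UTF-8 instead of substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return text; // not valid UTF-8, so leave the string alone
        }
    }

    static void Main()
    {
        string garbled = "N\u00C3\u00A3o \u00C3\u00A9 poss\u00C3\u00ADvel equipar";
        string correct = "N\u00E3o \u00E9 poss\u00EDvel equipar";

        Console.WriteLine(TryRepair(garbled) == correct); // True
        Console.WriteLine(TryRepair(correct) == correct); // True, left unchanged
    }
}
```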
