I have a German string in C#
string s = "Menü";
I would like to get UTF-8 codepoints:
expected result:
\x4D\x65\x6E\xC3\xBC
The expected result has been verified with an online UTF-8 encoder/decoder and with Unicode code converter v8.1.
I tried a lot of conversion methods but I cannot get the expected result.
UPDATE:
Funny, the problem was not in the source code: the input file had the wrong encoding :-) These answers helped me a lot.
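The update does not say which encoding the input file actually used, so the following is only a sketch under an assumption: the file was saved as ISO-8859-1 (Latin-1) but read as UTF-8. The class and method names are hypothetical; the sketch only illustrates why the bytes come out wrong when the decoder does not match the file.

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Formats bytes as hyphen-separated hex, e.g. "4D-65".
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

    public static void Run()
    {
        // "\u00FC" is ü; the escape keeps this source file encoding-independent.
        string text = "Men\u00FC";

        // Assumption: this is what the input file contains on disk (Latin-1).
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] fileBytes = latin1.GetBytes(text);
        Console.WriteLine(Hex(fileBytes)); // 4D-65-6E-FC

        // Decoding those bytes as UTF-8 does not give back the original text,
        // because a lone 0xFC is not a valid UTF-8 sequence.
        string misread = Encoding.UTF8.GetString(fileBytes);
        Console.WriteLine(misread == text); // False

        // Decoding with the encoding the file really uses fixes the result.
        string correct = latin1.GetString(fileBytes);
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(correct))); // 4D-65-6E-C3-BC
    }
}
```

When reading a file, passing the encoding explicitly (for example `File.ReadAllText(path, encoding)`) avoids relying on a default that may not match the file.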
There's no such thing as "UTF-8 codepoints" - there are UTF-8 code units, or Unicode code points.
In the string Menü, there are 4 code points: U+004D (M), U+0065 (e), U+006E (n) and U+00FC (ü).
For BMP characters (i.e. those in the range U+0000 to U+FFFF) it's as simple as iterating over the char values in a string. For non-BMP characters it's slightly trickier. StringInfo looks helpful here, but it includes combining characters when iterating over text elements. It's not terribly hard to spot surrogate pairs in a string, but I don't think there's a very simple way of iterating over all the code points in a string.
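Spotting surrogate pairs by hand could look roughly like the sketch below. The class name `CodePoints` and method name `Enumerate` are made up for illustration; `char.ConvertToUtf32` and `char.IsHighSurrogate` are the framework calls doing the work. On .NET Core 3.0 and later, `string.EnumerateRunes()` is a built-in alternative.

```csharp
using System;
using System.Collections.Generic;

static class CodePoints
{
    // Yields each Unicode code point in s, treating a surrogate pair as one value.
    public static IEnumerable<int> Enumerate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // Reads the code point that starts at index i.
            // Throws ArgumentException if s contains a lone surrogate.
            int codePoint = char.ConvertToUtf32(s, i);

            // A non-BMP character occupies two chars; skip the low surrogate.
            if (char.IsHighSurrogate(s[i]))
            {
                i++;
            }

            yield return codePoint;
        }
    }
}
```

Calling `CodePoints.Enumerate("Menü")` yields the four values 0x4D, 0x65, 0x6E and 0xFC, and a string holding a single non-BMP character such as U+1F600 yields one value rather than two.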
Finding the UTF-8 code units, i.e. the UTF-8-encoded representation of a string as bytes, is simple:
byte[] bytes = Encoding.UTF8.GetBytes(text);
That will give you the five bytes you listed in your question: 0x4d, 0x65, 0x6e, 0xc3, 0xbc.
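To get the exact `\x4D\x65\x6E\xC3\xBC` form shown in the question, the bytes still need to be formatted. A minimal sketch, assuming a hypothetical helper named `Utf8Escape.ToHexEscapes` (not part of the framework):

```csharp
using System;
using System.Linq;
using System.Text;

static class Utf8Escape
{
    // Encodes text as UTF-8 and renders each byte as \xNN with uppercase hex digits.
    public static string ToHexEscapes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return string.Concat(bytes.Select(b => "\\x" + b.ToString("X2")));
    }
}
```

`Utf8Escape.ToHexEscapes("Menü")` returns the string `\x4D\x65\x6E\xC3\xBC`, provided the literal itself was read with the right encoding.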
Use Encoding.UTF8; example below.
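using System;
using System.Text;

string menu = "Menü";
Console.WriteLine(menu);
// Encode the string as UTF-8 bytes.
var utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(menu);
foreach (byte b in utfBytes)
{
    // Print each byte as two uppercase hex digits.
    Console.WriteLine("Hex: {0:X2}", b);
}
// Decode the bytes back into a string to confirm the round trip.
string menu2 = utf8.GetString(utfBytes, 0, utfBytes.Length);
Console.WriteLine(menu2);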
You need to explicitly convert:
var utf8 = Encoding.UTF8.GetBytes("Menü");
and utf8 contains 0x4d, 0x65, 0x6e, 0xc3, 0xbc.