How to get UTF-8 codepoints of C# string?

Question

I have a German string in C#:

string s = "Menü";

I would like to get UTF-8 codepoints:

expected result:

\x4D\x65\x6E\xC3\xBC

The expected result has been verified with an online UTF-8 encoder/decoder and with Unicode code converter v8.1.

I tried a lot of conversion methods but I cannot get the expected result.

UPDATE:

Funny, the problem was not in the source code but in the wrong encoding in the input file :-) These answers helped me a lot.
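
The sketch below shows one way a wrongly decoded input file can produce unexpected bytes. It assumes the file held UTF-8 bytes that were decoded as ISO-8859-1; the encoding actually involved is not stated in the question, and the class and method names are illustrative only.

```csharp
using System;
using System.Text;

static class MisdecodeDemo
{
    // Hypothetical helper: decodes bytes with the assumed encoding,
    // then re-encodes the resulting string as UTF-8.
    public static byte[] DecodeThenEncodeUtf8(byte[] fileBytes, Encoding assumed)
    {
        string text = assumed.GetString(fileBytes);
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        // "Menü" as it would be stored in a UTF-8 file.
        byte[] fileBytes = { 0x4D, 0x65, 0x6E, 0xC3, 0xBC };

        // Decoding with the right encoding round-trips the bytes unchanged.
        byte[] ok = DecodeThenEncodeUtf8(fileBytes, Encoding.UTF8);

        // Decoding as ISO-8859-1 (an assumption for this demo) turns each
        // of the two bytes of "ü" into its own character.
        byte[] bad = DecodeThenEncodeUtf8(fileBytes, Encoding.GetEncoding("ISO-8859-1"));

        Console.WriteLine(BitConverter.ToString(ok));   // 4D-65-6E-C3-BC
        Console.WriteLine(BitConverter.ToString(bad));  // 4D-65-6E-C3-83-C2-BC
    }
}
```

If the bytes printed for a string look like the second line, the string was probably damaged when it was read, before any conversion code ran.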

Answer 1

There's no such thing as "UTF-8 codepoints": there are UTF-8 code units, and there are Unicode code points.

In the string Menü, there are 4 code points:

U+004D
U+0065
U+006E
U+00FC

For BMP characters (i.e. those in the range U+0000 to U+FFFF) it's as simple as iterating over the char values in a string. Non-BMP characters are slightly trickier, because each one is stored as a surrogate pair of two char values. StringInfo looks helpful here, but it iterates over text elements, so it groups combining characters with their base character. It's not terribly hard to spot surrogate pairs in a string, but I don't think there's a very simple way of iterating over all the code points in a string.
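
Spotting surrogate pairs by hand could look like the sketch below. It is not part of the original answer: the class and method names are made up for illustration, and it relies on char.ConvertToUtf32 and char.IsHighSurrogate from the base class library. On .NET Core 3.0 and later, string.EnumerateRunes() should cover the same ground without a helper.

```csharp
using System;
using System.Collections.Generic;

static class CodePoints
{
    // Hypothetical helper: yields each Unicode code point of a UTF-16 string.
    public static IEnumerable<int> Enumerate(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // ConvertToUtf32 combines a valid surrogate pair into one code
            // point and throws ArgumentException for an unpaired surrogate.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
            {
                i++; // the low surrogate was consumed as part of the pair
            }
            yield return codePoint;
        }
    }

    static void Main()
    {
        // "Men\u00FC" is "Menü"; the escape avoids source-file encoding issues.
        foreach (int cp in Enumerate("Men\u00FC"))
        {
            Console.WriteLine($"U+{cp:X4}"); // U+004D, U+0065, U+006E, U+00FC
        }
    }
}
```

A string holding a non-BMP character, such as "\U0001F600", has Length 2 but yields the single code point 0x1F600.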

Finding the UTF-8 code units, i.e. the UTF-8-encoded representation of a string as bytes, is simple:

byte[] bytes = Encoding.UTF8.GetBytes(text);

That will give you the five bytes you listed in your question: 0x4d, 0x65, 0x6e, 0xc3, 0xbc.
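
To print those bytes in the \xNN form used in the question, something like the following sketch may help. The helper name ToHexEscapes is an assumption made for this example, not an existing API.

```csharp
using System;
using System.Linq;
using System.Text;

static class Utf8Escape
{
    // Hypothetical helper: formats the UTF-8 bytes of a string as \xNN escapes.
    public static string ToHexEscapes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        // "X2" pads every byte to two upper-case hex digits.
        return string.Concat(bytes.Select(b => "\\x" + b.ToString("X2")));
    }

    static void Main()
    {
        // "Men\u00FC" is "Menü", written with an escape so the result does
        // not depend on the encoding of the source file.
        Console.WriteLine(ToHexEscapes("Men\u00FC")); // \x4D\x65\x6E\xC3\xBC
    }
}
```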

Answer 2

Use Encoding.UTF8, as in the example below.

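        // Requires: using System; using System.Text;
        string menu = "Menü";
        Console.WriteLine(menu);

        // Encode the string as UTF-8 bytes and print each byte in hex.
        var utf8 = Encoding.UTF8;
        byte[] utfBytes = utf8.GetBytes(menu);
        foreach (byte b in utfBytes)
        {
            Console.WriteLine("Hex: {0:X2}", b);
        }

        // Decode the bytes again to confirm the round trip.
        string menu2 = utf8.GetString(utfBytes, 0, utfBytes.Length);
        Console.WriteLine(menu2);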

Answer 3

You need to explicitly convert:

var utf8 = Encoding.UTF8.GetBytes("Menü");

After this, utf8 contains the bytes 0x4d, 0x65, 0x6e, 0xc3, 0xbc.

3 answers

solution1
8 2016-08-10 11:02:20

solution2
3 ACCEPTED 2016-08-10 11:08:38

solution3
1 2016-08-10 11:02:55
