
Read UTF8/UNICODE characters from an escaped ASCII sequence

I have the following file name stored in a file, and I need to read it as a UTF-8 encoded string. So from this:

test_\303\246\303\270\303\245.txt

I need to obtain the following:

test_æøå.txt

Do you know how to achieve this using C#?

Assuming you have this string:

string input = "test_\\303\\246\\303\\270\\303\\245.txt";

i.e. literally the characters

test_\303\246\303\270\303\245.txt

You could do this:

using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "test_\\303\\246\\303\\270\\303\\245.txt";
Encoding iso88591 = Encoding.GetEncoding(28591); // ISO-8859-1; see the note at the end of the answer
Encoding utf8 = Encoding.UTF8;

// Turn the octal escape sequences into characters with code points 0-255.
// This results in a "binary string".
// The first digit is limited to 0-3 so that every match fits in one byte (max \377).
string binaryString = Regex.Replace(input, @"\\(?<num>[0-3][0-7]{2})", delegate(Match m)
{
    string oct = m.Groups["num"].Value;
    return Char.ConvertFromUtf32(Convert.ToInt32(oct, 8));
});

// Turn the "binary string" into bytes
byte[] raw = iso88591.GetBytes(binaryString);

// Decode the bytes as UTF-8 into a C# string
string output = utf8.GetString(raw);
Console.WriteLine(output);
// test_æøå.txt

By "binary string" I mean a string consisting only of characters with code points 0-255. It amounts to a poor man's byte[], where you read the code point of the character at index i instead of the byte value at index i of a byte[] (this is what we did in JavaScript a few years ago). Because ISO-8859-1 maps the first 256 Unicode code points one-to-one onto single bytes, it is perfect for converting a "binary string" into a byte[].
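If the intermediate "binary string" feels too indirect, the same idea can be sketched by collecting the bytes directly. This is only an illustration under some assumptions: `OctalEscapes`, `Decode` and `IsOctalEscape` are hypothetical names, not part of the answer above or of any library, and the sketch assumes that every escape is a backslash followed by exactly three octal digits no larger than `\377`. Unlike the ISO-8859-1 round trip, text outside the escapes is encoded as UTF-8, so literal characters above code point 255 are not lost.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class OctalEscapes
{
    // Hypothetical helper: treats each \ooo escape as one raw byte,
    // encodes everything else as UTF-8, then decodes the whole buffer as UTF-8.
    public static string Decode(string input)
    {
        var bytes = new List<byte>();
        int literalStart = 0; // start of the pending run of non-escaped text
        int i = 0;

        while (i < input.Length)
        {
            if (IsOctalEscape(input, i))
            {
                // Flush the literal text that precedes this escape.
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(literalStart, i - literalStart)));
                // Convert the three octal digits into a single byte.
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 3), 8));
                i += 4;
                literalStart = i;
            }
            else
            {
                i++;
            }
        }

        // Flush whatever literal text remains after the last escape.
        bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(literalStart)));
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    // True when s[i..i+3] is a backslash followed by an octal value in 000-377.
    private static bool IsOctalEscape(string s, int i)
    {
        return i + 3 < s.Length
            && s[i] == '\\'
            && s[i + 1] >= '0' && s[i + 1] <= '3'
            && s[i + 2] >= '0' && s[i + 2] <= '7'
            && s[i + 3] >= '0' && s[i + 3] <= '7';
    }
}
```

Calling `OctalEscapes.Decode(@"test_\303\246\303\270\303\245.txt")` should give `test_æøå.txt`, since `\303\246`, `\303\270` and `\303\245` are the UTF-8 byte pairs for æ, ø and å. Anything that does not match the assumed escape shape, such as `\400` or a trailing backslash, is left in the output unchanged.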


 