Convert a Unicode string to an escaped ASCII string
How can I convert this string:
This string contains the Unicode character Pi(π)
into an escaped ASCII string:
This string contains the Unicode character Pi(\u03c0)
and vice versa?
The Encoding classes currently available in C# convert the π character to "?". I need to preserve that character.
This goes back and forth to and from the \uXXXX format.
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main( string[] args ) {
        string unicodeString = "This function contains a unicode character pi (\u03a0)";
        Console.WriteLine( unicodeString );
        string encoded = EncodeNonAsciiCharacters(unicodeString);
        Console.WriteLine( encoded );
        string decoded = DecodeEncodedNonAsciiCharacters( encoded );
        Console.WriteLine( decoded );
    }

    // Replaces every UTF-16 code unit above 127 with a \uXXXX escape.
    static string EncodeNonAsciiCharacters( string value ) {
        StringBuilder sb = new StringBuilder();
        foreach( char c in value ) {
            if( c > 127 ) {
                // This character is too big for ASCII
                string encodedValue = "\\u" + ((int) c).ToString( "x4" );
                sb.Append( encodedValue );
            }
            else {
                sb.Append( c );
            }
        }
        return sb.ToString();
    }

    // Turns every \uXXXX escape back into the corresponding character.
    // Note: [a-zA-Z0-9] also matches non-hex letters (g-z); int.Parse then throws a FormatException.
    static string DecodeEncodedNonAsciiCharacters( string value ) {
        return Regex.Replace(
            value,
            @"\\u(?<Value>[a-zA-Z0-9]{4})",
            m => {
                return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
            } );
    }
}
Outputs:
This function contains a unicode character pi (Π)
This function contains a unicode character pi (\u03a0)
This function contains a unicode character pi (Π)
To unescape, you can simply use one of these functions:
System.Text.RegularExpressions.Regex.Unescape(string)
System.Uri.UnescapeDataString(string)
I suggest using this method (it works better with UTF-8):
UnescapeDataString(string)
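The two functions decode different escape formats, which matters when choosing between them. The short sketch below (the class name UnescapeDemo is made up for illustration) shows that Regex.Unescape understands \uXXXX sequences, while Uri.UnescapeDataString understands percent-encoded UTF-8 bytes and leaves \uXXXX text as it is. Note that Regex.Unescape also interprets other regex escapes such as \n and \t, so it is not a dedicated \uXXXX decoder.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnescapeDemo
{
    static void Main()
    {
        // Regex.Unescape converts a \uXXXX sequence into the character it names.
        string fromBackslashU = Regex.Unescape(@"Pi(\u03c0)");
        Console.WriteLine(fromBackslashU == "Pi(\u03c0)"); // True

        // Uri.UnescapeDataString decodes percent-encoding; CF 80 is the UTF-8 form of U+03C0.
        string fromPercent = Uri.UnescapeDataString("Pi(%CF%80)");
        Console.WriteLine(fromPercent == "Pi(\u03c0)"); // True

        // It does not touch \uXXXX text.
        string untouched = Uri.UnescapeDataString(@"Pi(\u03c0)");
        Console.WriteLine(untouched == @"Pi(\u03c0)"); // True
    }
}
```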
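// Requires: using System; using System.Linq;

// Applies proc to every char of input and concatenates the results.
string StringFold(string input, Func<char, string> proc)
{
    return string.Concat(input.Select(proc).ToArray());
}

// Escapes a single char as \uXXXX if it is outside the ASCII range.
string FoldProc(char input)
{
    if (input >= 128)
    {
        return string.Format(@"\u{0:x4}", (int)input);
    }
    return input.ToString();
}

string EscapeToAscii(string input)
{
    return StringFold(input, FoldProc);
}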
As a one-liner:
var result = Regex.Replace(input, @"[^\x00-\x7F]", c =>
    string.Format(@"\u{0:x4}", (int)c.Value[0]));
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
        StringBuilder asAscii = new StringBuilder(); // stores the final ASCII string with \uXXXX escapes
        foreach (char c in originalString)
        {
            // test if char is ASCII, otherwise write its UTF-16 code unit as \uXXXX
            int cint = Convert.ToInt32(c);
            if (cint <= 127 && cint >= 0)
                asAscii.Append(c);
            else
                asAscii.Append(String.Format("\\u{0:x4}", cint));
        }
        Console.WriteLine("Final string: {0}", asAscii);
        Console.ReadKey();
    }
}
All non-ASCII chars are converted to their \uXXXX representation (the value of the UTF-16 code unit) and appended to the final string.
Here is my current implementation:
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Note: StringBuilderCache is internal to the framework; copy it from the reference source to compile this.
public static class UnicodeStringExtensions
{
    public static string EncodeNonAsciiCharacters(this string value) {
        var bytes = Encoding.Unicode.GetBytes(value);
        var sb = StringBuilderCache.Acquire(value.Length);
        bool encodedsomething = false;
        // Walk the UTF-16 code units (two bytes each).
        for (int i = 0; i < bytes.Length; i += 2) {
            var c = BitConverter.ToUInt16(bytes, i);
            if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D) {
                sb.Append((char) c);
            } else {
                sb.Append($"\\u{c:x4}");
                encodedsomething = true;
            }
        }
        // Return the original instance when nothing had to be escaped.
        if (!encodedsomething) {
            StringBuilderCache.Release(sb);
            return value;
        }
        return StringBuilderCache.GetStringAndRelease(sb);
    }

    public static string DecodeEncodedNonAsciiCharacters(this string value)
        => Regex.Replace(value, /*language=regexp*/@"(?:\\u[a-fA-F0-9]{4})+", Decode);

    static readonly string[] Splitsequence = new [] { "\\u" };

    // Decodes a whole run of \uXXXX escapes at once so surrogate pairs stay together.
    private static string Decode(Match m) {
        var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => ushort.Parse(s, NumberStyles.HexNumber)).SelectMany(BitConverter.GetBytes).ToArray();
        return Encoding.Unicode.GetString(bytes);
    }
}
This passes a test:
public void TestBigUnicode() {
    var s = "\U00020000";
    var encoded = s.EncodeNonAsciiCharacters();
    var decoded = encoded.DecodeEncodedNonAsciiCharacters();
    // Assert.AreEqual (NUnit/MSTest) performs the comparison; Assert.Equals does not assert equality.
    Assert.AreEqual(s, decoded);
}
with the encoded value: "\ud840\udc00"
This implementation makes use of a StringBuilderCache (reference source link)
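The test character U+20000 lies outside the Basic Multilingual Plane, so a C# string stores it as two UTF-16 code units (a surrogate pair), and each code unit gets its own \uXXXX escape. A minimal sketch of that, using a hypothetical helper EscapeCodeUnits that is not part of the answer above:

```csharp
using System;

static class SurrogateDemo
{
    // Writes every UTF-16 code unit of s as \uXXXX (hypothetical helper for illustration).
    public static string EscapeCodeUnits(string s)
    {
        var parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            parts[i] = "\\u" + ((int)s[i]).ToString("x4");
        return string.Concat(parts);
    }

    static void Main()
    {
        string s = char.ConvertFromUtf32(0x20000); // same string as "\U00020000"
        Console.WriteLine(s.Length);               // 2
        Console.WriteLine(EscapeCodeUnits(s));     // \ud840\udc00
    }
}
```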
A small patch to @Adam Sills's answer that fixes the FormatException for input strings such as "c:\u00ab\\otherdirectory\\". It also uses RegexOptions.Compiled, which makes repeated matching faster at the cost of a slower one-time construction of the Regex:
private static Regex DECODING_REGEX = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
private const string PLACEHOLDER = @"#!#";
public static string DecodeEncodedNonAsciiCharacters(this string value)
{
    return DECODING_REGEX.Replace(
        value.Replace(@"\\", PLACEHOLDER),
        m => {
            return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString(); })
        .Replace(PLACEHOLDER, @"\\");
}
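The placeholder trick breaks if the input itself happens to contain the placeholder text. One possible alternative, shown here only as a sketch (the class SinglePassDecoder and its members are invented for this example), is to let a single regex match either an escaped backslash pair or a \uXXXX sequence, and copy the backslash pair through unchanged:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class SinglePassDecoder
{
    // Alternative 1: two literal backslashes. Alternative 2: \u followed by exactly four hex digits.
    private static readonly Regex Escapes =
        new Regex(@"\\\\|\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);

    public static string Decode(string value)
    {
        return Escapes.Replace(value, m =>
            m.Groups["Value"].Success
                ? ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
                : m.Value); // an escaped backslash pair is kept as it is
    }

    static void Main()
    {
        Console.WriteLine(Decode(@"Pi(\u03c0)") == "Pi(\u03c0)");         // True
        Console.WriteLine(Decode(@"c:\\u00ab\\dir") == @"c:\\u00ab\\dir"); // True
    }
}
```

Because the backslash pair is consumed first, the "u00ab" that follows it is never treated as the start of an escape.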
To store actual Unicode code points, you first have to decode the string's UTF-16 code units to UTF-32 code units (which are currently the same as the Unicode code points). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting bytes to the StringBuilder as needed, i.e.:
static void Main(string[] args)
{
    String originalString = "This string contains the unicode character Pi(π)";
    // Each code point becomes exactly four bytes in UTF-32.
    Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
    StringBuilder asAscii = new StringBuilder();
    for (int idx = 0; idx < bytes.Length; idx += 4)
    {
        uint codepoint = BitConverter.ToUInt32(bytes, idx);
        if (codepoint <= 127)
            asAscii.Append(Convert.ToChar(codepoint));
        else
            // Note: code points above 0xFFFF produce more than four hex digits here.
            asAscii.AppendFormat("\\u{0:x4}", codepoint);
    }
    Console.WriteLine("Final string: {0}", asAscii);
    Console.ReadKey();
}
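A \u escape followed by five hex digits cannot be told apart from a four-digit escape followed by an ordinary digit. A sketch of one way around this, assuming the C# literal convention of \UXXXXXXXX (eight digits) for code points above U+FFFF; the class CodePointEscaper is hypothetical and works on the string directly instead of going through UTF-32 bytes:

```csharp
using System;
using System.Text;

static class CodePointEscaper
{
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            int codePoint;
            // Combine a valid surrogate pair into one code point; otherwise use the single code unit.
            if (char.IsHighSurrogate(value[i]) && i + 1 < value.Length && char.IsLowSurrogate(value[i + 1]))
            {
                codePoint = char.ConvertToUtf32(value[i], value[i + 1]);
                i++; // skip the low surrogate
            }
            else
            {
                codePoint = value[i];
            }

            if (codePoint <= 0x7F)
                sb.Append((char)codePoint);
            else if (codePoint <= 0xFFFF)
                sb.AppendFormat("\\u{0:x4}", codePoint);
            else
                sb.AppendFormat("\\U{0:x8}", codePoint);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("Pi(\u03c0)"));  // Pi(\u03c0)
        Console.WriteLine(Escape("\U00020000")); // \U00020000
    }
}
```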
You need to use the Convert() method in the Encoding class:
1. Create an Encoding object that represents ASCII encoding
2. Create an Encoding object that represents Unicode encoding
3. Call Encoding.Convert() with the source encoding, the destination encoding, and the bytes of the string to be encoded
There is an example here:
using System;
using System.Text;

namespace ConvertExample
{
    class ConvertExampleClass
    {
        static void Main()
        {
            string unicodeString = "This string contains the unicode character Pi(\u03a0)";

            // Create two different encodings.
            Encoding ascii = Encoding.ASCII;
            Encoding unicode = Encoding.Unicode;

            // Convert the string into a byte[].
            byte[] unicodeBytes = unicode.GetBytes(unicodeString);

            // Perform the conversion from one encoding to the other.
            byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);

            // Convert the new byte[] into a char[] and then into a string.
            // This is a slightly different approach to converting to illustrate
            // the use of GetCharCount/GetChars.
            char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
            ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
            string asciiString = new string(asciiChars);

            // Display the strings created before and after the conversion.
            Console.WriteLine("Original string: {0}", unicodeString);
            Console.WriteLine("Ascii converted string: {0}", asciiString);
        }
    }
}
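Keep in mind that Encoding.ASCII substitutes "?" for every character it cannot represent, which is the behavior the question describes. The sketch below (the class AsciiFallbackDemo and its method names are made up for illustration) shows that substitution, and how an encoder built with EncoderExceptionFallback can be used to detect such characters instead of silently losing them:

```csharp
using System;
using System.Text;

static class AsciiFallbackDemo
{
    // Encodes to ASCII and back; non-ASCII characters come back as '?'.
    public static string RoundTripThroughAscii(string value)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(value));
    }

    // Returns false if the string contains anything ASCII cannot represent.
    public static bool IsPureAscii(string value)
    {
        // A strict encoder throws instead of substituting '?'.
        Encoding strict = Encoding.GetEncoding(
            "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strict.GetBytes(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(RoundTripThroughAscii("Pi(\u03c0)")); // Pi(?)
        Console.WriteLine(IsPureAscii("Pi(\u03c0)"));           // False
        Console.WriteLine(IsPureAscii("Pi"));                   // True
    }
}
```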