How do I convert Unicode escape sequences to Unicode characters in a .NET string?

Say you've loaded a text file into a string, and you'd like to convert all Unicode escapes into actual Unicode characters inside the string.

Example:

"The following is the top half of an integral character in Unicode '\u2320', and this is the lower half '\U2321'."

The answer is simple and works well with strings up to at least several thousand characters.

Example 1:

Regex  rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );

Example 2:

Regex  rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );

The first example makes the replacement with a lambda expression (C# 3.0); the second uses an anonymous delegate, which should also work with C# 2.0.
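For reference, here is a minimal sketch of the first example as a complete program, including the `using` directives the snippets above rely on. The class name `UnescapeDemo`, the method name `Unescape`, and the sample input are assumptions made for illustration; only the pattern and the replacement expression come from the answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Same pattern as in the examples above: a backslash, 'u' or 'U',
    // then exactly four uppercase hex digits.
    private static readonly Regex Rx = new Regex(@"\\[uU]([0-9A-F]{4})");

    // Hypothetical helper that wraps the Replace() call from Example 1.
    public static string Unescape(string input)
    {
        return Rx.Replace(input, match =>
            ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        // A verbatim string keeps the backslashes literal,
        // as they would be in text loaded from a file.
        string result = @"Top half: '\u2320', lower half: '\U2321'.";
        Console.WriteLine(Unescape(result));
    }
}
```

Whether the two integral halves display correctly in a console depends on its output encoding and font; the string itself contains the characters U+2320 and U+2321 either way.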

To break down what's going on here, first we create a regular expression:

new Regex( @"\\[uU]([0-9A-F]{4})" );

Then we call Replace() with the string 'result' and an anonymous method (a lambda expression in the first example and a delegate in the second; the delegate could also be a regular method) that converts each match of the regular expression found in the string.

The Unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

Get the string representing the number part of the escape (skip the first two characters, the backslash and the 'u'):

match.Value.Substring(2)

Parse that string using Int32.Parse(), which takes the string and the number format that Parse() should expect, in this case a hexadecimal number:

NumberStyles.HexNumber

Then we cast the resulting number to a Unicode character:

(char)

Finally, we call ToString() on the Unicode character, which gives us its string representation; that is the value passed back to Replace():

.ToString()
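The steps above can be sketched as separate statements, one per step. The class name `EscapeSteps`, the method name `ConvertEscape`, and the local variable names are assumptions for illustration; the calls themselves are the ones used in the answer.

```csharp
using System;
using System.Globalization;

public static class EscapeSteps
{
    // Converts one matched escape such as "\u2320" (six characters)
    // into a one-character string.
    public static string ConvertEscape(string matchValue)
    {
        string hexDigits = matchValue.Substring(2);                    // "2320"
        int codeUnit = Int32.Parse(hexDigits, NumberStyles.HexNumber); // 0x2320 = 8992
        char character = (char) codeUnit;                              // U+2320
        return character.ToString();
    }

    public static void Main()
    {
        string converted = ConvertEscape(@"\u2320");
        Console.WriteLine(converted.Length);    // 1
        Console.WriteLine((int) converted[0]);  // 8992
    }
}
```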

Note: Instead of grabbing the text to be converted with a Substring call, you could use the match parameter's GroupCollection and a subexpression in the regular expression to capture just the number ('2320'), but that's more complicated and less readable.

Refactored a little more:

Regex regex = new Regex (@"\\U([0-9A-F]{4})", RegexOptions.IgnoreCase);
string line = "...";
line = regex.Replace (line, match => ((char)int.Parse (match.Groups[1].Value,
  NumberStyles.HexNumber)).ToString ());

This is the VB.NET equivalent:

Dim rx As New RegularExpressions.Regex("\\[uU]([0-9A-Fa-f]{4})")
result = rx.Replace(result, Function(match) CChar(ChrW(Int32.Parse(match.Value.Substring(2), Globalization.NumberStyles.HexNumber))).ToString())

I think you had better add lowercase letters to the hex-digit class in your regular expression. It worked better for me.

Regex rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");
result = rx.Replace(result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString());
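A small sketch of why the lowercase range matters: an escape written with lowercase hex digits is not matched by the uppercase-only pattern at all, so it would be left in the string unconverted. The class name `HexCaseDemo`, the two helper names, and the sample escape are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexCaseDemo
{
    // Pattern from the first answer: uppercase hex digits only.
    public static bool MatchesUpperOnly(string s) =>
        Regex.IsMatch(s, @"\\[uU]([0-9A-F]{4})");

    // Pattern from this answer: both uppercase and lowercase hex digits.
    public static bool MatchesBothCases(string s) =>
        Regex.IsMatch(s, @"\\[uU]([0-9A-Fa-f]{4})");

    public static void Main()
    {
        string escape = @"\u00e9"; // contains the lowercase hex digit 'e'
        Console.WriteLine(MatchesUpperOnly(escape));  // False
        Console.WriteLine(MatchesBothCases(escape));  // True
    }
}
```

Passing RegexOptions.IgnoreCase, as the refactored version above does, is another way to get the same effect.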

Add a UnicodeExtensions.cs class to your project:

public static class UnicodeExtensions
{
    private static readonly Regex Regex = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    public static string UnescapeUnicode(this string str)
    {
        return Regex.Replace(str,
            match => ((char) int.Parse(match.Value.Substring(2),
                NumberStyles.HexNumber)).ToString());
    }
}

Usage:

var test = "\\u0074\\u0068\\u0069\\u0073 \\u0069\\u0073 \\u0074\\u0065\\u0073\\u0074\\u002e";
var output = test.UnescapeUnicode();   // output is => this is test.
