How can I split a Unicode string into multiple Unicode characters in C#?

If I have a string like "😀123👨‍👩‍👧‍👦", how can I split it into an array that looks like ["😀", "1", "2", "3", "👨‍👩‍👧‍👦"]? If I use ToCharArray(), the first emoji is split into 2 chars and the second into 11 (seven code points, four of which need a surrogate pair).

Update

The solution now looks like this:

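// Requires: using System.Collections.Generic;
public static List<string> GetCharacters(string text)
{
    char[] ca = text.ToCharArray();
    List<string> characters = new List<string>();
    for (int i = 0; i < ca.Length; i++)
    {
        char c = ca[i];
        // A high surrogate followed by a low surrogate encodes one code point.
        // The bounds check avoids reading past the end on an unpaired surrogate.
        if (char.IsHighSurrogate(c) && i + 1 < ca.Length && char.IsLowSurrogate(ca[i + 1]))
        {
            i++;
            characters.Add(new string(new[] { c, ca[i] }));
        }
        else
            characters.Add(new string(new[] { c }));
    }
    return characters;
}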

Please note that, as mentioned in the comments, it doesn't work for the family emoji. It only works for emoji that are a single code point, that is, at most 2 chars. The family emoji is four person emoji joined by zero-width joiners (U+200D), so the output of the example would be: ["😀", "1", "2", "3", "👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦"]

.NET represents strings as a sequence of UTF-16 code units. Unicode code points outside the Basic Multilingual Plane (BMP) are split into a high and a low surrogate. The lower 10 bits of each surrogate hold half of the code point value minus 0x10000.
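To make the surrogate arithmetic concrete, here is a minimal sketch that recombines a pair by hand and compares the result with the built-in char.ConvertToUtf32. The class and method names (SurrogateMath, Combine) are made up for illustration.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combine a high/low surrogate pair into its code point.
    public static int Combine(char high, char low)
    {
        if (!char.IsSurrogatePair(high, low))
            throw new ArgumentException("Not a valid surrogate pair.");
        // Each surrogate carries 10 payload bits; 0x10000 is added back at the end.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void Main()
    {
        string smiley = "\uD83D\uDE00"; // U+1F600 written as its two UTF-16 code units
        int manual = Combine(smiley[0], smiley[1]);
        int builtIn = char.ConvertToUtf32(smiley[0], smiley[1]);
        Console.WriteLine($"{manual:X} {builtIn:X}"); // prints "1F600 1F600"
    }
}
```

In real code you would normally call char.ConvertToUtf32 directly; the manual version is only there to show where the two 10-bit halves end up.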

There are helpers to detect these surrogates (e.g. Char.IsLowSurrogate).

You need to handle the pairing yourself.
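As a rough sketch of what handling it yourself can look like, the following decodes a string into code points with char.IsSurrogatePair and char.ConvertToUtf32. The names CodePoints and Decode are invented for this example. Note that it stops at the code point level, so the zero-width joiners in the family emoji still come out as separate entries.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: decode a UTF-16 string into Unicode code points.
    public static List<int> Decode(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                result.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                result.Add(text[i]); // BMP character (or an unpaired surrogate)
            }
        }
        return result;
    }

    public static void Main()
    {
        // The string from the question, written with escapes:
        // grinning face, "123", then man ZWJ woman ZWJ girl ZWJ boy.
        string input = "\U0001F600123\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";
        List<int> codePoints = Decode(input);
        Console.WriteLine(input.Length);     // 16 UTF-16 code units
        Console.WriteLine(codePoints.Count); // 11 code points
    }
}
```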

There is a solution which seems to work for the input you specified:

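// Requires: using System.Collections.Generic; using System.Globalization; using System.Linq;
static string[] SplitIntoTextElements(string input)
{
    // Local iterator that yields one text element (grapheme cluster) at a time.
    IEnumerable<string> Helper()
    {
        for (var en = StringInfo.GetTextElementEnumerator(input); en.MoveNext();)
            yield return en.GetTextElement();
    }
    return Helper().ToArray();
}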


PS: This solution should work on .NET 5+. Earlier .NET versions contain a bug that prevents correct splitting: their text element enumerator does not recognize every grapheme cluster, so sequences such as the family emoji are still broken apart.
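For comparison, here is a sketch of the same split built on StringInfo.ParseCombiningCharacters, which returns the start index of each text element. The names TextElements and Split are assumptions of this example. It relies on the same text element rules as the enumerator, so the version caveat applies here too: on .NET 5 or later the sample string should come out as five elements, while older runtimes are expected to report more.

```csharp
using System;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: cut the string at the text element boundaries
    // reported by StringInfo.ParseCombiningCharacters.
    public static string[] Split(string input)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(input);
        var parts = new string[starts.Length];
        for (int i = 0; i < starts.Length; i++)
        {
            // Each element runs up to the next start index, or to the end of the string.
            int end = i + 1 < starts.Length ? starts[i + 1] : input.Length;
            parts[i] = input.Substring(starts[i], end - starts[i]);
        }
        return parts;
    }

    public static void Main()
    {
        // The string from the question, written with escapes.
        string input = "\U0001F600123\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";
        string[] parts = Split(input);
        Console.WriteLine(parts.Length);
        // Same count, obtained directly from StringInfo.
        Console.WriteLine(new StringInfo(input).LengthInTextElements);
    }
}
```

If only the count is needed, StringInfo.LengthInTextElements avoids building the array at all.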
