How to recognize if a string contains Unicode chars?

I have a string and I want to know whether it contains any Unicode characters (that is, whether it consists entirely of ASCII characters or not).

How can I achieve that?

Thanks!

If my assumptions are correct, you wish to know whether your string contains any "non-ANSI" characters. You can determine this as follows.

    // Required namespaces: System (Console) and System.Linq (Any).
    using System;
    using System.Linq;

    public void test()
    {
        const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
        const string WithoutUnicodeCharacter = "an ANSI character:Æ";

        bool hasUnicode;

        // true: U+FB2F has a code above 255
        hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
        Console.WriteLine(hasUnicode);

        // false: Æ is U+00C6 (198), which is within 0-255
        hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
        Console.WriteLine(hasUnicode);
    }

    // Returns true if any character in the input has a code above 255.
    public bool ContainsUnicodeCharacter(string input)
    {
        const int MaxAnsiCode = 255;

        return input.Any(c => c > MaxAnsiCode);
    }

Update

This check allows for extended ASCII (codes up to 255). If you test only against the true ASCII character range (up to 127), you could get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.
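To make the difference between the two thresholds concrete, here is a minimal sketch that runs the same kind of check against both limits. The class name `ThresholdDemo` and the helper `ContainsCharAbove` are hypothetical names introduced only for this illustration; they are not part of the answer above.

```csharp
using System;
using System.Linq;

public static class ThresholdDemo
{
    // Hypothetical helper: true if any UTF-16 code unit in the input exceeds maxCode.
    public static bool ContainsCharAbove(string input, int maxCode)
    {
        return input.Any(c => c > maxCode);
    }

    public static void Main()
    {
        // Æ is U+00C6 (decimal 198): above 127 but not above 255.
        const string sample = "an ANSI character:\u00C6";

        Console.WriteLine(ContainsCharAbove(sample, 255)); // False
        Console.WriteLine(ContainsCharAbove(sample, 127)); // True
    }
}
```

With the limit at 255 the sample is treated as plain extended ASCII, while the limit of 127 flags it, which is the false positive the update describes.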

If a string contains only ASCII characters, a serialization + deserialization step using ASCII encoding should give back the same string, so a one-liner check in C# could look like this:

    string s1 = "testभारत";
    // Encoding.ASCII replaces every non-ASCII character with '?', so the round trip changes the string.
    bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;

ASCII defines only character codes in the range 0-127. Unicode is explicitly defined to coincide with ASCII in that same range. Thus, if you look at the character codes in your string and find anything higher than 127, the string contains Unicode characters that are not ASCII characters.

Note that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.

The ANSI character set [*] extends ASCII with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not fully coincide with ANSI in that range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
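The mismatch in the range 128-159 can be illustrated with a small sketch. It relies on two facts: Unicode assigns control characters to U+0080 through U+009F, and the euro sign, which Windows-1252 stores as byte 0x80, is U+20AC in Unicode. The class name `AnsiGapDemo` and the helper `ContainsCharAbove255` are hypothetical names used only here.

```csharp
using System;
using System.Linq;

public static class AnsiGapDemo
{
    // Hypothetical helper mirroring the "code above 255" check from the first answer.
    public static bool ContainsCharAbove255(string input)
    {
        return input.Any(c => c > 255);
    }

    public static void Main()
    {
        // U+0085 lies in 128-159; in Unicode it is a control character, not a printable ANSI glyph.
        Console.WriteLine(char.IsControl('\u0085')); // True

        // The euro sign belongs to Windows-1252 (byte 0x80) but has Unicode code point U+20AC.
        Console.WriteLine((int)'\u20AC'); // 8364

        // So a string that is fully representable in Windows-1252 is still flagged by the check.
        Console.WriteLine(ContainsCharAbove255("5 \u20AC")); // True
    }
}
```

In other words, comparing `char` values against 255 tests for Latin-1 code points rather than for membership in Windows-1252.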

As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.

[*] Also known as Latin 1 Windows (Win-1252)

All C# / VB.NET string data types contain Unicode characters.

As long as it contains characters, it contains Unicode characters.

From System.String:

Represents text as a series of Unicode characters.

public static bool ContainsUnicodeChars(string text)
{
   return !string.IsNullOrEmpty(text);
}

You normally have to worry about different Unicode encodings only when you have to:

  1. Encode a string into a stream of bytes with a particular encoding.
  2. Decode a string from a stream of bytes with a particular encoding.

Once you're in string land, though, the encoding that the string was originally represented with, if any, is irrelevant.
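As a rough illustration of the two steps above, the following sketch encodes the same string with two different encodings and decodes it again. The class name `EncodingRoundTripDemo` is a hypothetical name chosen for this example; `Encoding.UTF8` and `Encoding.Unicode` (UTF-16, little endian) are the standard `System.Text` encodings.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    public static void Main()
    {
        const string text = "\u20AC"; // euro sign, a single char in a .NET string

        // Step 1: encode the string into bytes with a particular encoding.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] utf16 = Encoding.Unicode.GetBytes(text);

        Console.WriteLine(utf8.Length);  // 3
        Console.WriteLine(utf16.Length); // 2

        // Step 2: decode the bytes back into a string with the same encoding.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text);     // True
        Console.WriteLine(Encoding.Unicode.GetString(utf16) == text); // True
    }
}
```

The byte representations differ, but both decode to the same string, which is why the original encoding stops mattering once the text is a `string`.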

Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
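One consequence of the UTF-16 representation is that a single code point outside the Basic Multilingual Plane occupies two `Char` objects (a surrogate pair). The sketch below shows this for U+1F600; the class name `SurrogateDemo` is a hypothetical name used only for this illustration.

```csharp
using System;

public static class SurrogateDemo
{
    public static void Main()
    {
        // One code point (U+1F600) that does not fit into a single 16-bit Char.
        const string s = "\U0001F600";

        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True

        // ConvertToUtf32 combines the surrogate pair back into the code point.
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1F600
    }
}
```

Any per-`char` comparison, such as testing against 127 or 255, therefore sees the two surrogate code units rather than the code point itself; both are well above 255, so such strings are still reported as non-ASCII.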

Perhaps you might also find these questions relevant:

How can you strip non-ASCII characters from a string? (in C#)

C# Ensure string contains only ASCII

And this article by Jon Skeet: Unicode and .NET

This is another solution that does not use lambda expressions. It is in VB.NET, but you can easily convert it to C#:

    Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
        Dim inputCharArray() As Char = inputstr.ToCharArray

        For i As Integer = 0 To inputCharArray.Length - 1
            If CInt(AscW(inputCharArray(i))) > 255 Then Return True
        Next
        Return False
    End Function
