How to recognize if a string contains Unicode chars?
I have a string and I want to know whether it contains Unicode characters (that is, whether it consists entirely of ASCII or not).
How can I achieve that?
Thanks!
If my assumptions are correct, you wish to know whether your string contains any "non-ANSI" characters. You can determine this as follows.
// Requires: using System; using System.Linq;
public void test()
{
    const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
    const string WithoutUnicodeCharacter = "an ANSI character:Æ";
    bool hasUnicode;

    // true
    hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
    Console.WriteLine(hasUnicode);

    // false
    hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
    Console.WriteLine(hasUnicode);
}

public bool ContainsUnicodeCharacter(string input)
{
    const int MaxAnsiCode = 255;
    // True if any character's code lies above the extended (8-bit) range.
    return input.Any(c => c > MaxAnsiCode);
}
Update
This check treats the extended ASCII range (codes up to 255) as non-Unicode. If you test only against the true ASCII range (up to 127), you could get false positives for extended ASCII characters, which do not necessarily denote Unicode. I have alluded to this in my sample.
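To make the difference between the two cutoffs concrete, here is a minimal sketch that takes the limit as a parameter. The names `CharRangeCheck` and `ContainsCharAbove` are hypothetical and not part of the answer above; the null check is also an addition.

```csharp
using System;
using System.Linq;

public static class CharRangeCheck
{
    public const int MaxAsciiCode = 127;     // strict 7-bit ASCII
    public const int MaxExtendedCode = 255;  // cutoff used in the answer above

    // Returns true if any UTF-16 code unit in the input is above the given limit.
    public static bool ContainsCharAbove(string input, int limit)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        return input.Any(c => c > limit);
    }
}
```

With this sketch, a string containing only "Æ" (code 198) is flagged when the limit is `MaxAsciiCode` but not when it is `MaxExtendedCode`.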
If a string contains only ASCII characters, a serialization and deserialization round trip using ASCII encoding should return the same string, so a one-line check in C# could look like this:
string s1 = "testभारत";
// Encoding.ASCII replaces every non-ASCII character with '?', so the round trip changes the string.
bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;
ASCII defines only character codes in the range 0-127. Unicode is explicitly defined so as to overlap with ASCII in that same range. Thus, if you look at the character codes in your string and find anything higher than 127, the string contains Unicode characters that are not ASCII characters.
Note that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.
The ANSI character set [*] extends the ASCII characters with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not coincide with ANSI across that whole range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
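As a small illustration of that mismatch, consider the euro sign: it is byte 0x80 (128) in Windows-1252, while its Unicode code point is U+20AC, and Unicode's own U+0080 is a control character. The sketch below only records those two facts; the class and member names are hypothetical.

```csharp
using System;

public static class AnsiMismatchDemo
{
    // The euro sign is byte 0x80 in Windows-1252, but its Unicode code point is U+20AC.
    public const char EuroSign = '\u20AC';

    // In Unicode, code 128 (U+0080) is a C1 control character, not the euro sign.
    public const char UnicodeCode128 = '\u0080';

    // Same kind of test as a "code above 255" check.
    public static bool IsAbove255(char c)
    {
        return c > 255;
    }
}
```

Here `AnsiMismatchDemo.IsAbove255(AnsiMismatchDemo.EuroSign)` is true even though the euro sign is an ANSI (Win-1252) character, and `char.IsControl(AnsiMismatchDemo.UnicodeCode128)` is true.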
As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.
[*] Also known as Latin 1 Windows (Win-1252)
All C# / VB.NET string data types contain Unicode characters.
As long as a string contains characters, it contains Unicode characters.
From System.String:
Represents text as a series of Unicode characters.
public static bool ContainsUnicodeChars(string text)
{
    return !string.IsNullOrEmpty(text);
}
You normally have to worry about different Unicode encodings when you have to:
Once you're in string land, though, the encoding that the string was originally represented with, if any, is irrelevant.
Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
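One consequence of the UTF-16 representation is that a code point above U+FFFF occupies two Char objects (a surrogate pair). The sketch below, with the hypothetical names `CodePointDemo` and `GetCodePoints`, lists the code points of a string; it assumes the input is well-formed UTF-16, since `char.ConvertToUtf32` throws on an unpaired surrogate.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo
{
    // Returns the Unicode code points of a string, combining surrogate pairs.
    // Assumes well-formed UTF-16; an unpaired surrogate raises ArgumentException.
    public static List<int> GetCodePoints(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var codePoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            codePoints.Add(char.ConvertToUtf32(text, i));

            // A high surrogate is followed by its low surrogate; skip it.
            if (char.IsHighSurrogate(text[i])) i++;
        }
        return codePoints;
    }
}
```

For example, the string "\U0001F600" has a Length of 2 but yields a single code point, 0x1F600.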
Perhaps you might also find these questions relevant:
How can you strip non-ASCII characters from a string? (in C#)
C# Ensure string contains only ASCII
And this article by Jon Skeet: Unicode and .NET
This is another solution that does not use lambda expressions. It is in VB.NET, but you can easily convert it to C#:
Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
    Dim inputCharArray() As Char = inputstr.ToCharArray
    For i As Integer = 0 To inputCharArray.Length - 1
        If CInt(AscW(inputCharArray(i))) > 255 Then Return True
    Next
    Return False
End Function
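For readers who prefer C#, a rough equivalent of the VB.NET function might look like the sketch below. The class name `UnicodeCheck` is hypothetical, and the explicit null check is an addition that the VB.NET version does not have.

```csharp
using System;

public static class UnicodeCheck
{
    // Loop-based check without lambda expressions:
    // true if any character's code is above 255.
    public static bool ContainsUnicode(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        foreach (char c in input)
        {
            if (c > 255) return true;
        }
        return false;
    }
}
```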