
Same Chinese text not passing equality test

I am performing tests using the following two Chinese strings:

‎‎中國哲學書電子化計劃

...and...

中國哲學書電子化計劃

They look absolutely identical, but they're not. The following tests were performed in the Immediate window:

"‎‎中國哲學書電子化計劃" == "中國哲學書電子化計劃"
false
"‎‎中國哲學書電子化計劃".Length + " " + "中國哲學書電子化計劃".Length
"12 10"

Also:

"‎‎中國哲學書電子化計劃"[0]
8206 '‎'
"中國哲學書電子化計劃"[0]
20013 '中'

I think this may have something to do with surrogate pairs, but I don't understand why this happens. I find it very strange that exactly the same Chinese text can have different binary representations. Can anyone explain this phenomenon?

The first string contains invisible characters, so the default comparison sees two different strings. A culture-sensitive comparison, such as one using StringComparison.InvariantCulture, ignores those characters.

Look at this example:

var str1 = "‎‎中國哲學書電子化計劃";
var str2 = "中國哲學書電子化計劃";

Console.WriteLine("str1 == str2 -> {0}", str1 == str2);
Console.WriteLine("str1.Equals(str2, InvariantCulture) -> {0}", str1.Equals(str2, StringComparison.InvariantCulture));

Will give you the following output:

str1 == str2 -> False
str1.Equals(str2, InvariantCulture) -> True

As pointed out in another good answer here, code 8206 is a LEFT-TO-RIGHT MARK. More information can be found here.

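InvariantCulture comparisons disregard such characters. (Strictly speaking, U+200E is a format character, Unicode category Cf, rather than a control character.) More information can be found here. In contrast, ordinal comparisons (the default for == and for Equals without a StringComparison argument) compare the strings' UTF-16 code units one by one.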

If you want to 'sanitize' your strings by removing such characters, you do not need to iterate over every character yourself. Regex comes to your aid, using the Unicode category class \p{C}, like so:

var cleanString = Regex.Replace(dirtyString, @"\p{C}+", string.Empty);
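
One caveat with the pattern above: as far as I know, \p{C} in .NET covers all of the "Other" categories, which means not only control (Cc) and format (Cf) characters but also surrogate code units (Cs), private-use and unassigned code points. Because .NET regular expressions work on UTF-16 code units, that would also strip characters outside the Basic Multilingual Plane, such as rare CJK extension characters. A narrower sketch that removes only Cc and Cf is shown below; the helper name StripInvisible is made up for illustration. Note that Cc includes tab and newline, so drop \p{Cc} from the class if those must survive.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: remove only control (Cc) and format (Cf) characters.
// Surrogates (Cs) are left alone so supplementary characters stay intact.
string StripInvisible(string s) =>
    Regex.Replace(s, @"[\p{Cc}\p{Cf}]+", string.Empty);

// The two LEFT-TO-RIGHT MARKs are written as escapes so they are visible in source.
string dirty = "\u200E\u200E中國哲學書電子化計劃";
string clean = StripInvisible(dirty);

Console.WriteLine(dirty.Length);                     // 12
Console.WriteLine(clean.Length);                     // 10
Console.WriteLine(clean == "中國哲學書電子化計劃");   // True
```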

The character with code 8206 in decimal is U+200E LEFT-TO-RIGHT MARK, and there are two copies of that character at the start of the first string. This explains the results: the first string is two characters longer, and its first character is not '中'.
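
A quick way to spot such invisible characters is to print every UTF-16 code unit together with its Unicode category. The following is only a minimal sketch; the helper name Describe is an assumption for illustration, and the test string is shortened to one visible character.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: list each UTF-16 code unit as U+XXXX:Category.
string Describe(string s) =>
    string.Join(" ", s.Select(c => $"U+{(int)c:X4}:{CharUnicodeInfo.GetUnicodeCategory(c)}"));

// Two LEFT-TO-RIGHT MARKs (written as escapes) followed by one CJK character.
Console.WriteLine(Describe("\u200E\u200E中"));
// U+200E:Format U+200E:Format U+4E2D:OtherLetter
```

Running this over the pasted string from the question should show the two Format entries at the front, which the debugger output 8206 already hinted at.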

What you should do depends on where the data comes from and what will be done with it. U+200E as such should not cause harm, and it may be needed in some situations, but the odds are that it is unintentional here. If such characters may appear in the data, you should ask what other control characters might appear there and what should be done with them. It may be suitable to remove them, or you might need to just do comparisons in a manner that ignores them (e.g., internally constructing copies of the strings with control characters removed and then comparing them).
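
The "compare copies with the characters removed" approach could be sketched roughly as follows. The helper names are hypothetical, and the sketch assumes that only format characters (category Cf, which includes U+200E) should be ignored; widen the filter if other categories turn up in your data.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: copy of s without Unicode format (Cf) characters.
string WithoutFormatChars(string s) =>
    new string(s.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format).ToArray());

// Hypothetical helper: ordinal comparison of the cleaned copies.
// The original strings are left untouched.
bool EqualsIgnoringFormatChars(string a, string b) =>
    string.Equals(WithoutFormatChars(a), WithoutFormatChars(b), StringComparison.Ordinal);

Console.WriteLine(EqualsIgnoringFormatChars("\u200E\u200E中國", "中國")); // True
Console.WriteLine(EqualsIgnoringFormatChars("中國", "中国"));             // False
```

Unlike a culture-sensitive comparison, this keeps the comparison otherwise exact, so only the characters you chose to filter out are ignored.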

This specific issue has nothing to do with surrogate pairs.
