Same Chinese text not passing equality test

I am performing tests using the following two Chinese strings:

‎‎中國哲學書電子化計劃

...and...

中國哲學書電子化計劃

They look absolutely identical, but they're not. The following tests were performed in the Immediate window:

"‎‎中國哲學書電子化計劃" == "中國哲學書電子化計劃"
false
"‎‎中國哲學書電子化計劃".Length + " " + "中國哲學書電子化計劃".Length
"12 10"

Also:

"‎‎中國哲學書電子化計劃"[0]
8206 '‎'
"中國哲學書電子化計劃"[0]
20013 '中'

I think this may have something to do with surrogate pairs but I haven't understood why this happens. I'm finding it very strange that you can represent exactly the same text in Chinese using different binary representations. Can anyone kindly explain this phenomenon?

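You have invisible characters in there (strictly speaking they are format characters rather than control characters), so a plain == comparison fails. One option is to pass StringComparison.InvariantCulture when comparing the strings.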

Look at this example:

var str1 = "‎‎中國哲學書電子化計劃";
var str2 = "中國哲學書電子化計劃";

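// == on strings is an ordinal comparison, so the two leading marks make the strings differ.
Console.WriteLine("ordinal (==)     -> {0}", str1 == str2);
// A culture-sensitive comparison treats U+200E as ignorable.
Console.WriteLine("InvariantCulture -> {0}", str1.Equals(str2, StringComparison.InvariantCulture));

This gives the following output:

ordinal (==)     -> False
InvariantCulture -> True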

As pointed out in another good answer here, code 8206 is U+200E LEFT-TO-RIGHT MARK.

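InvariantCulture comparisons, like other culture-sensitive comparisons, disregard ignorable characters such as this one. In contrast, ordinal comparisons (which is what == and Equals(string) perform) compare the numeric values of the UTF-16 code units one by one, so every character counts, visible or not.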

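If you want to 'sanitize' your strings by removing such characters, you do not need to iterate over every character yourself; Regex can do it for you. Note that \p{C} matches the whole Unicode "Other" category, which covers format characters like U+200E as well as true control characters: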

var cleanString = Regex.Replace(dirtyString, @"\p{C}+", string.Empty);
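One caveat with \p{C}: besides format characters it also covers control characters such as tab and newline, and, as far as I know, in .NET it also matches surrogate code units (category Cs), so it could damage rare Chinese characters that live outside the Basic Multilingual Plane. A narrower sketch that strips only format characters (category Cf) might look like this. StripFormatChars is a made-up helper name, the code assumes C# 9 or later for top-level statements, and the mark is written as \u200E so it stays visible in the source. Keep in mind that Cf also contains characters that can be meaningful, such as the zero-width joiner.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes only format characters (category Cf), e.g. U+200E.
// Control characters such as tab or newline are left alone.
string StripFormatChars(string s) => Regex.Replace(s, @"\p{Cf}+", string.Empty);

string dirty = "\u200E\u200E中國哲學書電子化計劃";
string clean = StripFormatChars(dirty);

Console.WriteLine(clean.Length);                    // 10
Console.WriteLine(clean == "中國哲學書電子化計劃"); // True
```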

The character with code 8206 in decimal is U+200E LEFT-TO-RIGHT MARK, and there are two copies of that character at the start of the first string. This explains the results.
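To see this for yourself, you can dump each UTF-16 code unit of a suspicious string together with its Unicode category. This is only a small diagnostic sketch: Describe is a made-up helper name, the code assumes C# 9 or later for top-level statements, and the sample string is shortened to the two marks plus the first visible character.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: formats one UTF-16 code unit as "U+XXXX Category".
string Describe(char c) =>
    string.Format("U+{0:X4} {1}", (int)c, CharUnicodeCategory.GetUnicodeCategory(c));

// The marks are written as \u200E escapes so they stay visible in the source.
string text = "\u200E\u200E中";

foreach (char unit in text)
{
    Console.WriteLine(Describe(unit));
}
// Prints:
// U+200E Format
// U+200E Format
// U+4E2D OtherLetter
```

Anything reported as Format or Control in the middle of otherwise ordinary text is a good candidate for the kind of invisible character that breaks an ordinal comparison.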

What you should do depends on where the data comes from and what will be done with it. U+200E as such should not cause harm, and it may be needed in some situations, but the odds are that it is unintentional here. If such characters may appear in the data, you should ask what other invisible characters might appear there and what should be done with them. It may be suitable to remove them, or you might need to just do comparisons in a manner that ignores them (e.g., internally constructing copies of the strings with those characters removed and then comparing the copies).
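The "compare cleaned copies" idea could be sketched roughly as follows. Both helper names are made up for illustration, the choice to drop exactly the Control and Format categories is an assumption you would adjust to your data, and the code assumes C# 9 or later for top-level statements. The original strings are left untouched; only temporary copies are compared.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: copy of s without control (Cc) and format (Cf) characters.
string WithoutInvisibles(string s) =>
    string.Concat(s.Where(c =>
    {
        UnicodeCategory category = CharUnicodeCategory.GetUnicodeCategory(c);
        return category != UnicodeCategory.Control && category != UnicodeCategory.Format;
    }));

// Hypothetical helper: ordinal comparison of the cleaned copies.
bool EqualsIgnoringInvisibles(string a, string b) =>
    string.Equals(WithoutInvisibles(a), WithoutInvisibles(b), StringComparison.Ordinal);

Console.WriteLine(EqualsIgnoringInvisibles("\u200E\u200E中國", "中國")); // True
Console.WriteLine(EqualsIgnoringInvisibles("中國", "中国"));             // False
```

Unlike a culture-sensitive comparison, this stays ordinal for everything else, so traditional and simplified forms, case, and accents are still treated as different.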

This specific issue has nothing to do with surrogate pairs.
