
Why does string.Compare seem to handle accented characters inconsistently?

If I execute the following statement:

string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)

The result is '-1', indicating that 'mun' sorts before 'mün'.

However, if I execute this statement:

string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)

I get '1', indicating that 'Muntelier, Schweiz' should go last.

Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?


The reason this is an issue is that I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.

Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.

But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.

This seems to be because of the inconsistent ordering of accented characters, depending on what characters come after the accented character.


OK, I think I've fixed the problem.

Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
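A minimal sketch of what that fix might look like, assuming the same invariant-culture, case-insensitive comparison as in the question. The asker's actual function is not shown, so the names `PrefixSearch`, `ComparePrefix` and `Filter` are hypothetical. The key idea is that the sort and the binary search both use the same comparison, limited to the first n characters:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class PrefixSearch
{
    // Compares at most the first n characters of x and y,
    // ignoring case, using the invariant culture.
    public static int ComparePrefix(string x, string y, int n) =>
        string.Compare(x, 0, y, 0, n, true, CultureInfo.InvariantCulture);

    // Returns every item whose first n characters compare equal to the prefix,
    // where n is the length of the prefix.
    public static List<string> Filter(List<string> items, string prefix)
    {
        int n = prefix.Length;

        // Sort a copy by the first n characters only, so that all items
        // sharing an equal prefix end up next to each other.
        var sorted = new List<string>(items);
        sorted.Sort((a, b) => ComparePrefix(a, b, n));

        // Binary search for the first item whose prefix is not less than the search string.
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (ComparePrefix(sorted[mid], prefix, n) < 0) lo = mid + 1;
            else hi = mid;
        }

        // Collect the contiguous run of items whose prefix compares equal.
        var matches = new List<string>();
        for (int i = lo; i < sorted.Count && ComparePrefix(sorted[i], prefix, n) == 0; i++)
            matches.Add(sorted[i]);
        return matches;
    }

    static void Main()
    {
        var cities = new List<string>
        {
            "Muntelier, Schweiz", "München, Deutschland", "Munich", "Berlin", "Zürich"
        };
        foreach (string city in Filter(cities, "mun"))
            Console.WriteLine(city);
    }
}
```

Because the comparison never looks past the first n characters, the characters after an accented letter can no longer move an item out of its group. Note that this sketch re-sorts for every search string, and that it counts UTF-16 code units, so a decomposed "ü" (u followed by a combining mark) would take up two of the n positions.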

There is a tie-breaking algorithm at work; see http://unicode.org/reports/tr10/

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, for example, the most important feature is the base character: such as the difference between an A and a B. Accent differences are typically ignored, if there are any differences in the base letters. Case differences (uppercase versus lowercase), are typically ignored, if there are any differences in the base or accents. Punctuation is variable. In some situations a punctuation character is treated like a base character. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level, whereby if there are no other differences at all in the string, the (normalized) code point order is used.

So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".

Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the tie-breaking levels decide the order.
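The multilevel idea can be illustrated with a toy two-level comparer. This is only a sketch and not what .NET or the Unicode Collation Algorithm actually does: real collation uses weight tables, while this hypothetical `TwoLevelCompare` class just strips combining marks for the first level and falls back to the decomposed code units as a tie-break.

```csharp
using System;
using System.Globalization;
using System.Text;

static class TwoLevelCompare
{
    // Decomposes the string and drops combining marks, so "ü" becomes "u".
    // Also lowercases, so case is ignored at the first level.
    public static string BaseLetters(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().ToLowerInvariant();
    }

    public static int Compare(string x, string y)
    {
        // Level 1: base letters only. Accents are ignored here.
        int primary = string.CompareOrdinal(BaseLetters(x), BaseLetters(y));
        if (primary != 0) return Math.Sign(primary);

        // Level 2 (tie-break): only reached when the base letters are identical.
        return Math.Sign(string.CompareOrdinal(
            x.Normalize(NormalizationForm.FormD),
            y.Normalize(NormalizationForm.FormD)));
    }

    static void Main()
    {
        Console.WriteLine(Compare("mun", "mün"));   // -1: base letters equal, accent breaks the tie
        Console.WriteLine(Compare("muna", "münb")); // -1: decided by "a" before "b"
        Console.WriteLine(Compare("munb", "müna")); //  1: decided by "b" after "a"
    }
}
```

The accent only matters in the first call, where nothing else distinguishes the two strings. In the other two calls the trailing base letter decides the order before the accent is ever considered.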

It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.

Here's some sample code to demonstrate:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        Compare("mun", "mün");
        Compare("muna", "münb");
        Compare("munb", "müna");
    }

    static void Compare(string x, string y)
    {
        int result = string.Compare(x, y, true,
                                    CultureInfo.InvariantCulture);

        Console.WriteLine("{0}; {1}; {2}", x, y, result);
    }
}

(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)

Results:

mun; mün; -1
muna; münb; -1
munb; müna; 1

I suspect this is correct according to various complicated Unicode rules - but I don't know enough about them.

As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?

As I understand it, this is still somewhat consistent. When comparing using CultureInfo.InvariantCulture, the umlaut character ü is treated like the non-accented character u.

As the strings in your first example obviously are not equal, the result will not be 0 but -1 (which seems to be a default value). In the second example, Muntelier goes last because t follows c in the alphabet.

I couldn't find any clear documentation in MSDN explaining these rules, but I found that

string.Compare("mun", "mün", CultureInfo.InvariantCulture,  
    CompareOptions.StringSort);

and

string.Compare("Muntelier, Schweiz", "München, Deutschland", 
    CultureInfo.InvariantCulture, CompareOptions.StringSort);

give the desired result.

Anyway, I think you'd be better off basing your sorting on a specific culture, such as the current user's culture (if possible).
