
Bash string lexicographical comparisons inconsistency

Bash manual section 6.4 describes [[ string1 < string2 ]] as

True if string1 sorts before string2 lexicographically in the current locale.

I am using a stock English-language Linux install and expected my current locale to collate in ASCII order, where a period (.) sorts before [0-9A-Za-z]. However, take a look at these:

$ echo $BASH_VERSION
4.3.11(1)-release
$ [[ "." < "1" ]] && echo "yes"
yes
$ [[ "A" < "B" ]] && echo "yes"
yes
$ [[ ".A" < "1B" ]] && echo "yes"
$

The 1st and 2nd comparisons agree with the ASCII table, but why is the 3rd one false? What exactly is this lexicographical sort order?

Here is the output of locale:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

This doesn't have much to do with your shell. To perform a locale-dependent lexicographic comparison of .A and 1B, bash simply calls strcoll(".A", "1B") and interprets the return value, that's all.

    {
#if defined (HAVE_STRCOLL)
      if (shell_compatibility_level > 40 && flags & TEST_LOCALE)
        return ((op[0] == '>') ? (strcoll (arg1, arg2) > 0) : (strcoll (arg1, arg2) < 0));
      else
#endif
        return ((op[0] == '>') ? (strcmp (arg1, arg2) > 0) : (strcmp (arg1, arg2) < 0));
    }

(copied from test.c)
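To see the locale dependence for yourself, you can run the same comparison in child shells started with different LC_ALL values. This is only a sketch: lt_in_locale is a helper name made up here, and it assumes bash is on your PATH. If en_US.UTF-8 is not installed, the child bash will most likely print a setlocale warning and fall back to byte order.

```shell
#!/usr/bin/env bash
# lt_in_locale LOCALE A B: succeed if A sorts before B under LOCALE.
# Hypothetical helper for illustration. A child bash is started so that the
# locale is taken from the environment when that shell starts up.
lt_in_locale() {
  LC_ALL=$1 bash -c '[[ $1 < $2 ]]' _ "$2" "$3"
}

# Compare the strings from the question under two locales.
for loc in C en_US.UTF-8; do
  if lt_in_locale "$loc" .A 1B; then
    echo "$loc: .A sorts before 1B"
  else
    echo "$loc: .A does not sort before 1B"
  fi
done
```

Under C the comparison is plain byte order, so .A sorts before 1B; under en_US.UTF-8 you should see the behavior from the question, provided that locale is available.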

The above excerpt also reveals that in order to force a byte-by-byte comparison without altering locale settings, one needs to change the shell compatibility level to 40 (which stands for 4.0, the last version of bash which behaves the way you expected by default).

$ shopt -s compat40
$ [[ .A < 1B ]] && echo yes
yes
$ 
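If you would rather not touch the compatibility level, another option (not from the bash source above, just a common idiom) is to override the locale only inside a small function. A minimal sketch, assuming bash re-reads the locale when LC_ALL is assigned with local; bytewise_lt is a hypothetical name:

```shell
#!/usr/bin/env bash
# bytewise_lt A B: succeed if A sorts before B in plain byte order.
# Hypothetical helper. The C locale is set only for the duration of the
# function, so the caller's locale variables are left unchanged.
bytewise_lt() {
  local LC_ALL=C
  [[ $1 < $2 ]]
}

bytewise_lt .A 1B && echo yes
```

Unlike compat40, this does change the locale, but only inside the function, and it leaves every other 4.0-compatibility behavior alone.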

Now, as to your question (The 1st and 2nd comparison agree with the ASCII table, but why the 3rd one false? What exactly is this lexicographical sort order?), well, it's your locale's collation order apparently. Under "What Collation is NOT", the UCA specification says:

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y

Which, I think, corroborates that this is not a bug but a feature.

Collation order in a UTF-8 locale such as en_US.UTF-8 doesn't go character by character, like traditional ASCIIbetical collation does. It uses a multi-level comparison, in which some types of differences are prioritized over others even if they occur later in the string. In this case, what you're seeing is the result of "base character" order ("1B" < "A", since digits sort before letters) being prioritized over a punctuation difference. Here's a quote from the standard:

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, the most important feature is the identity of the base letters, for example, the difference between an A and a B. Accent differences are typically ignored, if the base letters differ. Case differences (uppercase versus lowercase), are typically ignored, if the base letters or their accents differ. Treatment of punctuation varies. In some situations a punctuation character is treated like a base letter. In other situations, it should be ignored if there are any base, accent, or case differences. [...]

Here's an example showing the prioritization of punctuation vs "base characters":

$ printf '%s\n' {,.,-}{,1,A,AB,B,BA} | LANG=en_US.UTF-8 sort
-
.
-1
.1
1
-A
.A
A
-AB
.AB
AB
-B
.B
B
-BA
.BA
BA

Note that the punctuation only matters to break ties between lines containing the same base characters. You can also see similar effects involving capitalization and accents:

$ printf '%s\n' {a,A,B}{A,Å,B} | LANG=en_US.UTF-8 sort
aA
AA
aÅ
AÅ
aB
AB
BA
BÅ
BB

Note that the accent on the second character has higher priority than the capitalization of the first character (and punctuation anywhere in the string would have lower priority than either).

(And, of course, there are lots of other complications beyond this.)
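To make the "punctuation only breaks ties" idea concrete, here is a toy two-level comparison written in bash. It is emphatically not the UCA or what strcoll actually does: it only strips punctuation to build a first-level key, compares those keys in byte order, and falls back to the full strings on a tie. The function name toy_multilevel_lt is made up for this sketch, and accents and case are ignored entirely.

```shell
#!/usr/bin/env bash
# toy_multilevel_lt A B: succeed if A sorts before B under a toy
# two-level ordering. Illustration only, NOT the real collation algorithm.
toy_multilevel_lt() {
  local LC_ALL=C                     # byte order keeps the sketch deterministic
  local a=$1 b=$2
  local base_a=${a//[[:punct:]]/}    # level 1 key: punctuation stripped
  local base_b=${b//[[:punct:]]/}
  if [[ $base_a != "$base_b" ]]; then
    [[ $base_a < $base_b ]]          # base characters differ: they decide
  else
    [[ $a < $b ]]                    # tie on base characters: punctuation decides
  fi
}

# The three comparisons from the question.
for pair in ". 1" "A B" ".A 1B"; do
  set -- $pair
  if toy_multilevel_lt "$1" "$2"; then
    echo "$1 < $2: yes"
  else
    echo "$1 < $2: no"
  fi
done
```

Even this crude model reproduces the pattern in the question: . sorts before 1 and A before B, yet .A does not sort before 1B, because the first-level keys being compared are A and 1B.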
