Why does PostgreSQL sort partly case-sensitively and partly case-insensitively?

Question

I cannot understand the behavior of PostgreSQL (v11.10). Here is what I do:

create temp table test (first_name text, last_name text);
insert into test values
  ('Hanna', 'Beat'),
  ('JOAN', 'BEET'),
  ('Mark', 'Bernstein'),
  ('ALFRED', 'DOE'),
  ('henry', 'doe'),
  ('Henry', 'Doe'),
  ('Dennis', 'Doe');
select last_name, first_name from test order by last_name, first_name;

This is what I get:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

It looks like the sorting of the first three names is case-insensitive, but for the last four it's case-sensitive. Why is that so?

In other words, if the sorting were case-sensitive, I would expect the following order:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 Bernstein | Mark
 BEET      | JOAN
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

and if it were case-insensitive, I would expect this:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 DOE       | ALFRED
 Doe       | Dennis
 doe       | henry
 Doe       | Henry
(7 rows)

What I get instead is a mixture of both, and that baffles me...

For completeness:

# show lc_collate; show lc_ctype;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

  lc_ctype   
-------------
 en_US.UTF-8
(1 row)

Answer 1

Natural language collations are more complicated than you think. They use different comparison levels, where higher levels are used as tie-breakers when strings compare equal on a lower level. Typically, accents and case are ignored at the primary level. At the secondary level, accents are respected, but case is ignored. On the tertiary level, case and accents are respected.

So the strings Etat, état and etat would compare identical on the primary level. On the secondary level, état would be greater than the other two, which would be equal. On the tertiary level, etat would be less than Etat. All in all, we end up with

'etat' < 'Etat' < 'état'
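To check this ordering on your own system, a query along the following lines can be used. This is only a sketch: it assumes that a collation named "en_US" exists in your database, which depends on the operating system locales that were available when the cluster was created.

```sql
-- Sort the three example strings under a natural-language collation.
-- Assumption: a collation named "en_US" exists in this database
-- (check with: SELECT collname FROM pg_collation;).
SELECT s
FROM (VALUES ('Etat'), ('état'), ('etat')) AS t(s)
ORDER BY s COLLATE "en_US";
```

The rows should come back in the order described above, although the exact result depends on the collation data shipped with your C library.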

It is kind of arbitrary that upper case characters are greater than lower case characters, and with ICU collations you can configure most of these aspects.
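As an illustration of such a configuration, the following sketch defines an ICU collation that sorts upper case before lower case on the tertiary level. It assumes a PostgreSQL server (version 10 or later) built with ICU support; the collation name upper_first is made up for this example.

```sql
-- Assumption: the server was built with ICU support.
-- "upper_first" is a hypothetical name; the "kf-upper" key asks ICU
-- to put upper case first when strings differ only in case.
CREATE COLLATION upper_first (provider = icu, locale = 'en-u-kf-upper');

SELECT last_name, first_name
FROM test
ORDER BY last_name COLLATE upper_first, first_name COLLATE upper_first;
```

With older ICU library versions the locale may have to be written in the legacy form 'en@colCaseFirst=upper' instead.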

In your example, BEET is less than Bernstein on the primary level, so that is the order in which the strings are sorted. Case only comes into play for doe, Doe and DOE: they are equal on the primary and secondary levels, so the tertiary level breaks the tie.
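If you actually want one of the two orderings you expected, you can ask for it explicitly. The queries below are a sketch against the test table from the question: the first uses the built-in "C" collation, which compares byte values, and the second sorts by lower-cased copies of the columns.

```sql
-- Byte-wise comparison: every upper case ASCII letter sorts
-- before every lower case ASCII letter.
SELECT last_name, first_name
FROM test
ORDER BY last_name COLLATE "C", first_name COLLATE "C";
-- last_name order: BEET, Beat, Bernstein, DOE, Doe, Doe, doe

-- Case-insensitive comparison: sort by lower-cased copies of the columns.
SELECT last_name, first_name
FROM test
ORDER BY lower(last_name), lower(first_name);
```

Note that the "C" ordering differs from the case-sensitive order sketched in the question, because E has a smaller byte value than e. In the second query, the rows (doe, henry) and (Doe, Henry) compare equal, so their relative order is not guaranteed unless you add another sort key.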

2 2022-11-23 11:06:42
