
Why does PostgreSQL sort part case-sensitively and part case-insensitively?

I cannot understand the behavior of PostgreSQL (v11.10). Here is what I do:

create temp table test (first_name text, last_name text);
insert into test values
  ('Hanna', 'Beat'),
  ('JOAN', 'BEET'),
  ('Mark', 'Bernstein'),
  ('ALFRED', 'DOE'),
  ('henry', 'doe'),
  ('Henry', 'Doe'),
  ('Dennis', 'Doe');
select last_name, first_name from test order by last_name, first_name;

This is what I get.

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

It looks like the sorting of the first three names is case-insensitive, but for the last four it's case-sensitive. Why is that so?

In other words, if the sorting were case-sensitive, I would expect the following order:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 Bernstein | Mark
 BEET      | JOAN
 doe       | henry
 Doe       | Dennis
 Doe       | Henry
 DOE       | ALFRED
(7 rows)

and if it were case-insensitive, I would expect this:

 last_name | first_name 
-----------+------------
 Beat      | Hanna
 BEET      | JOAN
 Bernstein | Mark
 DOE       | ALFRED
 Doe       | Dennis
 doe       | henry
 Doe       | Henry
(7 rows)

What I get instead is a mixture of both, and that baffles me...

For completeness:

# show lc_collate; show lc_ctype;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

  lc_ctype   
-------------
 en_US.UTF-8
(1 row)

Natural language collations are more complicated than you might think. They compare strings on several levels, where each higher level is used only as a tie-breaker when the strings compare equal on all lower levels. Typically, accents and case are ignored on the primary level. On the secondary level, accents are respected, but case is still ignored. On the tertiary level, both case and accents are respected.

So the strings Etat, état and etat would compare identical on the primary level. On the secondary level, état would be greater than the other two, which would be equal. On the tertiary level, etat would be less than Etat. All in all, we end up with

'etat' < 'Etat' < 'état'

It is kind of arbitrary that upper case characters are greater than lower case characters, and with ICU collations you can configure most of these aspects.
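As an illustration of that configurability, here is a minimal sketch of a custom ICU collation that puts upper case first. It assumes the server was built with ICU support and uses a reasonably recent ICU library; the collation name upperfirst is only an example name, and 'en-u-kf-upper' is the ICU locale setting for "upper case first".

```sql
-- Assumption: PostgreSQL was built with ICU support (provider = icu is available).
-- "upperfirst" is an illustrative name; 'kf-upper' asks ICU to sort
-- upper-case letters before lower-case letters when breaking ties on case.
create collation upperfirst (provider = icu, locale = 'en-u-kf-upper');

-- Apply the collation to each sort key of the query from the question.
select last_name, first_name
from test
order by last_name collate upperfirst, first_name collate upperfirst;
```

With such a collation the primary-level order (Beat, BEET, Bernstein) should stay the same, and only the tie-breaking among doe, Doe and DOE would be expected to flip.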

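In your example, Beat, BEET and Bernstein already differ on the primary level (a < e < r in the third letter), so case never comes into play and that is the order in which they are sorted. The strings doe, Doe and DOE, on the other hand, are identical on the primary and secondary levels, so they are ordered by the tertiary level, where lower case sorts before upper case. That is why the result looks case-insensitive for the first three rows and case-sensitive for the last four.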
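For comparison, the two "pure" orderings can be requested explicitly. This is a minimal sketch, assuming the temp table test from the question still exists in the session; "C" is PostgreSQL's built-in byte-order collation, and lower() is used here as one simple way to ignore case.

```sql
-- Strictly byte-wise ordering: in ASCII/UTF-8 every upper-case letter (A-Z)
-- has a smaller code than every lower-case letter (a-z).
select last_name, first_name
from test
order by last_name collate "C", first_name collate "C";
-- last_name comes out as: BEET, Beat, Bernstein, DOE, Doe, Doe, doe

-- Ordering that ignores case: fold both sort keys to lower case first.
-- Rows that differ only in case (henry/doe and Henry/Doe) then compare
-- equal, so their relative order is not guaranteed.
select last_name, first_name
from test
order by lower(last_name), lower(first_name);
```

Note that byte-wise ordering puts upper case first, which is not the "case-sensitive" order sketched in the question; the default en_US.UTF-8 collation does neither of these, because it consults case only as a tie-breaker.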
