MySQL wrong order of UTF-8 words

When I add UTF-8 words to a table column and execute an ordered SELECT, the sort order is wrong: with DESC the order is correct, but with ASC it is wrong. How can I fix that? Let me explain with an example. Consider a MySQL table with the Slovak collation:

CREATE TABLE IF NOT EXISTS test (
   aaa varchar(255) CHARACTER SET utf8 COLLATE utf8_slovak_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_slovak_ci;

Now let's insert some values containing UTF-8 words:

INSERT INTO test (aaa) VALUES
('Leco'),
('Lečo'),
('Ledo'),
('Chovatelstvo'),
('Chovateľstvo')

The Slovak alphabet is explained here; you can see which letters come after which: http://en.wikipedia.org/wiki/Slovak_orthography

Now when I select with ordering, I expect to get the following result:

SELECT aaa FROM test ORDER BY aaa ASC
Chovatelstvo
Chovateľstvo
Leco
Lečo
Ledo

And I expect exactly the opposite order for DESC. But here is what I actually get:

SELECT aaa FROM test ORDER BY aaa ASC
Chovateľstvo
Chovatelstvo
Leco
Lečo
Ledo

and DESC:

SELECT aaa FROM test ORDER BY aaa DESC
Ledo
Lečo
Leco
Chovateľstvo
Chovatelstvo

You can see that the pair

Chovateľstvo
Chovatelstvo

is always in the same order regardless of ASC or DESC. I noticed that if I insert the rows in the opposite order, it may end up as

Chovatelstvo
Chovateľstvo

meaning that the order of the pair is reversed, but it is again the same for ASC and DESC. It is as if MySQL considered the two letters 'l' and 'ľ' equal.

I tried this with an older version of MySQL as well as the newest version of MariaDB on another server; the result is the same.

Any idea what causes this and how to fix it?

In both the utf8_slovak_ci and utf8_general_ci collations, the letter ľ and the letter l are considered the same.

You can see this by observing that this query returns true (1):

select _utf8 'Chovateľstvo' collate utf8_slovak_ci = _utf8 'Chovatelstvo'

The designers of that collation obviously believe that ľ and l belong together in the dictionary. The only collations I can find that do not do that are latin2_hungarian_ci and cp1250_czech_cs. But to use either of those, you'll have to change your character set.

If you must have them be different, you could try the utf8_bin collation. But that collation compares strings by their binary character values, so it is also entirely case sensitive.
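One way to get a deterministic order without giving up the Slovak collation is to keep it as the primary sort key and add utf8_bin only as a tie-breaker. This is a minimal sketch against the test table from the question, not a full fix for Slovak dictionary order: the tie-breaker orders by code point, which happens to put 'l' (U+006C) before 'ľ' (U+013E), but it also puts uppercase before lowercase for strings that differ only in case.

```sql
-- Primary key: the column's own collation (utf8_slovak_ci), which treats 'l' and 'ľ' as equal.
-- Secondary key: the same column compared as utf8_bin, used only to break ties.
SELECT aaa
FROM test
ORDER BY aaa ASC,
         aaa COLLATE utf8_bin ASC;

-- For the reverse order, flip both keys so the tie-breaker is reversed as well.
SELECT aaa
FROM test
ORDER BY aaa DESC,
         aaa COLLATE utf8_bin DESC;
```

With the sample rows, the ASC query should list Chovatelstvo before Chovateľstvo and the DESC query the other way round, regardless of insertion order. The COLLATE clause here applies per query, so the column definition and any indexes on it stay unchanged.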

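The way ORDER BY works is basically correct for the rules in the collation. Because the two strings compare as equal, their relative order is not determined by the sort, which is why it comes out the same for ASC and DESC.

Maybe there's a defect in the collation? You could submit a defect report to the MySQL team at https://bugs.mysql.com/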
