How can I sort an SQL query so that certain UTF-8 characters are ordered as their unaccented equivalents (e.g. É treated as E)?
I have a table of character names in a MySQL database.
I am trying to query the table and sort the rows alphabetically by name.
Some of the characters have names like "The Dagda", and the "The " needs to be ignored, so I am attempting to use:
select character_id, name from characters where is_del=0 order by trim('The ' from name)
Which seems to work...
Some of the other characters have UTF-8 characters in their names, such as "Ériu".
However, when my table is returned I now get these "É" entries listed between "A" and "B".
I.e.:
Aengus Amergin Ériu Balor Banba etc.
Preservation of these UTF-8 characters is crucially important on the front end.
Does anyone know a method where these "É" and similar characters could be treated as "E" for sorting purposes, but still be returned in the dataset as what they actually are?
Before asking, I suspected this might not be possible, but I am hoping someone here has run into a similar problem and has a workaround.
Thanks in advance.
EDIT: changed UTF-16 to UTF-8 (my bad)
EDIT @Rick James:
I could not format this readably in a comment, but the hex from the query is as follows:
Aengus Ã“g | 41656E67757320C383E2809C67
Amergin | 416D657267696E
Ã‰riu | C383E280B0726975
Balor | 42616C6F72
Banba | 42616E6261
The 3rd item down is Ériu. I am not sure why the names are rendering as above, but this is what phpMyAdmin displays when I run the query select character_id, name, hex(name) from characters order by trim('The ' from name)
The first character's full name should be Aengus Óg. (I am assuming this is again down to character set or collation, but I am unsure, so apologies for my ignorance here.)
"Double encoding" seems to be the problem. I discuss this somewhat in "Trouble with UTF-8 characters; what I see is not what I stored".
The hex you got for the first name, split into characters, is

41 65 6E 67 75 73 20 C383 E2809C 67

Óg should be hex C393 67 in UTF-8.
Read as latin1, the bytes C3 93 67 are Ã“g.
Encoding that as UTF-8 a second time gives C383 E2809C 67.
CONVERT(BINARY(CONVERT('Aengus Ã“g' USING latin1))
        USING utf8mb4) --> 'Aengus Óg'
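The byte arithmetic above can be checked outside MySQL. The following is a minimal Python sketch, not part of the original answer; it assumes Python's cp1252 codec is a fair stand-in for MySQL's latin1 for these particular bytes, and the helper names are made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Simulate the bug: UTF-8 bytes are misread as cp1252 text,
    and that misread text is encoded as UTF-8 a second time."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("cp1252")  # assumption: cp1252 ~ MySQL latin1
    return misread.encode("utf-8")


def spaced_hex(data: bytes) -> str:
    """Format bytes as upper-case hex pairs separated by spaces."""
    return data.hex(" ").upper()


print(spaced_hex("Óg".encode("utf-8")))   # C3 93 67
print(spaced_hex(double_encode("Óg")))    # C3 83 E2 80 9C 67
print(double_encode("Aengus Óg").hex().upper())
# 41656E67757320C383E2809C67, the value HEX(name) returned in the question
```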
This seems to be "double encoding":
CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C383E280B0726975') USING utf8mb4) USING latin1)) USING utf8mb4) --> 'Ériu'
with Ã‰riu as an intermediate step. This explains why it sorted with the A's: the stored string actually begins with Ã, which an accent-insensitive collation compares as A.
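To see both halves of that claim at once, here is a small Python sketch (an illustration, not the answer's own method). It assumes cp1252 approximates MySQL's latin1, and it uses Unicode NFD decomposition as a rough proxy for how an accent-insensitive collation picks a base letter; the source does not say which collation the table actually uses.

```python
import unicodedata


def repair(stored: bytes) -> str:
    """Undo one round of double encoding, mirroring the nested CONVERT calls."""
    misread = stored.decode("utf-8")                # text as MySQL sees it
    return misread.encode("cp1252").decode("utf-8")  # assumption: cp1252 ~ latin1


def base_letter(ch: str) -> str:
    """First code point of the NFD form, i.e. the letter without its accent."""
    return unicodedata.normalize("NFD", ch)[0]


stored = bytes.fromhex("C383E280B0726975")
as_stored = stored.decode("utf-8")

print(as_stored)                       # Ã‰riu
print(base_letter(as_stored[0]))       # A  (so it sorts among the A's)
print(repair(stored))                  # Ériu
print(base_letter(repair(stored)[0]))  # E  (where it would sort once repaired)
```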
This is a common problem. It often goes unnoticed because browsers "fix" the mess.
Experiment with SELECTs against the table. If the first one works for you, then it is just Mojibake.
SELECT CONVERT(BINARY(CONVERT(my_column USING latin1))
USING utf8mb4)
FROM ... WHERE ...;
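If it is easier to experiment on a handful of exported values than on the live table, the same test can be sketched in Python. This is only an approximation under the assumption that cp1252 matches MySQL's latin1 for the characters involved; `try_repair` is a hypothetical helper name.

```python
from typing import Optional


def try_repair(value: str) -> Optional[str]:
    """Return the repaired text if `value` looks double-encoded, else None.

    A value that is already correct either fails the round trip
    or comes back unchanged, and is left alone.
    """
    try:
        fixed = value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    return fixed if fixed != value else None


for name in ["Ã‰riu", "Ériu", "Balor"]:
    print(name, "->", try_repair(name))
# Ã‰riu -> Ériu
# Ériu -> None
# Balor -> None
```

Values that come back as None would not need the fix, which is a useful thing to know if the table might hold a mix of correct and double-encoded rows.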
Read that other Q&A to see what steps went wrong to cause the problem. It likely involves storing UTF-8 characters in a column declared latin1.
ALTER TABLE ... CONVERT TO ... assumes that the data is correctly stored. But it wasn't. Now you have the CHARACTER SET correctly set on the columns, but the data in them has been Mojibaked. So, it needs something like
UPDATE tbl SET
col1 = CONVERT(BINARY(CONVERT(col1 USING latin1))
USING utf8mb4),
col2 = CONVERT(BINARY(CONVERT(col2 USING latin1))
USING utf8mb4),
...
;
More on the fix: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Rollback? If you are more comfortable rolling back to before the CONVERT TO, then ignore most of what I said above; you need the 2-step ALTER after the rollback. (See that blog link.)