MySQL select UTF-8 string with '=' but not with 'LIKE'

I have a table with some words that come from medieval books and contain accented letters that no longer exist in the modern Latin alphabet. I can represent these letters easily with UTF-8 combining characters. For example, to create a "J" with a tilde, I use the sequence \u004A + \u0303, and the J is rendered with a tilde.

The table uses utf8 encoding and the field collation is utf8_unicode_ci.

My problem is the following: if I select by the entire string with '=', I get the correct answer. If I select using 'LIKE', I get the wrong answer.

For example:

mysql> select word, hex(word) from oldword where word = 'hua';
+--------+--------------+
| word   | hex(word)    |
+--------+--------------+
| hũa    | 6875CC8361   |
| huã    | 6875C3A3     |
| hua    | 687561       |
| hũã    | 6875CC83C3A3 |
+--------+--------------+
4 rows in set (0,04 sec)

mysql> select word, hex(word) from oldword where word like 'hua';
+-------+------------+
| word  | hex(word)  |
+-------+------------+
| huã   | 6875C3A3   |
| hua   | 687561     |
+-------+------------+
2 rows in set (0,04 sec)
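The hex column shows how each tilde is stored, and the two rows that LIKE misses are exactly the ones containing CC83. As a minimal sketch outside MySQL (Python is used here only for illustration, and the variable names are made up), the four byte sequences can be reproduced like this:

```python
# Reproduce the HEX(word) column from the four rows above.
# U+0303 is the combining tilde; U+00E3 is the precomposed "a with tilde".
words = [
    "hu\u0303a",        # u + combining tilde, then a
    "hu\u00e3",         # precomposed a-tilde
    "hua",              # plain ASCII
    "hu\u0303\u00e3",   # combining tilde on u, precomposed a-tilde
]

hex_values = [w.encode("utf-8").hex().upper() for w in words]
for w, h in zip(words, hex_values):
    print(w, h)
```

The precomposed ã (C3A3) is a single code point, while ũ here is two code points (75 followed by CC83).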

I don't want to search only for the entire word. I want to search for words that start with some substring. Occasionally the search term is the entire word.

How can I select by a partial string using LIKE and still match all of these strings?

I tried to create a custom collation using this information, but the server became unstable. Only after a lot of trial and error was I able to revert to the utf8_unicode_ci collation, and the server returned to normal.

EDIT: There's a problem with this site and some characters don't display correctly. Please see the results on these pastebins:

http://pastebin.com/mckJTLFX

http://pastebin.com/WP87QvgB

After seeing Marcus Adams' answer I realized that the REPLACE function could be the solution for this problem, although he didn't mention this function.

I have only two different combining characters (acute and tilde), combined with other ASCII characters: for example j with tilde, j with acute, m with tilde, s with tilde, and so on. So I just have to remove these two characters when using LIKE.

After searching the manual, I learned about the UNHEX function, which helped me represent the combining characters on their own in the query so that I could remove them.

The combining tilde is represented by CC83 in hex and the combining acute by CC81.

So, the query that solves my problem is this one:

SELECT word, REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
FROM oldword WHERE REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "") 
LIKE 'hua%';
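To make the effect of the nested REPLACE() easier to see, here is a rough Python analogue. It is only a sketch: `strip_marks` is a hypothetical helper, not something MySQL provides, and `bytes.fromhex` plays the role of UNHEX().

```python
# The two combining marks named above, built from their UTF-8 hex bytes
# in the same spirit as UNHEX("CC83") and UNHEX("CC81").
COMBINING_TILDE = bytes.fromhex("CC83").decode("utf-8")   # U+0303
COMBINING_ACUTE = bytes.fromhex("CC81").decode("utf-8")   # U+0301

def strip_marks(word: str) -> str:
    """Hypothetical helper: remove both combining marks, like the nested REPLACE()."""
    return word.replace(COMBINING_TILDE, "").replace(COMBINING_ACUTE, "")

stripped = [strip_marks(w) for w in ["hu\u0303a", "hu\u0303\u00e3", "j\u0301"]]
print(stripped)
```

Note that a precomposed ã survives the stripping, since it is a single code point rather than a letter plus a combining mark. In the output above, LIKE 'hua' already matched huã, so only the combining forms need to be removed.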

The problem is that LIKE performs the comparison character-by-character, and a letter written with the combining tilde is literally two characters, though it displays as one (assuming your client supports displaying it as such).

There will never be a case where comparing e.g. hu~a to hua character-by-character matches, because the third character compares ~ with a.

Collations (and coercions) work in your favor and handle such things when comparing the string as a whole, but not when comparing character-by-character.
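The "two characters displayed as one" point can be checked outside MySQL. The following is a small Python sketch (not part of the original answer) that lists the code points of the word as stored with a combining tilde:

```python
import unicodedata

word = "hu\u0303a"   # displays as one accented u between h and a

# One name per code point; the tilde is a separate third character.
names = [unicodedata.name(ch) for ch in word]
print(len(word))
print(names)
```

The word has four code points, and the third one is the combining tilde, which is what a character-by-character comparison lines up against the a of hua.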

Even if you considered using SUBSTRING() as a hack instead of using LIKE with a wildcard % to perform a prefix search, consider the following:

SELECT SUBSTRING('hũa', 1, 3) = 'hua'
-> 0
SELECT SUBSTRING('hũa', 1, 4) = 'hua'
-> 1

You more or less have to know the length you're going for, or brute force it like this:

SELECT * FROM oldword
WHERE SUBSTRING(word, 1, 3) = 'hua'
   OR SUBSTRING(word, 1, 4) = 'hua'
   OR SUBSTRING(word, 1, 5) = 'hua'
   OR SUBSTRING(word, 1, 6) = 'hua'
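The reason the prefix length matters can be modelled roughly in Python. This is a sketch under a loud assumption: `fold` is a hypothetical stand-in for an accent-insensitive whole-string comparison (decompose, drop combining marks, lowercase). It is not how utf8_unicode_ci is actually implemented.

```python
import unicodedata

def fold(text: str) -> str:
    """Hypothetical accent-insensitive key: NFD-decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()

word = "hu\u0303a"   # h, u, combining tilde, a

match_3 = fold(word[:3]) == "hua"   # first 3 code points: h, u, combining tilde
match_4 = fold(word[:4]) == "hua"   # first 4 code points: the whole word
print(match_3, match_4)
```

Under this model a three-character prefix stops at the combining tilde and never reaches the a, which mirrors the 0 and 1 results of the two SUBSTRING() queries.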

According to this:

ũ collates equal to plain U in all utf8 collations on 5.6.

J plus a combining accent (such as j́) collates equal to plain J in most collations; exceptions:

  • utf8_general*ci, because it is actually j plus an accent, and the "general" collations only look at one character (as distinguished from byte) at a time. Most collations take multiple characters into consideration, such as ch or ll in Spanish or ss in German.
  • utf8_roman_ci, which is quite an oddball: j́=i=j

(LIKE does not exactly follow the regular rules of collation. I am not versed in the details, but I think the fact that the accented J is represented as 2 characters causes it to work differently in LIKE than in WHERE or ORDER BY. Furthermore, I don't know whether REPLACE() collates like LIKE or like the other places.)

You can use the % symbol as a wildcard character. For example:

SELECT word
FROM myTable
WHERE word LIKE 'hua%';

This will pull all records that start with hua and have zero or more characters following it. Here is an SQL Fiddle example.
