MySQL select UTF-8 string with '=' but not with 'LIKE'

I have a table with some words that come from medieval books and contain accented letters that no longer exist in the modern latin1 alphabet. I can represent these letters easily with UTF-8 combining characters. For example, to create a "J" with a tilde, I use the letter J followed by the combining tilde (U+0303), and the J is displayed with a tilde.

The table uses utf8 encoding and the field collation is utf8_unicode_ci.

My problem is the following: If I try to select the entire string, I receive the correct answer. If I try to select using 'LIKE', I receive the wrong answer.

For example:

mysql> select word, hex(word) from oldword where word = 'hua';
+--------+--------------+
| word   | hex(word)    |
+--------+--------------+
| hũa    | 6875CC8361   |
| huã    | 6875C3A3     |
| hua    | 687561       |
| hũã    | 6875CC83C3A3 |
+--------+--------------+
4 rows in set (0,04 sec)

mysql> select word, hex(word) from oldword where word like 'hua';
+-------+------------+
| word  | hex(word)  |
+-------+------------+
| huã   | 6875C3A3   |
| hua   | 687561     |
+-------+------------+
2 rows in set (0,04 sec)

I don't want to search only for the entire word. I want to search for words that start with some substring; sometimes the searched string happens to be the entire word.

How can I select by partial string using LIKE and still match all of these strings?

I tried to create a custom collation using this information, but the server became unstable, and only after a lot of trial and error was I able to revert to the utf8_unicode_ci collation and get the server back to normal.

EDIT: There's a problem with this site and some characters don't display correctly. Please see the results on these pastebins:

http://pastebin.com/mckJTLFX

http://pastebin.com/WP87QvgB

After seeing Marcus Adams' answer I realized that the REPLACE function could be the solution to this problem, although he didn't mention this function.

I have only two different combining characters (acute and tilde), each combined with other ASCII characters: for example j with tilde, j with acute, m with tilde, s with tilde, and so on. So I just have to remove these two characters when using LIKE.

After searching the manual, I learned about the UNHEX function that helped me to properly represent the combining characters alone in the query to remove them.

The combining tilde is represented by CC83 in HEX code and the acute is represented by CC81 in HEX.

So, the query that solves my problem is this one.

SELECT word, REPLACE(REPLACE(word, UNHEX('CC83'), ''), UNHEX('CC81'), '')
FROM oldword
WHERE REPLACE(REPLACE(word, UNHEX('CC83'), ''), UNHEX('CC81'), '') LIKE 'hua%';
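
As a cross-check of the byte values above, here is a minimal Python sketch. Python is used only for illustration and is not part of the original answer. It encodes the two combining marks to UTF-8 and mimics the nested REPLACE on the four sample words from the question, rebuilt from their hex dumps. The helper name `strip_marks` is hypothetical.

```python
# Combining marks used in the table (code points, not bytes).
COMBINING_TILDE = "\u0303"
COMBINING_ACUTE = "\u0301"


def strip_marks(word: str) -> str:
    """Mimic REPLACE(REPLACE(word, UNHEX('CC83'), ''), UNHEX('CC81'), '')."""
    return word.replace(COMBINING_TILDE, "").replace(COMBINING_ACUTE, "")


# UTF-8 encodings of the two marks, shown as upper-case hex like HEX() does.
tilde_hex = COMBINING_TILDE.encode("utf-8").hex().upper()
acute_hex = COMBINING_ACUTE.encode("utf-8").hex().upper()

# The four rows from the question, rebuilt from their hex(word) values.
hex_rows = ["6875CC8361", "6875C3A3", "687561", "6875CC83C3A3"]
rows = [bytes.fromhex(h).decode("utf-8") for h in hex_rows]

# Only the combining marks are removed; the precomposed "ã" (C3A3) is kept.
stripped = [strip_marks(w) for w in rows]

print(tilde_hex, acute_hex)
print(stripped)
```

The precomposed ã (C3A3) is a single character and is left untouched. In the question's output LIKE already matched it under utf8_unicode_ci, so only the combining marks need removing.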

The problem is that LIKE performs the comparison character by character, and a letter followed by the combining tilde is literally two characters, though it displays as one (assuming your client supports displaying it as such).

There will never be a case where comparing e.g. hu~a to hua character by character will match, because for the third character it compares ~ with a.

Collations (and coercions) work in your favor and handle such things when comparing the string as a whole, but not when comparing character-by-character.
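
The mismatch can be made visible outside MySQL. The following Python sketch is an illustration only: it does not reproduce MySQL's LIKE implementation. It pairs up the code points of the decomposed word and the plain search string the way a naive character-by-character comparison would.

```python
import unicodedata

decomposed = "hu\u0303a"  # h, u, COMBINING TILDE, a  (hex 6875CC8361)
plain = "hua"

# The decomposed word has one more code point than it appears to have.
lengths = (len(decomposed), len(plain))

# Pair the strings position by position, as a naive comparison would.
pairs = list(zip(decomposed, plain))
third = pairs[2]  # the combining tilde is lined up against "a"

# Unicode names of each code point in the decomposed word.
names = [unicodedata.name(c) for c in decomposed]

for left, right in pairs:
    print(repr(left), repr(right), left == right)
```

At the third position the combining tilde is compared with "a", so the comparison fails even though a whole-string comparison under the collation treats the two words as equal.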

Even if you considered using SUBSTRING() as a hack instead of using LIKE with a wildcard % to perform a prefix search, consider the following:

SELECT SUBSTRING('hũa', 1, 3) = 'hua'
-> 0
SELECT SUBSTRING('hũa', 1, 4) = 'hua'
-> 1

You kind of have to know the length you're going for or brute force it like this:

SELECT * FROM oldword
WHERE SUBSTRING(word, 1, 3) = 'hua'
   OR SUBSTRING(word, 1, 4) = 'hua'
   OR SUBSTRING(word, 1, 5) = 'hua'
   OR SUBSTRING(word, 1, 6) = 'hua'
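
The brute force is needed because the number of characters covered by the prefix depends on how many combining marks it contains. For comparison, here is a client-side Python sketch of a prefix test that ignores combining marks regardless of length. It only approximates accent-insensitive matching and is not equivalent to utf8_unicode_ci. The function names `base_letters` and `starts_with_ignoring_marks` are hypothetical.

```python
import unicodedata


def base_letters(text: str) -> str:
    """Decompose to NFD and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def starts_with_ignoring_marks(word: str, prefix: str) -> bool:
    """Case-insensitive prefix test on the mark-free forms."""
    return base_letters(word).casefold().startswith(base_letters(prefix).casefold())


# The four words from the question plus one non-matching control word.
words = ["hu\u0303a", "hu\u00e3", "hua", "hu\u0303\u00e3", "huma"]
matches = [w for w in words if starts_with_ignoring_marks(w, "hua")]

print(matches)
```

Filtering in the client means fetching candidate rows first, so this is a sketch of the idea rather than a replacement for doing the match in SQL.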

According to this :

ũ collates equal to plain U in all utf8 collations on 5.6.

j́ (j plus a combining accent) collates equal to plain J in most collations; exceptions:

  • utf8_general*_ci, because it is actually j plus an accent, and the "general" collations only look at one character (as distinguished from byte) at a time. Most collations take into consideration multiple characters, such as ch or ll in Spanish or ss in German.
  • utf8_roman_ci, which is quite an oddball: j́=i=j

(LIKE does not exactly follow the regular rules of collation. I am not versed in the details, but I think the fact that the accented J is represented as 2 characters causes it to work differently in LIKE than in WHERE or ORDER BY. Furthermore, I don't know whether REPLACE() collates like LIKE or like the other places.)

You can use the % symbol as a wildcard character. For example:

SELECT word
FROM myTable
WHERE word LIKE 'hua%';

This will pull all records that start with hua and have 0+ characters following it. Here is an SQL Fiddle example.
