MySQL accent-insensitive and dotted-character-insensitive search

The Problem: I am trying to implement a search that still returns results when dotted chars are provided. In other words, SELECT 'über' = 'uber' or SELECT 'mas' = 'maş' should return true. This would apply to every single char in the following array:

$arr = array('ş' => 's', 'ç' => 'c', 'ö' => 'o', 'ü' => 'u' /* and so on ... */);

The Solution In My Mind: Along with the original column, I can have a separate column that stores the names in English characters. So before storing 'über' in the database, I would also convert it to 'uber' in PHP and then store both 'über' (as the original) and 'uber' (as the searchable value) in the database.
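A minimal sketch of that conversion step, written in Python rather than PHP purely for illustration. The function name `fold_diacritics` and the `EXTRA_FOLDS` table are hypothetical, and the sketch assumes that Unicode decomposition covers most of the letters in the array above, with a manual entry for letters such as the dotless 'ı' that have no decomposition.

```python
import unicodedata

# Hypothetical manual map for letters that Unicode does not decompose,
# so normalization alone would leave them unchanged.
EXTRA_FOLDS = {"ı": "i", "ø": "o"}


def fold_diacritics(text: str) -> str:
    """Return a searchable copy of text with accents, dots and cedillas removed."""
    out = []
    for ch in text:
        if ch in EXTRA_FOLDS:
            out.append(EXTRA_FOLDS[ch])
            continue
        # NFKD splits e.g. 'ü' into 'u' plus a combining diaeresis;
        # keep the base letter and drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    for name in ("über", "maş", "Agüeda"):
        # The original goes into the display column, the folded form into the search column.
        print(name, "->", fold_diacritics(name))
```

The search term would be folded with the same function before it is compared against the searchable column, so both sides of the comparison use the same alphabet.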

But even though I've searched for this the whole day, I still believe there should be a simpler and cleaner way to accomplish the task, since this means (more or less) storing the same data twice in the database. So guys, do you think this solution is the only way to go, or do you know a better approach?

EDIT

For accent-insensitive search I've seen the posts on SO, and they work, but since I am also considering the dotted chars, I had to ask this question.

EDIT2

For some reasons I cannot post the exact table structure and code, but I'll provide a close example.

CREATE TABLE `myusers` (
  id int auto_increment not null,
  email varchar(100) COLLATE latin1_general_ci not null,
  fullname varchar(75) COLLATE latin1_general_ci not null,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;

The above is the structure of the table. Here are the inserts and selects:

INSERT INTO myusers (fullname) VALUES ('Agüeda');
INSERT INTO myusers (fullname) VALUES ('Agueda');

SELECT * FROM myusers WHERE fullname = 'Agüeda' COLLATE latin1_general_ci;

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  1 |       | Agüeda   |
+----+-------+----------+
1 row in set (0.00 sec)

SELECT * FROM myusers WHERE fullname = 'agueda' COLLATE latin1_general_ci;

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  2 |       | Agueda   |
+----+-------+----------+
1 row in set (0.00 sec)

Well, the desired result is obviously that when agueda is searched, both 'Agueda' and 'Agüeda' are returned, but that's not the case. As I mentioned above, I have created a new column that stores the whole name in English characters, and I search there as well. But it still costs me two searches (because I am also searching the original column, whose matches rank higher in the search results). There should be a better way...

Just use an appropriate collation. For instance:

create table test(
    foo text
) collate = utf8_unicode_ci;
insert into test values('Agüeda');
insert into test values('Agueda');
select * from test where foo = 'Agueda';

This gives you both rows, because utf8_unicode_ci compares 'ü' and 'u' as equal.

1) Write your own collation, say latin1_general_diacriticinsensitive. I wouldn't even know where to begin, though :).

2) Use regex and character groups: /[uü]ber/

3) The Solution In Your Mind. I'd personally use this, since design is all about compromise and this is a simple solution with just a 100% space overhead. Granted, the space overhead might eventually turn into a speed overhead, especially with MySQL, but that's something to worry about later. This is also very easy to undo if need be.
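Option 2 could be sketched roughly as follows, in Python for illustration. The `VARIANTS` table and the function name `to_char_class_pattern` are hypothetical, and the sketch assumes every base letter has at most a few dotted or accented forms. The resulting string would then be passed to MySQL's REGEXP operator as a bound parameter. Note that MySQL versions before 8.0 match REGEXP byte-wise, so character groups containing multi-byte characters may not behave as expected there.

```python
import re

# Hypothetical table: base letter -> dotted/accented forms that should also match.
VARIANTS = {"s": "ş", "c": "ç", "o": "ö", "u": "ü", "i": "ı", "g": "ğ"}
# Reverse lookup so that a term typed with diacritics yields the same pattern.
BASE_OF = {v: base for base, forms in VARIANTS.items() for v in forms}
# Characters with a special meaning in a regular expression.
REGEX_META = set(".^$*+?()[]{}|\\")


def to_char_class_pattern(term: str) -> str:
    """Turn 'uber' or 'über' into '[uü]ber' (the term is lowercased)."""
    parts = []
    for ch in term.lower():
        base = BASE_OF.get(ch, ch)
        if base in VARIANTS:
            parts.append("[" + base + VARIANTS[base] + "]")
        elif ch in REGEX_META:
            parts.append("\\" + ch)  # match metacharacters literally
        else:
            parts.append(ch)
    return "".join(parts)


if __name__ == "__main__":
    pattern = to_char_class_pattern("uber")
    print(pattern)
    # Sanity check with Python's own regex engine, not MySQL's.
    print(bool(re.fullmatch(pattern, "über")), bool(re.fullmatch(pattern, "uber")))
```

A regex comparison generally cannot use an ordinary index on the column, which is one reason the extra searchable column in option 3 may still be the more practical choice.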

Well, instead of trying to replace them and running the search x times, I'd suggest using the MySQL LIKE operator, i.e.

SELECT * FROM x WHERE search LIKE '%ber'

where you replace the diacritics with "%".

EDIT: My mistake, % matches any number of characters. Use _ for a single char.
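A rough sketch of building such a LIKE pattern, in Python for illustration. The `AMBIGUOUS` set and the function name `to_like_pattern` are hypothetical. The sketch assumes that every letter which either is, or could be typed instead of, a dotted character gets replaced by `_`, and that MySQL's default LIKE escape character (the backslash) is in effect.

```python
# Hypothetical set: each base letter together with its dotted/accented form.
AMBIGUOUS = set("sşcçoöuüiıgğ")


def to_like_pattern(term: str) -> str:
    """Replace ambiguous letters with the single-character wildcard '_'."""
    out = []
    for ch in term:
        if ch.lower() in AMBIGUOUS:
            out.append("_")        # '_' matches exactly one character in LIKE
        elif ch in "%_\\":
            out.append("\\" + ch)  # keep literal %, _ and \ from acting as wildcards
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for term in ("über", "uber", "maş"):
        print(term, "->", to_like_pattern(term))
```

The trade-off is precision: `_ber` matches 'über' and 'uber', but also unrelated words such as 'aber', so the results may need a second filtering step.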

Take a look at this post: https://stackoverflow.com/questions/500826

He has just the opposite of the issue you're facing. Look at the WHERE clause in the selected answer. Probably you could just use a collation with the _ci suffix and it'll work.

Let us know how this is resolved.
