Accent sensitive FULL TEXT search (MySQL)

Hopefully I just can't see the forest for the trees, but my full-text search behaves very strangely and I cannot solve this by myself. (I searched for a solution but have had no luck so far, so any help is greatly appreciated.)

So my problem is: if I search for "tök" (which means "pumpkin" in Hungarian), the list also contains results with "tok" (which means "case"). If I search for a pumpkin, I clearly don't want a phone case or similar things.

My system is MySQL; every table is InnoDB with the utf8_general_ci collation.

This is the (simplified) query:

SELECT id_item, item_title, tag_name,
       MATCH (item_title) AGAINST ('tök' IN NATURAL LANGUAGE MODE) AS title_relevance,
       MATCH (tag_name)   AGAINST ('tök' IN NATURAL LANGUAGE MODE) AS tag_relevance
FROM item_translations
WHERE NULL IS NULL
  AND (   MATCH (item_title) AGAINST ('+tök' IN NATURAL LANGUAGE MODE)
       OR MATCH (tag_name)   AGAINST ('+tök' IN NATURAL LANGUAGE MODE) )
  AND id_language = 1
ORDER BY title_relevance DESC, tag_relevance DESC
LIMIT 0, 40

PS: The keywords are not always in Hungarian because this website is multilingual, so I need a relatively flexible solution that works with most accented letters (if that is possible).

Equality in a string comparison is specified by the collation. utf8_general_ci treats every letter like its (Latin) base character. You need to specify a collation that treats the accents and umlauts you care about as distinct.

The collation includes the language specifics. E.g. for Spanish, n < ñ < o (while n = ñ for basically every other language); for Swedish you have Y = Ü; for German (and most collations) there is ß = ss; and for Hungarian (and many other collations) you have o < ö.
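To see how the collation decides whether "tök" and "tok" are equal, you can compare the two literals directly under different collations. This is only a quick sketch: the `_utf8` introducer is used so that the statement does not depend on the connection character set, and the column aliases are made up for illustration.

```sql
-- Compare the same two words under three collations.
-- 1 means "treated as equal", 0 means "treated as different".
SELECT
  -- general_ci folds ö to o, which is the behaviour seen in the question
  _utf8'tök' = _utf8'tok' COLLATE utf8_general_ci   AS equal_general_ci,
  -- the Hungarian collation sorts ö after o, so these should differ
  _utf8'tök' = _utf8'tok' COLLATE utf8_hungarian_ci AS equal_hungarian_ci,
  -- a binary collation compares the encoded bytes, so these differ
  _utf8'tök' = _utf8'tok' COLLATE utf8_bin          AS equal_bin;
```

Running the same comparison with other word pairs from your own data is a cheap way to check a candidate collation before changing any column.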

So for a Hungarian site, you may want to choose utf8_hungarian_ci, and if your software is localizable to a specific language (and audience), you may want to adjust the collation of that column or let the administrator choose one. Unfortunately, for a fulltext search (in contrast to other string comparisons such as = or ORDER BY), you cannot specify a collation in the query on the fly, so you need to choose a single one.
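Since the collation has to be fixed on the column, switching it could look roughly like the sketch below. The column types (`VARCHAR(255) NOT NULL`) are assumptions, because the question does not show the table definition; `MODIFY` needs the full column definition, so substitute your real types and attributes.

```sql
-- Assumed column definitions; adjust to match the real table.
ALTER TABLE item_translations
  MODIFY item_title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_hungarian_ci NOT NULL,
  MODIFY tag_name   VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_hungarian_ci NOT NULL;

-- The full-text query itself stays unchanged; it uses the column's collation.
SELECT id_item, item_title
FROM item_translations
WHERE MATCH (item_title) AGAINST ('tök' IN NATURAL LANGUAGE MODE);
```

Changing the collation should rebuild the table together with its FULLTEXT indexes. If the results do not change afterwards, dropping and recreating the FULLTEXT index is a safe way to force a rebuild.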

On a general multilanguage site, most users will probably expect a search to fit a very general English/Russian/Chinese schema, and would not be surprised if they find tök when entering tok. They might even be irritated not to get those results, especially if they do not have an ö on their keyboard and actually want to buy a pumpkin (and know the Hungarian word for it). Most search engines will actually try not to be too narrow: they want to find café when you enter cafe, and oftentimes put some work into being able to find café when you enter coffee, caffé or cafée.

There is no language-specific collation that treats every accent and umlaut as distinct, though. If you really want to distinguish every single special character, you may want to try utf8_bin (although I am not sure if I would call it the most flexible option). It is important to note that it is case sensitive, and this applies to the fulltext search as well: a fulltext search is case insensitive by default, but with a case-sensitive or binary collation on the indexed column it becomes case sensitive. Other string comparisons on this column (e.g. LIKE) become case sensitive too, which can be problematic. Also, you will lose language-specific behaviour, e.g. Y = Ü or ß = ss (unless you implement it yourself).
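A rough sketch of the utf8_bin variant follows. As before, the column type is an assumption, since the real table definition is not shown. The second statement illustrates that ordinary comparisons such as LIKE, unlike MATCH ... AGAINST, can still be given a different collation per query.

```sql
-- Assumed column definition; adjust to match the real table.
ALTER TABLE item_translations
  MODIFY item_title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- For non-fulltext comparisons, the column collation can be overridden
-- on the fly, e.g. to get an accent- and case-insensitive LIKE again.
SELECT id_item, item_title
FROM item_translations
WHERE item_title COLLATE utf8_general_ci LIKE '%tok%';
```

Note that an explicit COLLATE on the column usually prevents an ordinary index on that column from being used for the comparison, so this is better suited to occasional queries than to hot paths.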
