Accent sensitive FULL TEXT search (MySQL)
Hopefully I just can't see the forest for the trees, but my full text search behaves very strangely and I cannot solve this by myself. (I tried to search for a solution but so far no luck, so any help is greatly appreciated.)
So my problem is: if I search for "tök" (it means "pumpkin" in Hungarian), the list also contains results with "tok" (which means "case"). If I search for a pumpkin I clearly don't want a phone case or such things.
My system is MySQL; every table is InnoDB with the utf8_general_ci collation.
This is the (simplified) query:
SELECT id_item, item_title, tag_name,
  MATCH (item_title) AGAINST ('tök' IN NATURAL LANGUAGE MODE) AS title_relevance,
  MATCH (tag_name) AGAINST ('tök' IN NATURAL LANGUAGE MODE) AS tag_relevance
FROM item_translations
WHERE NULL IS NULL
AND ( MATCH (item_title) AGAINST ('+tök' IN NATURAL LANGUAGE MODE) OR MATCH (tag_name) AGAINST ('+tök' IN NATURAL LANGUAGE MODE) )
AND id_language=1
ORDER BY title_relevance DESC, tag_relevance DESC
LIMIT 0,40
PS: The keywords are not always in Hungarian because this website is multilingual, so I need a relatively flexible solution that works with most accented letters (if possible).
Equality in a string comparison is specified by the collation. A general collation such as utf8_general_ci treats every letter like its (Latin) base character. You need to specify a collation that supports the accents and umlauts that you want to be distinct.
The collation includes the language specifics. E.g. for Spanish, n < ñ < o (while n = ñ for basically every other language), for Swedish you have Y = Ü, for German (and most collations) there is ß = ss, and for Hungarian (and many other collations) you have o < ö.
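To see how the collation decides whether two strings are equal, you can compare the same pair of literals under several collations. This is only a sketch: it assumes the connection character set is utf8 (as in the question's setup), and the column aliases are made up for illustration.

```sql
-- Assumption: the connection uses the utf8 character set, so the
-- utf8_* collations are valid for these string literals.
SET NAMES utf8;

SELECT
  -- general collation: ö is treated like its base letter o, so this is 1
  'tök' = 'tok' COLLATE utf8_general_ci   AS general_cmp,
  -- Hungarian collation: should tell o and ö apart (o < ö)
  'tök' = 'tok' COLLATE utf8_hungarian_ci AS hungarian_cmp,
  -- binary collation: compares code points, so this is 0
  'tök' = 'tok' COLLATE utf8_bin          AS binary_cmp;
```

Running the same kind of comparison with the characters you care about (ñ, ß, Ü, ...) is a quick way to check a candidate collation before changing any column.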
So for a Hungarian site, you may want to choose utf8_hungarian_ci, and if your software is localizable to a specific language (and audience), you may want to adjust the collation of that column or let the administrator choose one. Unfortunately, for a fulltext search (in contrast to other string comparisons like = or ORDER BY), you cannot specify a collation in the query on the fly, so you need to choose a single one.
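Since the collation for a fulltext search comes from the indexed column, switching it means altering the column itself. A minimal sketch, assuming both searched columns are VARCHAR(255) (the real column types are not shown in the question, so adjust them and restate any NOT NULL or DEFAULT clauses you have):

```sql
-- Assumption: item_title and tag_name are VARCHAR(255) columns of
-- item_translations; the actual definitions may differ.
ALTER TABLE item_translations
  MODIFY item_title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_hungarian_ci,
  MODIFY tag_name   VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_hungarian_ci;

-- Check which collation each column now uses.
SHOW FULL COLUMNS FROM item_translations;
```

MySQL rebuilds the table for a change like this, so the FULLTEXT indexes should pick up the new collation. It is still worth rerunning the tök search afterwards to confirm that tok no longer matches.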
On a general multilanguage site, most users will probably expect a search to fit a very general English/Russian/Chinese schema, and would not be surprised if they find tök when entering tok. They might even be irritated not to get those results, especially if they do not have an ö on their keyboard and actually want to buy a pumpkin (and know the Hungarian word for it). Most search engines actually try not to be too narrow: they want to find café when you enter cafe, and often put some work into being able to find café when you enter coffee, caffé or cafée.
There is no language-specific collation that handles every accent and umlaut as distinct, though. If you really want to distinguish every single special character, you may want to try utf8_bin (although I am not sure I would call it the most flexible option). It is important to note that it is case sensitive. A fulltext search is case insensitive only by default; according to the MySQL manual, a binary collation on the indexed column makes the fulltext search case sensitive as well, so you may have to lowercase both the stored text and the search terms. Other string comparisons on this column (e.g. LIKE) become case sensitive too, which can be problematic. Also, you will lose language-specific behaviour, e.g. Y = Ü or ß = ss (unless you implement it yourself).
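As a rough sketch of the utf8_bin route: the column is switched to the binary collation, and ordinary comparisons such as LIKE, which (unlike MATCH ... AGAINST) accept a collation on the fly, get a COLLATE clause where accent- and case-insensitive matching is still wanted. The VARCHAR(255) type is again an assumption.

```sql
-- Assumption: item_title is a VARCHAR(255) column; adjust to your schema.
ALTER TABLE item_translations
  MODIFY item_title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin;

-- With utf8_bin on the column, a plain LIKE distinguishes both
-- accents and case, so this only finds titles containing exactly 'tök'.
SELECT id_item, item_title
FROM item_translations
WHERE item_title LIKE '%tök%';

-- For non-fulltext comparisons the collation can be overridden per query,
-- here falling back to the lenient general collation.
SELECT id_item, item_title
FROM item_translations
WHERE item_title COLLATE utf8_general_ci LIKE '%tok%';
```

The override only helps for operators like =, LIKE and ORDER BY. The fulltext index itself keeps following the column's collation.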