
How to do an accent- and case-insensitive search in the MediaWiki database?

Let's pretend that I have these page titles in my wiki (MediaWiki 1.19.4):

SOMETHIng
Sómethìng
SomêthÏng
SÒmetHínG

If a user searches for something, I want all 4 pages to be returned as the result.

At the moment the only thing I can think of is this query (MySQL Percona 5.5.30-30.2):

SELECT page_title
FROM page
WHERE page_title LIKE '%something%' COLLATE utf8_general_ci

This only returns SOMETHIng.

I must be on the right path, because if I search for sóméthíng or SÓMÉTHÍNG, I get SOMETHIng as the result. How could I modify the query so that I get the other results as expected? Performance is not critical here since the page table contains only ~2K rows.

This is the table definition with the relevant bits:

CREATE TABLE page (
    (...)
    page_title VARCHAR(255) NOT NULL DEFAULT '' COLLATE latin1_bin,
    (...)
    UNIQUE INDEX name_title (page_namespace, page_title),
)

The table definition must not be modified, since this is a stock installation of MediaWiki and AFAIK its code expects this field to be defined that way (i.e. Unicode stored as binary data).

The MediaWiki TitleKey extension is basically designed for this, but it only does case-folding. However, if you don't mind hacking it a bit, and have the PHP iconv extension installed, you could edit TitleKey_body.php and replace the method:

static function normalize( $text ) {
    global $wgContLang;
    return $wgContLang->caseFold( $text );
}

with e.g.:

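static function normalize( $text ) {
    // Transliterate to plain ASCII, then uppercase. The exact //TRANSLIT
    // output depends on the iconv implementation and the current locale.
    return strtoupper( iconv( 'UTF-8', 'US-ASCII//TRANSLIT', $text ) );
}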

and (re)run rebuildTitleKeys.php.

The TitleKey extension stores its normalized titles in a separate table, surprisingly named titlekey. It's intended to be accessed through the MediaWiki search interface, but if you want, you can certainly query it directly too, e.g. like this:

SELECT page.* FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'SOMETHING';
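
Since the question used a substring match, the same idea might be applied to titlekey. This is only a sketch: it assumes the patched normalize() has been applied and the keys rebuilt, so that tk_key holds uppercase ASCII, and that the caller normalizes the search term the same way before building the pattern.

```sql
-- Sketch: substring search against the normalized keys.
-- Assumes tk_key was rebuilt with the patched normalize(), so the pattern
-- must already be uppercase ASCII ('SOMETHING', not 'sómething').
SELECT page.page_id, page.page_title
FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0
  AND tk_key LIKE '%SOMETHING%';
```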

I found the perfect solution, with no modifying or creating of tables. It might have performance implications (I didn't test), but as I stated in my question, it's a ~2K-row table so it shouldn't matter much.

The root of the problem is that MediaWiki stores UTF8-encoded text in latin1-encoded tables. It doesn't matter much to MediaWiki since it's aware of it: it always queries the database with the correct charset and does its thing, essentially using MySQL as a dumb bit container. It does this because apparently UTF8 support in MySQL is not adequate for its needs (see the comments in MediaWiki's DefaultSettings.php, variable $wgDBmysql5).

The problem appears when you want the database itself to be able to perform UTF8-aware operations (like I wanted to do in my question). You won't be able to do that because, as far as MySQL knows, it's not storing UTF8-encoded text (although it is, as explained in the previous paragraph).

There's an obvious solution for this: cast the column you want to use to UTF8, with something like CONVERT(col_name USING utf8). The problem here is that MySQL is trying to be dangerously helpful: it thinks that col_name is storing latin1-encoded text, so it will translate (not encode) each byte into its UTF8 equivalent, and you will end up with double-encoded UTF8, which is obviously wrong.
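
The double encoding can be seen without touching the wiki tables. As a small illustration using a literal instead of a real column: ó is the two bytes C3 B3 in UTF-8, and the _latin1 introducer labels those bytes as latin1, the way the page_title column does.

```sql
-- Each byte is treated as a separate latin1 character (Ã and ³) and each
-- of those is then encoded to UTF-8 on its own.
SELECT HEX(CONVERT(_latin1 X'C3B3' USING utf8)) AS double_encoded;
-- should return C383C2B3: four bytes instead of the original two
```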

How to avoid MySQL being so nice and helpful? Just cast to BINARY before doing the conversion to UTF8! That way MySQL won't assume anything and will do exactly as asked: interpret this bunch of bits as UTF8. The exact syntax is CONVERT(CAST(col_name AS BINARY) USING utf8).

So this is my final query now:
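
A quick way to check the effect of the BINARY cast, again on a literal rather than on the real column (ó as the UTF-8 bytes C3 B3, labelled latin1 to mimic page_title):

```sql
SELECT
  -- The bytes pass through unchanged and are now labelled utf8.
  HEX(CONVERT(CAST(_latin1 X'C3B3' AS BINARY) USING utf8)) AS bytes_kept,
  -- With a real utf8 string, an accent-insensitive collation can do its job.
  CONVERT(CAST(_latin1 X'C3B3' AS BINARY) USING utf8)
    = _utf8 'o' COLLATE utf8_general_ci AS matches_plain_o;
-- should return C3B3 and 1
```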

SELECT CONVERT(CAST(page_title AS BINARY) USING utf8)
FROM page
WHERE
    CONVERT(CAST(page_title AS BINARY) USING utf8)
        LIKE '%keyword_here%'
            COLLATE utf8_spanish_ci

Now if I search for something or sôMëthîNG or any variation, I get all the results!

Please note that I used utf8_spanish_ci because I want the search to differentiate ñ from n but not á from a. Use a different collation according to your use case (MySQL provides many others).
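
The difference between collations can be checked directly. The comparisons below are a small illustration with literals (ñ is C3 B1 and á is C3 A1 in UTF-8), and SHOW COLLATION lists what the server offers. The collation names assume a MySQL 5.5-era server like the one in the question.

```sql
SELECT
  _utf8 X'C3B1' = _utf8 'n' COLLATE utf8_general_ci AS general_n_tilde,  -- ñ vs n
  _utf8 X'C3B1' = _utf8 'n' COLLATE utf8_spanish_ci AS spanish_n_tilde,  -- ñ vs n
  _utf8 X'C3A1' = _utf8 'a' COLLATE utf8_spanish_ci AS spanish_a_acute;  -- á vs a
-- should return 1, 0, 1: only utf8_spanish_ci keeps ñ distinct from n

-- List the UTF-8 collations available on this server.
SHOW COLLATION LIKE 'utf8%';
```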


Case insensitive: you can simply let the database do the work for you (you already do with _ci).

Accents: in order to handle all accents, or at least all known accents, you could use two columns in your database. The first column stores the title as it is (meaning you store SomêthÏng), and you additionally create a second column, search_row, which would in this case contain the string something (without any accents). For the conversion you can write a function with replace rules.

Now you can convert the search string using the same conversion function.

The last step is to create a trigger which fills/updates the field search_row every time you insert/update the title in the table page.

This solution wouldn't have any negative impact on performance either!
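
The steps above might look roughly like this in MySQL. Everything named here (the search_row column definition, strip_accents, the trigger names, the short rule list) is hypothetical and untested against a real MediaWiki install, and adding a column does alter the stock page table. Because page_title holds UTF-8 bytes in a latin1 column, the sketch reinterprets the bytes via BINARY before normalizing.

```sql
-- Hypothetical extra column; note that this alters the stock page table.
ALTER TABLE page
  ADD COLUMN search_row VARCHAR(255) CHARACTER SET utf8 NOT NULL DEFAULT '';

-- Hypothetical conversion function with a deliberately tiny rule list.
-- The accented literals assume a utf8 connection (SET NAMES utf8).
CREATE FUNCTION strip_accents(s VARCHAR(255) CHARSET utf8)
RETURNS VARCHAR(255) CHARSET utf8
DETERMINISTIC
RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LOWER(s),
  'ó', 'o'), 'ò', 'o'), 'í', 'i'), 'ì', 'i'), 'ï', 'i'), 'ê', 'e');

-- Keep search_row in sync whenever a title is inserted or updated.
CREATE TRIGGER page_search_row_bi BEFORE INSERT ON page
FOR EACH ROW
  SET NEW.search_row =
    strip_accents(CONVERT(CAST(NEW.page_title AS BINARY) USING utf8));

CREATE TRIGGER page_search_row_bu BEFORE UPDATE ON page
FOR EACH ROW
  SET NEW.search_row =
    strip_accents(CONVERT(CAST(NEW.page_title AS BINARY) USING utf8));

-- Backfill rows that existed before the triggers were created.
UPDATE page
SET search_row =
  strip_accents(CONVERT(CAST(page_title AS BINARY) USING utf8));

-- Search: normalize the user's input with the same function.
SELECT page_title
FROM page
WHERE search_row LIKE CONCAT('%', strip_accents('SómeThing'), '%');
```

A real rule list would need to cover every accented character the wiki is expected to contain; the six rules here only cover the titles from the question.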
