Highlighting Search Results: RegEx Character Collation?

When I run a fulltext MySQL query, thanks to Unicode character collations I get results matching all of the following, whichever of them I query for: saka, sakā, śāka, ṣaka etc.

Where I'm stuck is with highlighting the matches in search results. With standard RegEx, I can only match and highlight the original query word in the results, not all the collated matches.

How would one go about solving this? I initially thought of these approaches:

  • Creating a RegEx pattern that would test the target results against all possible variants. This would easily turn into a bloated monster of a pattern.
  • Creating a normalized version of the results, locating the matches there, and using the string positions as a basis for highlighting.
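
The second approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names iast_fold_chars() and iast_highlight() are hypothetical, the variant table is assumed from the IAST letters described in this thread, the input is assumed to be valid UTF-8 with precomposed characters (so that folding maps one character to one character), and case folding is left out.

```php
<?php
// Hypothetical helper: split a string into Unicode characters and fold each
// IAST diacritic variant to its base letter. Assumes valid UTF-8 input.
function iast_fold_chars(string $str): array {
    static $map = null;
    if ($map === null) {
        $map = [];
        // Assumed variant table: base letter first, then its variants.
        $groups = ['a|ā', 'd|ḍ', 'e|ӗ', 'h|ḥ', 'i|ī', 'l|ḷ|ḹ', 'm|ṁ|ṃ',
                   'n|ñ|ṅ|ṇ', 'r|ṛ|ṝ', 's|ś|ṣ', 't|ṭ', 'u|ū'];
        foreach ($groups as $group) {
            $chars = explode('|', $group);
            foreach ($chars as $char) {
                $map[$char] = $chars[0];
            }
        }
    }
    // One array element per character, so indexes line up with the original.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(fn($c) => $map[$c] ?? $c, $chars);
}

// Hypothetical helper: find the folded query in the folded text, then wrap
// the corresponding slice of the ORIGINAL text in <b> tags.
function iast_highlight(string $text, string $query): string {
    $orig   = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $folded = iast_fold_chars($text);
    $needle = iast_fold_chars($query);
    $n      = count($needle);
    $len    = count($orig);
    $out    = '';
    $i      = 0;
    while ($i < $len) {
        if ($n > 0 && array_slice($folded, $i, $n) === $needle) {
            $out .= '<b>' . implode('', array_slice($orig, $i, $n)) . '</b>';
            $i += $n;
        } else {
            $out .= $orig[$i];
            $i++;
        }
    }
    return $out;
}

echo iast_highlight('the śāka era and ṣaka', 'saka'), "\n";
```

The folded copy exists only for the duration of the call, but it does hold a second character array for each result while that result is being highlighted.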

However, both of these approaches incur a substantial processing overhead compared to regular search-result highlighting. The first approach would incur a mighty CPU overhead; the second would probably eat up less CPU but munch at least twice the RAM for the results. Any suggestions?

PS: In case it's relevant: the specific character set I'm dealing with (IAST for Sanskrit transliteration, with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U; in total A-Z + 19 diacritic variants, plus uppercase (which poses no problem here).

With MySQL and its REGEXP, you can only locate the row(s) that match the REGEXP. You cannot locate the match within the column.

REGEXP and LIKE both honor the collation of the column in question, but that does not help in locating the text within the column.

Check out MariaDB and its REGEXP_REPLACE.

MySQL at least has a bug relating to it: http://bugs.mysql.com/bug.php?id=70767

Here's what I ended up doing. It seems to have negligible impact on performance. (I noticed none!)

First, a function that converts the query word into a regular expression iterating the variants:

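function iast_normalize_regex($str) {

    // Each entry lists a base letter followed by its diacritic variants.
    $subst = [ 
        'a|ā', 'd|ḍ', 'e|ӗ', 'h|ḥ', 'i|ī', 'l|ḷ|ḹ', 'm|ṁ|ṃ', 
        'n|ñ|ṅ|ṇ', 'r|ṛ|ṝ', 's|ś|ṣ', 't|ṭ', 'u|ū' 
        ];

    // Map every character of a group to the alternation covering the whole group.
    $subst_rex = [];

    foreach($subst as $variants) {
        $chars = explode('|', $variants);
        foreach($chars as $char) {
            $subst_rex[$char] = "({$variants})";
        }
    }

    // Split into Unicode characters; the u modifier expects valid UTF-8 input.
    $str_chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    $str_rex = '';
    foreach($str_chars as $char) {
        // Escape unmapped characters so they match literally inside a #...# pattern.
        $str_rex .= isset($subst_rex[$char]) ? $subst_rex[$char] : preg_quote($char, '#');
    }

    return $str_rex;
}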

Which turns the words saka, śaka etc. into (s|ś|ṣ)(a|ā)k(a|ā). Then, the variant-iterated word pattern is used to highlight the search results:

$word = iast_normalize_regex($word);
$result = preg_replace("#({$word})#iu", "<b>$1</b>", $result);

Presto: I get all the variants highlighted. Thanks for the contributions so far, and please let me know if you can think of better ways to accomplish this. Cheers!
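
One possible refinement, offered only as a sketch: the same pattern can be built from character classes instead of alternation groups, which avoids creating an extra capture group per letter. The function name iast_class_regex() is hypothetical and not part of the code above; the variant groups are copied from it, and the sample text is made up for illustration.

```php
<?php
// Hypothetical variant of iast_normalize_regex() that emits character
// classes such as [sśṣ] rather than alternations such as (s|ś|ṣ).
function iast_class_regex(string $str): string {
    // Same variant groups as above, written without separators.
    $groups = ['aā', 'dḍ', 'eӗ', 'hḥ', 'iī', 'lḷḹ', 'mṁṃ',
               'nñṅṇ', 'rṛṝ', 'sśṣ', 'tṭ', 'uū'];

    // Map every character of a group to the class covering the whole group.
    $class_for = [];
    foreach ($groups as $group) {
        foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $class_for[$char] = "[{$group}]";
        }
    }

    $rex = '';
    foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Unmapped characters are escaped so they match literally.
        $rex .= $class_for[$char] ?? preg_quote($char, '#');
    }
    return $rex;
}

// Made-up sample text for illustration.
$result      = 'saka, sakā, śāka and ṣaka';
$pattern     = '#(' . iast_class_regex('saka') . ')#iu';
$highlighted = preg_replace($pattern, '<b>$1</b>', $result);
echo $highlighted, "\n";
```

With this form the only capture group in the final pattern is the outer one, so $1 always refers to the whole matched word.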
