Highlighting Search Results: RegEx Character Collation?

When I run a full-text MySQL query, Unicode character collations give me results matching all of the following, whichever of them I query for: saka, sakā, śāka, ṣaka etc.

Where I'm stuck is with highlighting the matches in search results. With standard RegEx, I can only match and highlight the original query word in the results -- not all the collated matches.

How would one go about solving this? I've initially thought of these approaches:

  • Creating a RegEx pattern that would analyze the target results against all possible variants. Would easily turn into one monster of a bloated pattern.
  • Creating a normalized version of the results, locating the matches there, and using the string positions as a basis for highlighting.

However, both these approaches incur a substantial processing overhead compared to regular search-result highlighting. The first approach would incur a mighty CPU overhead; the second would probably eat up less CPU but munch at least twice the RAM for the results. Any suggestions?
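For reference, the second approach (normalize, locate, highlight by position) could be sketched roughly as below. This is only an illustration under assumptions: the names `IAST_FOLD`, `utf8_chars` and `highlight_by_position` are made up for the example, the text is assumed to be precomposed UTF-8, and matching is case-sensitive to keep the sketch short. Folding is done one character at a time, so an index in the folded copy is the same index in the original.

```php
<?php
// Hypothetical fold table: each diacritic variant maps to its base letter.
const IAST_FOLD = [
    'ā' => 'a', 'ḍ' => 'd', 'ӗ' => 'e', 'ḥ' => 'h', 'ī' => 'i',
    'ḷ' => 'l', 'ḹ' => 'l', 'ṁ' => 'm', 'ṃ' => 'm',
    'ñ' => 'n', 'ṅ' => 'n', 'ṇ' => 'n', 'ṛ' => 'r', 'ṝ' => 'r',
    'ś' => 's', 'ṣ' => 's', 'ṭ' => 't', 'ū' => 'u',
];

// Split a UTF-8 string into an array of single characters.
function utf8_chars(string $s): array {
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Find $query in a folded copy of $text and wrap the matching
// stretch of the ORIGINAL characters in <b> tags.
function highlight_by_position(string $text, string $query): string {
    $fold   = fn(string $c): string => IAST_FOLD[$c] ?? $c;
    $orig   = utf8_chars($text);
    $folded = array_map($fold, $orig);            // same length as $orig
    $needle = array_map($fold, utf8_chars($query));
    $n      = count($needle);
    $len    = count($orig);

    $out = '';
    $i = 0;
    while ($i < $len) {
        if ($n > 0 && array_slice($folded, $i, $n) === $needle) {
            // Positions come from the folded copy, text from the original.
            $out .= '<b>' . implode('', array_slice($orig, $i, $n)) . '</b>';
            $i += $n;
        } else {
            $out .= $orig[$i];
            $i++;
        }
    }
    return $out;
}

echo highlight_by_position('the śāka era', 'saka'), "\n";
// prints: the <b>śāka</b> era
```

The folded copy is what would cost the extra memory mentioned above; it only needs to live for the duration of one result string.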

PS In case it's relevant: The specific character set I'm dealing with (IAST for Sanskrit transliteration with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U; in total A-Z + 19 diacritic variants; + uppercase (that poses no problem here).

With MySQL and its REGEXP, you can only locate row(s) that match the REGEXP. You cannot locate the match within the column.

REGEXP and LIKE both honor the collation of the column in question, but that does not help in locating the text within the column.

Check out MariaDB and its REGEXP_REPLACE.

MySQL at least has a bug relating to it: http://bugs.mysql.com/bug.php?id=70767

Here's what I ended up doing. Seems to have negligible impact on performance. (I noticed none!)

First, a function that converts the query word into a regular expression iterating the variants:

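function iast_normalize_regex($str) {

    // Each entry lists a base letter followed by its diacritic variants.
    $subst = [
        'a|ā', 'd|ḍ', 'e|ӗ', 'h|ḥ', 'i|ī', 'l|ḷ|ḹ', 'm|ṁ|ṃ',
        'n|ñ|ṅ|ṇ', 'r|ṛ|ṝ', 's|ś|ṣ', 't|ṭ', 'u|ū'
        ];

    // Map every character of a group to the alternation for the whole group.
    $subst_rex = [];

    foreach($subst as $variants) {
        $chars = explode('|', $variants);
        foreach($chars as $char) {
            $subst_rex[$char] = "({$variants})";
        }
    }

    // Split into UTF-8 characters; plain str_split() would cut multibyte letters apart.
    $str_chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    $str_rex = '';
    foreach($str_chars as $char) {
        // Characters without variants are escaped so regex metacharacters stay literal.
        $str_rex .= isset($subst_rex[$char]) ? $subst_rex[$char] : preg_quote($char, '#');
    }

    return $str_rex;
}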

Which turns the words saka, śaka etc. into (s|ś|ṣ)(a|ā)k(a|ā). Then, the variant-iterated word-pattern is used to highlight the search results:

$word = iast_normalize_regex($word);
$result = preg_replace("#({$word})#iu", "<b>$1</b>", $result);
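A slightly more compact take on the same idea, offered only as a sketch: character classes such as [sśṣ] instead of alternation groups, so the pattern carries no capture groups of its own. The function names `iast_class_regex` and `iast_highlight` are hypothetical, the variant table is copied from the function above, and the input is assumed to be precomposed UTF-8.

```php
<?php
// Hypothetical variant of the pattern builder: emits character classes
// such as [sśṣ] instead of alternation groups such as (s|ś|ṣ).
function iast_class_regex(string $word): string {
    $groups = ['aā', 'dḍ', 'eӗ', 'hḥ', 'iī', 'lḷḹ', 'mṁṃ',
               'nñṅṇ', 'rṛṝ', 'sśṣ', 'tṭ', 'uū'];

    // Map every member of a group to the class for the whole group.
    $class_of = [];
    foreach ($groups as $group) {
        foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $class_of[$char] = "[{$group}]";
        }
    }

    $rex = '';
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Characters outside the table are escaped so they match literally.
        $rex .= $class_of[$char] ?? preg_quote($char, '#');
    }
    return $rex;
}

// Wrap every collated variant of $word found in $text in <b> tags.
function iast_highlight(string $text, string $word): string {
    $rex = iast_class_regex($word);
    if ($rex === '') {
        return $text; // an empty pattern would match at every position
    }
    // $0 is the whole match, so no extra capture group is needed.
    return preg_replace("#{$rex}#iu", '<b>$0</b>', $text);
}

echo iast_highlight('saka, sakā, śāka, ṣaka', 'saka'), "\n";
// prints: <b>saka</b>, <b>sakā</b>, <b>śāka</b>, <b>ṣaka</b>
```

Neither version escapes the result text for HTML, so if the results can contain markup, that would have to be handled before highlighting.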

Presto: I get all the variants highlighted. Thanks for the contributions so far, and please let me know if you can think of better ways to accomplish this. Cheers!
