Keyword highlight is highlighting the highlights in PHP preg_replace()

I have a small search engine doing its thing, and want to highlight the results. I thought I had it all worked out till a set of keywords I used today blew it out of the water.

The issue is that preg_replace() loops through the replacements one pattern at a time, and later replacements match the markup I inserted in earlier ones. Confused? Here is my pseudo function:

public function highlightKeywords ($data, $keywords = array()) {
    $find = array();
    $replace = array();
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    foreach ($keywords as $kw) {
        $find[] = '/' . str_replace("/", "\/", $kw) . '/iu';
        $replace[] = $begin . "\$0" . $end;
    }
    return preg_replace($find, $replace, $data);
}

OK, so it works when searching for "fred" and "dagg", but sadly, when searching for "class", "lass" and "as", it hits a real issue when highlighting "Joseph's Class Group":

Joseph's <span class="keywordHighlight">Cl</span><span <span c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span>="keywordHighlight">c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span></span>="keywordHighlight">ass</span> Group

How would I get the later replacements to work only on the non-HTML components, but also allow the tagging of the whole match? E.g. if I was searching for "cla" and "lass", I would want "class" to be highlighted in full, as both search terms are in it even though they overlap. The highlighting that was applied to the first match has "class" in it, but that shouldn't be highlighted.

Sigh.

I would rather use a PHP solution than a jQuery (or any client-side) one.

Note: I have tried sorting the keywords by length and doing the long ones first, but then the cross-over searches do not highlight: with "cla" and "lass", only part of the word "class" would highlight, and it still murdered the replacement tags :(
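For what it's worth, the tags only get mangled because each keyword is applied in its own pass over text that already contains inserted markup. Below is a minimal sketch of a single-pass variant, assuming plain-text input and PHP 7+. All keywords are combined into one alternation, so inserted spans are never rescanned. The name `highlightSinglePass` is made up for illustration. It does not merge overlapping terms: with "cla" and "lass", only "cla" would be wrapped.

```php
<?php
// Hypothetical helper (not the asker's code): one regex pass over the text.
function highlightSinglePass(string $data, array $keywords): string
{
    // Drop empty keywords, which would otherwise match everywhere.
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $data;
    }
    // Longest first, so "class" wins over "lass" and "as" at the same position.
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $alternatives = array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords);
    $pattern = '/' . implode('|', $alternatives) . '/iu';

    // Only one pass is made, so the inserted markup is never searched again.
    return preg_replace($pattern, '<span class="keywordHighlight">$0</span>', $data);
}

echo highlightSinglePass("Joseph's Class Group", array('class', 'lass', 'as')), "\n";
// Joseph's <span class="keywordHighlight">Class</span> Group
```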

EDIT: I have messed about, starting with pencil & paper and wild ramblings, and come up with some very unglamorous code to solve this issue. It's not great, so suggestions to trim it or speed it up would still be greatly appreciated :)

public function highlightKeywords ($data, $keywords = array()) {
    $find = array();
    $replace = array();
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    $hits = array();
    foreach ($keywords as $kw) {
        $offset = 0;
        while (($pos = stripos($data, $kw, $offset)) !== false) {
            $hits[] = array($pos, $pos + strlen($kw));
            $offset = $pos + 1;
        }
    }
    if ($hits) {
        usort($hits, function($a, $b) {
            if ($a[0] == $b[0]) {
                return 0;
            }
            return ($a[0] < $b[0]) ? -1 : 1;
        });
        $thisthat = array(0 => $begin, 1 => $end);
        for ($i = 0; $i < count($hits); $i++) {
            foreach ($thisthat as $key => $val) {
                $pos = $hits[$i][$key];
                $data = substr($data, 0, $pos) . $val . substr($data, $pos);
                for ($j = 0; $j < count($hits); $j++) {
                    if ($hits[$j][0] >= $pos) {
                        $hits[$j][0] += strlen($val);
                    }
                    if ($hits[$j][1] >= $pos) {
                        $hits[$j][1] += strlen($val);
                    }
                }
            }
        }
    }
    return $data;
}

I had to revisit this subject myself today and wrote a better version of the above. I'll include it here. It's the same idea, only easier to read, and it should perform better since it uses arrays instead of concatenation.

<?php

function highlight_range_sort($a, $b) {
    $A = abs($a);
    $B = abs($b);
    if ($A == $B)
        return $a < $b ? 1 : 0;
    else
        return $A < $B ? -1 : 1;
}

function highlightKeywords($data, $keywords = array(),
       $prefix = '<span class="highlight">', $suffix = '</span>') {

        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        // this will contain offset ranges to be highlighted
        // positive offset indicates start
        // negative offset indicates end
        $ranges = array();

        // find start/end offsets for each keyword
        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $ranges[] = $pos;
                $ranges[] = -($offset = $pos + $length);
            }
        }

        if (!count($ranges))
            return $data;

        // sort offsets by abs(), positive
        usort($ranges, 'highlight_range_sort');

        // combine overlapping ranges by keeping lesser
        // positive and negative numbers
        $i = 0;
        while ($i < count($ranges) - 1) {
            if ($ranges[$i] < 0) {
                if ($ranges[$i + 1] < 0)
                    array_splice($ranges, $i, 1);
                else
                    $i++;
            } else if ($ranges[$i + 1] < 0)
                $i++;
            else
                array_splice($ranges, $i + 1, 1);
        }

        // create substrings
        $ranges[] = strlen($data);
        $substrings = array(substr($data, 0, $ranges[0]));
        for ($i = 0, $n = count($ranges) - 1; $i < $n; $i += 2) {
            // prefix + highlighted_text + suffix + regular_text
            $substrings[] = $prefix;
            $substrings[] = substr($data, $ranges[$i], -$ranges[$i + 1] - $ranges[$i]);
            $substrings[] = $suffix;
            $substrings[] = substr($data, -$ranges[$i + 1], $ranges[$i + 2] + $ranges[$i + 1]);
        }

        // join and return substrings
        return implode('', $substrings);
}

// Example usage:
echo highlightKeywords("This is a test.\n", array("is"), '(', ')');
echo highlightKeywords("Classes are as hard as they say.\n", array("as", "class"), '(', ')');
// Output:
// Th(is) (is) a test.
// (Class)es are (as) hard (as) they say.

OP, something that's not clear in the question is whether $data can contain HTML from the get-go. Can you clarify this?

If $data can itself contain HTML, you are getting into the realm of attempting to parse a non-regular language with a regular-language parser, and that's not going to work out well.

In such a case, I would suggest loading the $data HTML into a PHP DOMDocument, getting hold of all of the text nodes, and running one of the other perfectly good answers on the contents of each text node in turn.
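A rough sketch of that idea, assuming the DOM extension is available, the input is a UTF-8 HTML fragment, and PHP 7+. The function name `highlightHtmlText` and the wrapper id `hl-root` are made up for illustration. For brevity it uses a simple single-pass alternation on each text node rather than one of the range-merging answers, so overlapping keywords are not merged.

```php
<?php
// Hypothetical helper: highlight keywords only inside text nodes of an HTML fragment.
function highlightHtmlText(string $html, array $keywords): string
{
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $html;
    }
    // Longest keyword first, each one escaped for use inside the regex.
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $pattern = '/(' . implode('|', array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords)) . ')/iu';

    // Wrap the fragment in a known container; the XML hint asks libxml to read UTF-8.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="hl-root"]')->item(0);

    // Copy the node list first, because the tree is modified while walking it.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    foreach ($textNodes as $node) {
        // With one capture group, odd indexes hold the matched keywords.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'keywordHighlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the html/body scaffolding.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// The class attribute is left alone; only the text "class" is wrapped.
echo highlightHtmlText('<p class="class">A class act</p>', array('class')), "\n";
```

Because matches are searched only in text nodes, attribute values and tag names such as `class="..."` are never touched. The serializer may normalize the markup slightly, so the output is not guaranteed to be byte-identical to the input outside the highlights.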

I've used the following to address this problem:

<?php

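$protected_matches = array();

// Pass 1 callback: stash the matched text and leave a NUL-wrapped index in its place.
function protect($matches) {
    global $protected_matches;
    return "\0" . array_push($protected_matches, $matches[0]) . "\0";
}

// Pass 2 callback: swap each placeholder for the original text wrapped in a span.
function restore($matches) {
    global $protected_matches;
    return '<span class="keywordHighlight">' .
              $protected_matches[$matches[1] - 1] . '</span>';
}

// $patterns (keyword regexes) and $target_string (text to search) come from the caller.
$highlighted = preg_replace_callback('/\x0(\d+)\x0/', 'restore',
    preg_replace_callback($patterns, 'protect', $target_string));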

The inner preg_replace_callback runs first: it pulls out all matches and replaces them with NUL-byte-wrapped placeholders. The outer one then replaces the placeholders with the span tags.

Edit: Forgot to mention that $patterns was sorted by string length, longest to shortest.
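To make the two passes and the length sort concrete, here is a self-contained sketch, assuming PHP 7+. The function name `highlightWithPlaceholders` is made up, and it uses closures instead of a global array. It also assumes that neither the text nor the keywords contain NUL bytes and that no keyword consists of digits, since a digit keyword could match the number inside a placeholder.

```php
<?php
// Hypothetical wrapper around the two-pass placeholder idea.
function highlightWithPlaceholders(string $text, array $keywords): string
{
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $text;
    }
    // Longest keyword first, so "class" is protected before "lass" or "as" run.
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $patterns = array_map(function ($kw) {
        return '/' . preg_quote($kw, '/') . '/iu';
    }, $keywords);

    $protected = array();

    // Pass 1: replace every match with "\0<index>\0" and remember the original text.
    $masked = preg_replace_callback($patterns, function ($m) use (&$protected) {
        $protected[] = $m[0];
        return "\0" . count($protected) . "\0";
    }, $text);

    // Pass 2: turn each placeholder back into the original text wrapped in a span.
    return preg_replace_callback('/\x00(\d+)\x00/', function ($m) use ($protected) {
        return '<span class="keywordHighlight">' . $protected[$m[1] - 1] . '</span>';
    }, $masked);
}

echo highlightWithPlaceholders("Joseph's Class Group", array('as', 'lass', 'class')), "\n";
// Joseph's <span class="keywordHighlight">Class</span> Group
```

Once "Class" has been swapped for a placeholder, the shorter keywords have nothing left to match inside it, and the span markup is only added after all keyword passes have finished. Overlapping terms such as "cla" and "lass" are still not merged into one highlight.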

Edit: another solution

<?php
        function highlightKeywords($data, $keywords = array(),
            $prefix = '<span class="hilite">', $suffix = '</span>') {

        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        $start = array();
        $end   = array();

        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $start[] = $pos;
                $end[]   = $offset = $pos + $length;
            }
        }

        if (!count($start)) return $data;

        sort($start);
        sort($end);

        // Merge and sort start/end using negative values to identify endpoints
        $zipper = array();
        $i = 0;
        $n = count($end);

        while ($i < $n)
            $zipper[] = count($start) && $start[0] <= $end[$i]
                ? array_shift($start)
                : -$end[$i++];

        // EXAMPLE:
        // [ 9, 10, -14, -14, 81, 82, 86, -86, -86, -90, 99, -103 ]
        // take 9, discard 10, take -14, take -14, create pair,
        // take 81, discard 82, discard 86, take -86, take -86, take -90, create pair
        // take 99, take -103, create pair
        // result: [9,14], [81,90], [99,103]

        // Generate non-overlapping start/end pairs
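        $spans = array();
        $a = array_shift($zipper);
        $z = null;
        // Compare against null so that a repeated start offset of 0
        // does not end the loop early
        while (($x = array_shift($zipper)) !== null) {
            if ($x < 0)
                $z = $x;
            else if ($z !== null) {
                $spans[] = array($a, -$z);
                $a = $x;
                $z = null;
            }
        }
        $spans[] = array($a, -$z);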

        // Insert the prefix/suffix in the start/end locations
        $n = count($spans);
        while ($n--)
            $data = substr($data, 0, $spans[$n][0])
            . $prefix
            . substr($data, $spans[$n][0], $spans[$n][1] - $spans[$n][0])
            . $suffix
            . substr($data, $spans[$n][1]);

        return $data;
    }
