Regular expression that skips anchor tags when matching

Question

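I have written a regular expression that searches for a specific keyword and replaces that keyword with a link to a specific URL.

My current regular expression is: \\b$keyword\\b

One problem with it is that if my data contains an anchor tag, and that tag contains the keyword, the regex replaces the keyword inside the anchor tag as well.

I want to search the given data but skip anything inside anchor tags. Please help me out. Thanks for your help.

Example. Keyword: Disney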

Input:

This is <a href="/test.php"> Disney </a> The disney should be replaceable

Expected output:

This is <a href="/test.php"> Disney </a> The <a href="any-url.php">disney</a> should be replaceable

Invalid output (what I get now):

This is <a href="/test.php"> <a href="any-url.php">Disney</a> </a> The <a href="any-url.php">disney</a> should be replaceable

Answer 1

I've modified a function of mine that highlights search phrases on a page; here you go:

$html = 'This is <a href="/test.php"> Disney </a> The disney should be replaceable.'.PHP_EOL;
$html .= 'Let\'s test also use of keyword inside other tags, for example as class name:'.PHP_EOL;
$html .= '<b class=disney></b> - this should not be replaced with link, and it isn\'t!'.PHP_EOL;

$result = ReplaceKeywordWithLink($html, "disney", "any-url.php");
echo nl2br(htmlspecialchars($result));

function ReplaceKeywordWithLink($html, $keyword, $link)
{
    if (strpos($html, "<") !== false) {
        $id = 0;
        $unique_array = array();
        // Hide existing anchor tags with some unique string.
        preg_match_all("#<a[^<>]*>[\s\S]*?</a>#i", $html, $matches);
        foreach ($matches[0] as $tag) {
            $id++;
            $unique_string = "@@@@@$id@@@@@";
            $unique_array[$unique_string] = $tag;
            $html = str_replace($tag, $unique_string, $html);
        }
        // Hide all tags by replacing with some unique string.
        preg_match_all("#<[^<>]+>#", $html, $matches);      
        foreach ($matches[0] as $tag) {
            $id++;
            $unique_string = "@@@@@$id@@@@@";
            $unique_array[$unique_string] = $tag;
            $html = str_replace($tag, $unique_string, $html);
        }
    }
    // Then we replace the keyword with link.
    // Escape regex metacharacters in the keyword, including the "#" delimiter used below.
    $keyword = preg_quote($keyword, '#');
    assert(strpos($keyword, '$') === false);
    $html = preg_replace('#(\b)('.$keyword.')(\b)#i', '$1<a href="'.$link.'">$2</a>$3', $html);
    // We get back all the tags by replacing unique strings with their corresponding tag.
    if (isset($unique_array)) {     
        foreach ($unique_array as $unique_string => $tag) {
            $html = str_replace($unique_string, $tag, $html);
        }
    }
    return $html;
}

Result:

This is <a href="/test.php"> Disney </a> The <a href="any-url.php">disney</a> should be replaceable.
Let's test also use of keyword inside other tags, for example as class name:
<b class=disney></b> - this should not be replaced with link, and it isn't!

Answer 2

Add this to the end of your regex:

(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))

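This lookahead tries to match either the next opening <a> tag or the end of the input, but only if it doesn't see a closing </a> tag first. Assuming the HTML is minimally well-formed, the lookahead will fail whenever the match starts after the beginning of an <a> tag and before the corresponding </a> tag.

To prevent it from matching inside any other tag (e.g., <div class="disney">), you can add this lookahead as well: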

(?![^<>]*+>)

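With this one I'm assuming there won't be any angle brackets in the attribute values of tags. Such brackets are legal according to the HTML 4 spec, but extremely rare in the real world.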

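If you're writing the regex as a PHP double-quoted string (which you must be, if you expect the $keyword variable to be interpolated), you should double all the backslashes. Strictly speaking, \z and \b would both survive undoubled, because PHP leaves escape sequences it doesn't recognize untouched and, unlike some other languages, does not treat \b as a backspace. Doubling still avoids surprises with sequences PHP does recognize, such as \t or \$.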
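
To make the quoting advice concrete, here is a small sketch that builds the same word-boundary pattern two ways. The variable names and the "~" delimiter are my own choices for illustration, not something from the question.

```php
<?php
$keyword = 'disney';

// Double-quoted: $keyword is interpolated; the backslashes are doubled so
// that PCRE receives a literal \b regardless of how PHP treats the escape.
$doubleQuoted = "~\\b$keyword\\b~i";

// Single-quoted with concatenation: no interpolation, and only \\ and \'
// are escapes, so \b is passed through as written. preg_quote() also
// escapes any regex metacharacters (and the "~" delimiter) in the keyword.
$singleQuoted = '~\b' . preg_quote($keyword, '~') . '\b~i';

var_dump($doubleQuoted === $singleQuoted);                // bool(true)
var_dump(preg_match($singleQuoted, 'The disney park'));   // int(1)
var_dump(preg_match($singleQuoted, 'predisneyland'));     // int(0)
```

The single-quoted form is usually the easier one to keep correct, since the regex reads exactly as PCRE will see it.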

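EDIT: On second thought, definitely do add the second lookahead. I mean, why wouldn't you want to prevent matches inside tags? And put it first, because it tends to evaluate more quickly than the other one: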

(?![^<>]*+>)(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))
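
As a rough sketch of how the combined lookaheads could be wired into a replacement, assuming a helper named linkKeywordOutsideAnchors (hypothetical, not part of the question's code) and the same minimally well-formed HTML this answer assumes:

```php
<?php
// Hypothetical helper: wrap a keyword in a link unless it sits inside a tag
// or between <a ...> and </a>.
function linkKeywordOutsideAnchors(string $html, string $keyword, string $url): string
{
    // Single-quoted pieces, so the backslashes reach PCRE unchanged.
    $pattern = '~\b' . preg_quote($keyword, '~') . '\b'
        . '(?![^<>]*+>)'                              // not inside any tag
        . '(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))'  // not inside <a>...</a>
        . '~i';

    // A callback is used so that "$" or "\" in the URL is never read as a
    // backreference; escaping the URL for the attribute is my own addition.
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($url): string {
            return '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '">' . $m[0] . '</a>';
        },
        $html
    );

    return $result ?? $html; // fall back to the input if PCRE reports an error
}

$input = 'This is <a href="/test.php"> Disney </a> The disney should be replaceable';
echo linkKeywordOutsideAnchors($input, 'disney', 'any-url.php'), PHP_EOL;
// prints: This is <a href="/test.php"> Disney </a> The <a href="any-url.php">disney</a> should be replaceable
```

Because preg_replace_callback() finds every match against the original string, the links it inserts do not interfere with the lookaheads for later matches.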

Answer 3

Strip the tags first, then search the stripped text.
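
One possible reading of this suggestion, sketched below with a hypothetical function name. Note the assumption: strip_tags() on its own keeps the text between <a> and </a>, so the sketch drops whole anchor elements first. It only reports whether the keyword occurs outside anchors; mapping a hit back to a position in the original markup is not something this approach covers.

```php
<?php
// Hypothetical helper: does the keyword occur anywhere outside <a>...</a>?
function keywordAppearsOutsideAnchors(string $html, string $keyword): bool
{
    // Assumption: remove whole anchor elements first, because strip_tags()
    // alone would leave the anchor text in place.
    $withoutAnchors = preg_replace('~<a\b[^<>]*>.*?</a>~is', ' ', $html) ?? $html;

    // Remove the remaining tags (and with them any attribute values).
    $text = strip_tags($withoutAnchors);

    return preg_match('~\b' . preg_quote($keyword, '~') . '\b~i', $text) === 1;
}

var_dump(keywordAppearsOutsideAnchors(
    'This is <a href="/test.php"> Disney </a> The disney should be replaceable',
    'disney'
)); // bool(true)
var_dump(keywordAppearsOutsideAnchors('Only <a href="/test.php"> Disney </a> here', 'disney')); // bool(false)
```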

Answer 1: score 2, accepted, posted 2011-11-15 11:40:51
Answer 2: score 1, posted 2011-11-15 12:10:14
Answer 3: score 0, posted 2011-11-15 09:23:32
