
Regex to match words or phrases in a string but NOT match if part of a URL or inside <a> </a> tags (PHP)

I am aware that regex is not ideal for use with HTML strings, and I have looked at the PHP Simple HTML DOM Parser, but I still believe this is the way to go. All the HTML tags will be generated by my forum software, so they will be consistent and valid HTML.

What I am trying to do is make a plugin that will find a list of keywords (or phrases) in a string of HTML and replace them with a link I specify. For example, if someone types:

I use Amazon for that.

it would replace it with:

I use <a href="http://www.amazon.com">Amazon</a> for that.

The problem, of course, is that if "amazon" is in the URL it would also get replaced. I solved that issue with a callback function found on this site, slightly modified.

But I still have an issue: it still replaces words between opening and closing tags.

<a href="http://www.amazon.com">My Amazon Link</a>

It will match the "Amazon" in "My Amazon Link".

What I really need is a regex to match, say, "amazon" anywhere except between <a href and </a>.

Any ideas?

Using the DOM would certainly be preferable.

However, you might get away with this:

$result = preg_replace('%Amazon(?![^<]*</a>)%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);

It matches Amazon only if

  1. it's not followed by a closing </a> tag,
  2. it's not itself part of a tag,
  3. there are no intervening tags, i.e. it will be thrown off if tags can be nested inside <a> tags.

It will therefore change this:

I use Amazon for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "Amazon" in "My Amazon Link"

into this:

I use <a href="http://www.amazon.com">Amazon</a> for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "<a href="http://www.amazon.com">Amazon</a>" in "My <a href="http://www.amazon.com">Amazon</a> Link"
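
The third caveat in the list above is easy to reproduce. The following is only a small sketch, using the same pattern and replacement as the answer; the `$nested` input with a `<b>` inside the link is a hypothetical example, not something from the question.

```php
<?php
// Same pattern and replacement as in the answer above.
$pattern = '%Amazon(?![^<]*</a>)%i';
$replacement = '<a href="http://www.amazon.com">Amazon</a>';

// Plain link: the URL and the link text are both followed by "</a>"
// with no other "<" in between, so neither is replaced.
$flat = '<a href="http://www.amazon.com">My Amazon Link</a>';
$flatResult = preg_replace($pattern, $replacement, $flat);

// Hypothetical input with a <b> nested inside the link. The lookahead
// stops at "<b>" before it reaches "</a>", so it no longer detects
// that the match sits inside a link.
$nested = '<a href="http://www.amazon.com">Amazon <b>Link</b></a>';
$nestedResult = preg_replace($pattern, $replacement, $nested, -1, $count);

var_dump($flatResult === $flat); // bool(true): flat link left untouched
var_dump($count);                // int(2): URL and link text both replaced
```

With nested tags, both the "amazon" in the href and the "Amazon" in the link text are rewritten, which corrupts the markup. If the forum software can emit tags inside links, this pattern is probably not safe to use.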

Don't do this. You cannot reliably do this with regex, no matter how consistent your HTML is.

Something like this should work, however:

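<?php
$dom = new DOMDocument;
$dom->load('test.xml');
$x = new DOMXPath($dom);

// Select text nodes that contain "Amazon" and are not inside an <a> element.
$nodes = $x->query("//text()[contains(., 'Amazon')][not(ancestor::a)]");

foreach ($nodes as $node) {
    while (false !== strpos($node->nodeValue, 'Amazon')) {
        // Split the text node into: text before, the word itself, text after.
        $word = $node->splitText(strpos($node->nodeValue, 'Amazon'));
        $after = $word->splitText(6); // 6 = strlen('Amazon')

        $link = $dom->createElement('a');
        $link->setAttribute('href', 'http://www.amazon.com');

        // Put the link where the word was, then move the word inside the link.
        $word->parentNode->replaceChild($link, $word);
        $link->appendChild($word);

        // Continue searching in the remaining text.
        $node = $after;
    }
}

$html = $dom->saveHTML();
echo $html;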

It's verbose, but it will actually work.
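
Since the question is about an HTML string rather than an XML file, the same idea could be wrapped in a function that takes a fragment. This is a rough sketch under several assumptions: `linkKeyword` is a hypothetical helper name, the fragment is wrapped in a temporary `<div>` so it has a single root, the `LIBXML_HTML_NOIMPLIED` and `LIBXML_HTML_NODEFDTD` flags are available in the installed libxml, and the text is plain ASCII (multibyte input would likely need `mb_strpos` and `mb_strlen` instead, plus explicit encoding handling).

```php
<?php
// Hypothetical helper: wraps each occurrence of $keyword in a link,
// skipping text that is already inside an <a> element.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $dom = new DOMDocument();
    // Wrap the fragment in one <div> so it has a single root element,
    // and ask libxml not to add <html>/<body> or a doctype.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    $xpath = new DOMXPath($dom);

    // Text nodes that have no <a> ancestor.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        // ASCII assumed: strpos/strlen count bytes, splitText counts characters.
        while (($pos = strpos($node->nodeValue, $keyword)) !== false) {
            $word  = $node->splitText($pos);
            $after = $word->splitText(strlen($keyword));

            $link = $dom->createElement('a');
            $link->setAttribute('href', $url);
            $word->parentNode->replaceChild($link, $word);
            $link->appendChild($word);

            $node = $after;
        }
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkKeyword(
    'I use Amazon. <a href="http://www.amazon.com">My Amazon Link</a>',
    'Amazon',
    'http://www.amazon.com'
);
```

The existing link should come through unchanged while the bare "Amazon" gets wrapped. Unlike the lookahead regex, the `ancestor::a` test does not care how deeply tags are nested inside the link.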

Try this:

Amazon(?![^<]*</a>)

This will search for Amazon, and the negative lookahead ensures that no closing </a> tag follows it. Inside the lookahead I only skip over characters that are not <, so that it does not accidentally read past an opening tag.

http://regexr.com

Joe, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)

With all the disclaimers about using regex to parse HTML, here is a simple way to do it.

Here's our simple regex:

<a.*?</a>(*SKIP)(*F)|amazon

The left side of the alternation matches complete <a ... </a> tags, then deliberately fails. The right side matches amazon, and we know this is the right amazon because it was not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

<?php
$target = "word1 <a stuff amazon> </a> word2 amazon";
$regex = "~(?i)<a.*?</a>(*SKIP)(*F)|amazon~";
$repl= '<a href="http://www.amazon.com">Amazon</a>';
$new=preg_replace($regex,$repl,$target);
echo htmlentities($new);

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...
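
The question asks about a list of keywords, and the skip-and-fail pattern above could be extended to that case along the following lines. This is only a sketch: the `$links` map and its entries are placeholders, and it deviates from the answer's regex in three ways that are my own assumptions: `<a\b` instead of `<a` (so tags such as `<abbr>` are not treated as links), the `s` flag (so links spanning several lines are still skipped), and `\b` word boundaries around the keywords.

```php
<?php
// Hypothetical keyword => URL map; keys must be lowercase for the lookup below.
$links = [
    'amazon' => 'http://www.amazon.com',
    'php'    => 'https://www.php.net',
];

// Escape each keyword so regex metacharacters in it are matched literally.
$alternatives = implode('|', array_map(
    function ($keyword) {
        return preg_quote($keyword, '~');
    },
    array_keys($links)
));

// Left branch: consume a whole <a ...>...</a> element, then skip and fail.
// Right branch: any keyword as a whole word, case-insensitively.
$regex = '~<a\b.*?</a>(*SKIP)(*F)|\b(?:' . $alternatives . ')\b~is';

$target = 'I use Amazon and PHP. <a href="http://www.amazon.com">My Amazon Link</a>';

$new = preg_replace_callback($regex, function (array $m) use ($links) {
    // Look up the URL by the lowercased match; keep the original casing as link text.
    $url = $links[strtolower($m[0])];
    return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
}, $target);

echo $new;
```

Using a callback instead of a fixed replacement string lets each keyword map to its own URL and preserves the casing the user typed.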

Unfortunately I think the logic you need is still more complex than text pattern matching :-/

I know it's not the answer you want to hear, but you'll probably get better results with a DOM model.

Here's a discussion of this topic elsewhere: http://coderzone.org/forum/index.php?topic=84.0

Is it possible to just run the filter once, so you don't end up with dupes? Or could the original corpus also include links?

An improvement: it should link only if it is the whole word "Amazon" and not words like AmazonWorld.

$result = preg_replace('%\bAmazon(?![^<]*</a>)\b%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);
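
A quick check of the word-boundary version, as a sketch; the `$subject` string is a made-up test input:

```php
<?php
$pattern = '%\bAmazon(?![^<]*</a>)\b%i';
$replacement = '<a href="http://www.amazon.com">Amazon</a>';

// "AmazonWorld" has no word boundary after "Amazon", so it is left alone.
$subject = 'Amazon is not AmazonWorld.';
$result = preg_replace($pattern, $replacement, $subject, -1, $count);

echo $result, "\n";
var_dump($count); // int(1): only the standalone word was linked
```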

Use this code:

$p = '~((<a\s)(?(2)[^>]*?>))?(amazon)~smi';

$str = '<a href="http://www.amazon.com">Amazon</a>';

$s = preg_replace($p, "$1My $3 Link", $str);
var_dump($s);

OUTPUT

string(50) "<a href="http://www.amazon.com">My Amazon Link</a>"
