
PHP Regex match words in a string excluding one specific word

I have a text ($txt), an array of words ($words) that I want to wrap in links, and a phrase ($wordToExclude) that must not be replaced.

$words = array ('adipiscing','molestie','fringilla');
$wordToExclude = 'consectetur adipiscing';


$txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque
mattis tincidunt dolor sed consequat. Sed rutrum, mauris convallis bibendum 
dignissim, ligula sem molestie massa, vitae condimentum neque sem non tellus.
Aenean dolor enim, cursus vel sodales ac, condimentum ac erat. Quisque
lobortis libero nec arcu fringilla imperdiet. Pellentesque commodo, 
arcu et dictum tincidunt, ipsum elit molestie ipsum, ut ultricies nisl
neque in velit. Curabitur luctus dui id urna consequat vitae mattis
turpis pretium. Donec nec adipiscing velit.';

I want to obtain this result:

$txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque
mattis tincidunt dolor sed consequat. Sed rutrum, mauris convallis bibendum 
dignissim, ligula sem <a href="#">molestie</a> massa, vitae condimentum neque sem non tellus.
Aenean dolor enim, cursus vel sodales ac, condimentum ac erat. Quisque
lobortis libero nec arcu <a href="#">fringilla</a> imperdiet. Pellentesque commodo, 
arcu et dictum tincidunt, ipsum elit <a href="#">molestie</a> ipsum, ut ultricies nisl
neque in velit. Curabitur luctus dui id urna consequat vitae mattis
turpis pretium. Donec nec <a href="#">adipiscing</a> velit.';

$result = preg_replace(
    '/\b                 # Word boundary
    (                    # Match one of the following:
     (?<!consectetur\s)  #  (unless preceded by "consectetur ")
     adipiscing          #  adipiscing
    |                    # or
     molestie            #  molestie
    |                    # etc.
     fringilla           #  fringilla
    )                    # End of alternation
    \b                   # Word boundary
    /ix',
    '<a href="#">\1</a>', $txt);
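The pattern above hard-codes the word list. As a rough sketch, the same pattern could be assembled from $words and $wordToExclude at runtime. The function name linkWordsExcept is hypothetical, and the sketch assumes the excluded phrase ends with one of the words in the list, as it does in the question.

```php
<?php
// Hypothetical helper (not from the original answer): builds the same kind of
// pattern from the word list and the excluded phrase.
// Assumption: the excluded phrase ends with one of the listed words.
function linkWordsExcept(string $txt, array $words, string $wordToExclude): string
{
    // Split the excluded phrase into its last word and whatever precedes it.
    $parts  = preg_split('/\s+/', trim($wordToExclude));
    $last   = array_pop($parts);
    $prefix = implode(' ', $parts);

    $alternatives = [];
    foreach ($words as $word) {
        $quoted = preg_quote($word, '/');
        if ($prefix !== '' && strcasecmp($word, $last) === 0) {
            // PCRE lookbehinds must be fixed-length, so each space becomes a single \s.
            $guard  = '(?<!' . str_replace(' ', '\s', preg_quote($prefix, '/')) . '\s)';
            $quoted = $guard . $quoted;
        }
        $alternatives[] = $quoted;
    }

    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';
    return preg_replace($pattern, '<a href="#">$1</a>', $txt);
}
```

Calling linkWordsExcept($txt, $words, $wordToExclude) with the values from the question should link both occurrences of molestie, the one fringilla, and only the final adipiscing.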

Okie doke! While I think this is technically doable, the solution I have provided is kind of soft at this point:

s%(?!consectetur adipiscing)(adipiscing|molestie|fringilla)(?<!consectetur adipiscing)%<a href="#LinkBasedUpon$1">$1</a>%s

turns...

sit amet, consectetur adipiscing elit. Quisque... ligula sem molestie massa... nec arcu fringilla imperdiet... nec adipiscing velit.

into...

sit amet, consectetur adipiscing elit. Quisque... ligula sem <a href="#LinkBasedUponmolestie">molestie</a> massa... nec arcu <a href="#LinkBasedUponfringilla">fringilla</a> imperdiet... nec <a href="#LinkBasedUponadipiscing">adipiscing</a> velit.

The reason it is a soft solution is that it does not handle partial words, or other cases where the word(s) to exclude neither begin nor end with one of the words to be matched. For example, if we were to append to the excluded 'word' (i.e. consectetur adipiscing elit), this expression would end up matching the adipiscing in consectetur adipiscing elit, because adipiscing is neither the beginning nor the end of consectetur adipiscing elit.

It should work as long as your excluded 'word' (ABC) always ends or begins with one of the words to be found (C|X|E has a C in it, and ABC ends with the word C, so it should therefore work...)
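To make the limitation concrete, here is a minimal PHP sketch of the expression above with the excluded phrase made configurable. The function name softLink is hypothetical. The '%' delimiter is kept from the original, and since preg_replace replaces every match by default, no extra flag is used.

```php
<?php
// Hypothetical helper: the expression above, with the excluded phrase as a parameter.
function softLink(string $txt, string $excluded): string
{
    $ex = preg_quote($excluded, '%');
    $pattern = '%(?!' . $ex . ')(adipiscing|molestie|fringilla)(?<!' . $ex . ')%';
    return preg_replace($pattern, '<a href="#LinkBasedUpon$1">$1</a>', $txt);
}

$sample = 'amet, consectetur adipiscing elit.';

// The excluded phrase ends with a listed word: the lookbehind blocks the match.
echo softLink($sample, 'consectetur adipiscing'), "\n";
// amet, consectetur adipiscing elit.

// The excluded phrase is extended by "elit": neither lookaround fires, so the word is linked.
echo softLink($sample, 'consectetur adipiscing elit'), "\n";
// amet, consectetur <a href="#LinkBasedUponadipiscing">adipiscing</a> elit.
```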

EDIT {

The reason the 'not matched' words must begin or end with one of the matched words is that this solution uses a negative lookahead before the match and a negative lookbehind after the match, to ensure that the matched sequence is not part of the words that must not be matched (does that make sense?)

}

There are solutions to this, but they are intensive in processor time, programming effort, or both, and they get considerably more so depending on the size of the word lists, the length of the searched text, and the specific requirements. You never specified anything else, so I'm not gonna go into it at this point. Let me know if this is good enough for your situation!

I see you're doing it in PHP. I understand you have an ARRAY of words to find in a text and you need to replace those with links. You also have ONE string that needs to be excluded when doing the replacing. Instead of writing cool and clean yet complicated regular expressions, what about this practical, albeit probably not the nicest, solution:

You split the task into subtasks:

  1. use preg_match_all to find the offsets of all occurrences of the excluded string (you know the string length from strlen, and with the PREG_OFFSET_CAPTURE flag for preg_match_all you can work out the exact starts and ends, even if there is more than one occurrence)
  2. do a foreach on your word list and again use preg_match_all to get all occurrences of the words you need to replace with links
  3. compare the positions you found in step 2 with those found in step 1: if a word lies outside the excluded ranges, do the replace; if it overlaps one, skip it

It surely won't be a one-liner, but it would be quite easy to code and probably quite easy to read later too.
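The three steps could look roughly like the sketch below. The function name linkOutsideExcluded and its signature are hypothetical, matching is assumed to be case-insensitive, and offsets are byte offsets (which is what PREG_OFFSET_CAPTURE, strlen, and substr_replace all use).

```php
<?php
// Hypothetical helper sketching the three steps; name and signature are illustrative.
function linkOutsideExcluded(string $txt, array $words, string $excluded): string
{
    // Step 1: record the [start, end) range of every occurrence of the excluded string.
    $ranges = [];
    preg_match_all('/' . preg_quote($excluded, '/') . '/i', $txt, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[0] as [$match, $start]) {
        $ranges[] = [$start, $start + strlen($match)];
    }

    // Step 2: collect every occurrence of every word, keyed by its offset.
    $hits = [];
    foreach ($words as $word) {
        preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $txt, $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as [$match, $start]) {
            $hits[$start] = $match;
        }
    }

    // Step 3: replace from the end of the text backwards so earlier offsets stay valid,
    // skipping any hit that overlaps an excluded range.
    krsort($hits);
    foreach ($hits as $start => $match) {
        $end = $start + strlen($match);
        foreach ($ranges as [$from, $to]) {
            if ($start < $to && $end > $from) {
                continue 2; // overlaps the excluded string: leave it untouched
            }
        }
        $txt = substr_replace($txt, '<a href="#">' . $match . '</a>', $start, strlen($match));
    }
    return $txt;
}
```

Because the comparison is done on positions rather than inside the pattern, this version does not care whether the excluded string begins or ends with one of the words: linkOutsideExcluded($txt, $words, 'consectetur adipiscing elit') leaves that whole phrase alone as well.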
