
PHP Regex match words in a string excluding one specific word

I have a text ($txt), an array of words ($words) that I want to wrap in links, and a phrase ($wordToExclude) that must not be replaced.

$words = array ('adipiscing','molestie','fringilla');
$wordToExclude = 'consectetur adipiscing';


$txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque
mattis tincidunt dolor sed consequat. Sed rutrum, mauris convallis bibendum 
dignissim, ligula sem molestie massa, vitae condimentum neque sem non tellus.
Aenean dolor enim, cursus vel sodales ac, condimentum ac erat. Quisque
lobortis libero nec arcu fringilla imperdiet. Pellentesque commodo, 
arcu et dictum tincidunt, ipsum elit molestie ipsum, ut ultricies nisl
neque in velit. Curabitur luctus dui id urna consequat vitae mattis
turpis pretium. Donec nec adipiscing velit.';

I want to obtain this result:

$txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque
mattis tincidunt dolor sed consequat. Sed rutrum, mauris convallis bibendum 
dignissim, ligula sem <a href="#">molestie</a> massa, vitae condimentum neque sem non tellus.
Aenean dolor enim, cursus vel sodales ac, condimentum ac erat. Quisque
lobortis libero nec arcu <a href="#">fringilla</a> imperdiet. Pellentesque commodo, 
arcu et dictum tincidunt, ipsum elit <a href="#">molestie</a> ipsum, ut ultricies nisl
neque in velit. Curabitur luctus dui id urna consequat vitae mattis
turpis pretium. Donec nec <a href="#">adipiscing</a> velit.';

$result = preg_replace(
    '/\b                 # Word boundary
    (                    # Match one of the following:
     (?<!consectetur\s)  #  (unless preceded by "consectetur ")
     adipiscing          #  adipiscing
    |                    # or
     molestie            #  molestie
    |                    # etc.
     fringilla           #  fringilla
    )                    # End of alternation
    \b                   # Word boundary
    /ix',
    '<a href="#">\1</a>', $txt);
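
The pattern above hard-codes both the word list and the excluded phrase. As a rough sketch of how it could be built from the question's $words and $wordToExclude instead, the helper below (buildLinkPattern is a hypothetical name, not something from the answer) adds the negative lookbehind only to a word that the excluded phrase ends with. It assumes the excluded phrase ends with one of the listed words and does a case-sensitive comparison when checking that.

```php
<?php
// Hypothetical helper: builds a pattern equivalent to the hand-written one above.
function buildLinkPattern(array $words, string $wordToExclude): string
{
    $alternatives = array();
    foreach ($words as $word) {
        $alt = preg_quote($word, '/');
        $len = strlen($word);
        // If the excluded phrase ends with this word, forbid the part before it
        // with a (fixed-length) negative lookbehind.
        if (strlen($wordToExclude) > $len && substr($wordToExclude, -$len) === $word) {
            $prefix = preg_quote(substr($wordToExclude, 0, -$len), '/');
            // Let a space in the phrase match any single whitespace character.
            $alt = '(?<!' . str_replace(' ', '\s', $prefix) . ')' . $alt;
        }
        $alternatives[] = $alt;
    }
    return '/\b(' . implode('|', $alternatives) . ')\b/i';
}

$words = array('adipiscing', 'molestie', 'fringilla');
$wordToExclude = 'consectetur adipiscing';
$txt = 'sit amet, consectetur adipiscing elit. Donec nec adipiscing velit.';
echo preg_replace(buildLinkPattern($words, $wordToExclude), '<a href="#">\1</a>', $txt), "\n";
```

Like the original, this only protects a phrase that ends with one of the words to be linked.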

Okie doke! While I think this is technically doable, the solution I have so far is kind of soft:

s%(?!consectetur adipiscing)(adipiscing|molestie|fringilla)(?<!consectetur adipiscing)%<a href="#LinkBasedUpon$1">$1</a>%s

turns...

sit amet, consectetur adipiscing elit. Quisque... ligula sem molestie massa... nec arcu fringilla imperdiet... nec adipiscing velit.

into...

sit amet, consectetur adipiscing elit. Quisque... ligula sem <a href="#LinkBasedUponmolestie">molestie</a> massa... nec arcu <a href="#LinkBasedUponfringilla">fringilla</a> imperdiet... nec <a href="#LinkBasedUponadipiscing">adipiscing</a> velit.

The reason it is a soft solution is that it does not handle partial words, or any other case where the excluded phrase does not begin or end with one of the words to be matched. E.g., if we appended a word to the excluded phrase (i.e. consectetur adipiscing elit), this expression would end up matching the adipiscing in consectetur adipiscing elit, because that phrase no longer begins or ends with adipiscing.

It should work as long as your excluded 'word' (ABC) always begins or ends with one of the words to be found (C|X|E contains C, and ABC ends with the word C, so it should work).

EDIT {

The reason the excluded phrase must begin or end with one of the matched words is that this solution uses a negative lookahead before the match and a negative lookbehind after the match, to ensure that the matched sequence is not part of the phrase that must not be matched (does that make sense?)

}
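
For PHP, the substitution above could be rendered roughly as follows. This is only a sketch: linkWordsSoft is a hypothetical helper name, and the % delimiters and the s modifier are simply carried over from the expression above. The second call shows the limitation described earlier, where an excluded phrase that neither begins nor ends with a listed word is not protected.

```php
<?php
// Hypothetical helper: PHP version of the lookahead/lookbehind substitution.
function linkWordsSoft(string $text, string $exclude, array $words): string
{
    $ex = preg_quote($exclude, '%');
    $alt = implode('|', array_map(function ($w) {
        return preg_quote($w, '%');
    }, $words));
    // Negative lookahead before the match, negative lookbehind after it.
    $pattern = '%(?!' . $ex . ')(' . $alt . ')(?<!' . $ex . ')%s';
    return preg_replace($pattern, '<a href="#LinkBasedUpon$1">$1</a>', $text);
}

$words = array('adipiscing', 'molestie', 'fringilla');
$text = 'sit amet, consectetur adipiscing elit. nec adipiscing velit.';

// Excluded phrase ends with a listed word: its adipiscing is left alone.
echo linkWordsSoft($text, 'consectetur adipiscing', $words), "\n";

// Excluded phrase ends with "elit": its adipiscing gets linked anyway.
echo linkWordsSoft($text, 'consectetur adipiscing elit', $words), "\n";
```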

There are more robust solutions to this, but they cost more processing time, more programming effort, or both, and that cost grows quickly with the size of the word lists, the length of the searched text, and the specific requirements. You never specified anything else, so I'm not going to go into it at this point. Let me know if this is good enough for your situation!

I see you're doing it in PHP. As I understand it, you have an array of words to find in a text and replace with links, plus one string that must be excluded from the replacing. Instead of writing a cool and clean yet complicated regular expression, how about this practical, albeit probably not the nicest, solution:

You split the task into subtasks:

  1. Use preg_match_all with the PREG_OFFSET_CAPTURE flag to find the offsets of all occurrences of the excluded string. Since you know the string's length (strlen), this gives you the exact start and end of each occurrence, even if there is more than one.
  2. Loop over your word list with foreach and again use preg_match_all to get all occurrences of the words you need to replace with links.
  3. Compare the positions found in step 2 with those found in step 1: if a word lies outside every excluded range, do the replacement; if it overlaps one, skip it.

It surely won't be a one-liner, but it would be quite easy to code and probably quite easy to read later too.
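
The three steps might be sketched like this. The function name linkWordsOutsideExcluded is hypothetical, and the \b word boundaries and case-insensitive matching are assumptions borrowed from the regex answer rather than anything this answer specifies. Replacements are applied from the end of the string backwards so that the offsets collected earlier stay valid.

```php
<?php
// Hypothetical helper: links each word unless it overlaps the excluded string.
function linkWordsOutsideExcluded(string $txt, array $words, string $wordToExclude): string
{
    // Step 1: [start, end) byte ranges of every occurrence of the excluded string.
    $excluded = array();
    if ($wordToExclude !== '') {
        preg_match_all('/' . preg_quote($wordToExclude, '/') . '/i', $txt, $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as list($match, $offset)) {
            $excluded[] = array($offset, $offset + strlen($match));
        }
    }

    // Step 2: offsets of every whole-word occurrence of each word.
    $hits = array();
    foreach ($words as $word) {
        preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $txt, $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as list($match, $offset)) {
            // Step 3: skip any hit that overlaps an excluded range.
            $overlaps = false;
            foreach ($excluded as list($start, $end)) {
                if ($offset < $end && $offset + strlen($match) > $start) {
                    $overlaps = true;
                    break;
                }
            }
            if (!$overlaps) {
                $hits[$offset] = $match;
            }
        }
    }

    // Replace from the highest offset down so earlier offsets are not shifted.
    krsort($hits);
    foreach ($hits as $offset => $match) {
        $txt = substr_replace($txt, '<a href="#">' . $match . '</a>', $offset, strlen($match));
    }
    return $txt;
}
```

Because the check is a plain range overlap, this also covers an excluded phrase such as consectetur adipiscing elit, which does not begin or end with one of the listed words.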
