
Negative Lookahead to match a string unless it appears in specific words.

I'm trying to find a way to determine if a line contains a particular string, while at the same time not matching if it occurs in certain words. I have this partially working, however it fails if one of the exclude words begins with the keyword.

So for example, this regex: ^((?!custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic).)*(tom)

will successfully exclude all the words listed, with the exception of tomcat and tomorrow. I'm assuming this is because I am matching the keyword, so the lookahead is failing, but I'm not sure how to fix it.
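
To make the failure concrete, here is a minimal Java sketch that runs the pattern above against a few sample lines. The class name `LookaheadPitfallDemo`, the helper `matchesLine`, and the sample lines are illustrative assumptions, not part of the original question. The lookahead only stops the repeated `.` from stepping onto the start of an excluded word; nothing prevents `(tom)` from then matching at that very position, which is what happens with tomcat and tomorrow.

```java
import java.util.regex.Pattern;

class LookaheadPitfallDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "^((?!custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic).)*(tom)");

    // Hypothetical helper: true if the pattern finds "tom" in the line.
    static boolean matchesLine(String line) {
        return ORIGINAL.matcher(line).find();
    }

    public static void main(String[] args) {
        // Sample lines chosen for illustration only.
        String[] samples = {
            "tom is here",       // wanted match
            "a custom build",    // correctly rejected
            "my tomcat server",  // should be rejected, but matches
            "see you tomorrow"   // should be rejected, but matches
        };
        for (String line : samples) {
            System.out.println(line + " -> " + matchesLine(line));
        }
    }
}
```

In the two failing lines the repetition halts just before the excluded word, and `(tom)` then matches its first three letters.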

Update: sadly, the only way I have found to make this work is to put the negative lookahead on both sides of the . in the non-capturing group:

^(?:(?!custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic).(?!custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic))*?(tom).*



It works if you move the . before your negative lookahead: .(?!...)

I would also make the * repetition lazy, so it doesn't need to backtrack as much (this is not always true, but it is in this example). Also, if you want to match the entire line and only capture the instance of tom, make the group containing .(?!...) non-capturing and finish the expression with a greedy .*:

^(?:.(?!custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic))*?(tom).*
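
As a rough check of this pattern, the following Java sketch applies it to a few lines and returns the Group 1 capture. The class name `DotBeforeLookaheadDemo`, the helper `captureTom`, and the sample lines are assumptions made for illustration. The samples deliberately keep the excluded words away from the very start of the line, because here the lookahead is only tested after a character has been consumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotBeforeLookaheadDemo {
    // The pattern from this answer: dot first, then the negative lookahead,
    // lazy repetition, and a trailing greedy .* to consume the rest of the line.
    static final Pattern LINE = Pattern.compile(
        "^(?:.(?!custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic))*?(tom).*");

    // Hypothetical helper: returns the Group 1 capture, or null if the line does not match.
    static String captureTom(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample lines chosen for illustration only.
        String[] samples = {
            "call tom now",
            "my tomcat server",
            "see you tomorrow",
            "a custom build"
        };
        for (String line : samples) {
            System.out.println(line + " -> " + captureTom(line));
        }
    }
}
```

Only the first sample line yields a capture; the other three are rejected because the character just before the excluded word fails the lookahead.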


This situation sounds straight out of "Match (or replace) a pattern except in situations s1, s2, s3, etc."

Compared with other potential solutions, the regex couldn't be simpler:

custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic|(tom)

If you want to show not just tom but the whole word it is in, such as tomahawk, change this to:

custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic|(\w*tom\w*)

The left side of the alternation matches the words you don't want. We will ignore these matches. The right side matches and captures tom to Group 1, and we know these are the right occurrences of tom because they were not matched by the expressions on the left.

This program shows how to use the regex (see the results at the bottom of the online demo). It finds tom and tomahawk.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "custom onetomany manytomany atom tomcat tomorrow automatic tom tomahawk";
        // The left alternatives consume the excluded words; only the last alternative captures.
        Pattern regex = Pattern.compile("custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic|(\\w*tom\\w*)");
        Matcher regexMatcher = regex.matcher(subject);
        List<String> group1Caps = new ArrayList<>();

        // Put Group 1 captures in a list; matches on excluded words leave Group 1 null.
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                group1Caps.add(regexMatcher.group(1));
            }
        }

        System.out.println("\n*** Matches ***");
        for (String match : group1Caps) {
            System.out.println(match);
        }
    }
}

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

I think this is what you're after:

\b(?!(?:custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic)\b)[a-z]*tom[a-z]*\b

I used a word boundary ( \b ) instead of the anchor ( ^ ) so it will find the word anywhere, not just at the beginning. Adding another \b to the end ensures that it only matches complete words.

The \b at the end of the lookahead subexpression does the same for the filtered words. For example, it won't match automatic, but it will match automatically.

Once the lookahead passes, [a-z]*tom[a-z]*\b matches a word (or more accurately, a continuous sequence of letters) that contains tom. I'm making a lot of simplifying assumptions so I can concentrate on the technique. Most importantly, if your "words" can contain non-word characters like hyphens ( - ) or apostrophes ( ' ), [a-z]* and \b might not be good enough.
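
A small Java sketch of this word-boundary approach follows. The class name `WordBoundaryDemo`, the helper `findWords`, and the sample text are assumptions for illustration; the pattern is the one given above, with the backslashes doubled only because it sits inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // \b, then a lookahead that rejects the excluded words only when they are
    // complete words, then a run of lowercase letters containing "tom".
    static final Pattern WORD = Pattern.compile(
        "\\b(?!(?:custom|onetomany|manytomany|atom|tomcat|tomorrow|automatic)\\b)[a-z]*tom[a-z]*\\b");

    // Hypothetical helper: collects every whole word that the pattern accepts.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample text chosen for illustration only.
        String text = "custom onetomany manytomany atom tomcat tomorrow automatic tom tomahawk automatically";
        for (String word : findWords(text)) {
            System.out.println(word);
        }
    }
}
```

With this sample text the sketch should report tom, tomahawk, and automatically: the last one gets through because the \b inside the lookahead requires automatic to be a complete word.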
