
Java regex intersection with backreference

I am trying to create a regex in Java to match the pattern of a particular word to find other words with the same pattern. For example, the word "tooth" has the pattern 12213 since both the 't' and 'o' repeat. I would want the regex to match other words like "teeth".
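To make the idea of a word's pattern concrete, here is a minimal sketch that numbers each letter by order of first appearance. The class and method names are hypothetical, and it assumes a word has at most nine distinct letters so that each position maps to a single digit.

```java
import java.util.HashMap;
import java.util.Map;

class LetterPattern {
    // Hypothetical helper: returns the pattern of a word, e.g. "tooth" -> "12213".
    // Assumes at most nine distinct letters, so every number is one digit.
    static String signature(String word) {
        Map<Character, Integer> firstSeen = new HashMap<>();
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            Integer n = firstSeen.get(c);
            if (n == null) {
                // First occurrence of this letter: assign the next number.
                n = firstSeen.size() + 1;
                firstSeen.put(c, n);
            }
            sb.append(n);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("tooth")); // prints 12213
        System.out.println(signature("teeth")); // prints 12213
        System.out.println(signature("tooto")); // prints 12212
    }
}
```

Two words have the same pattern exactly when their signatures are equal, which is the property the regex is meant to test.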

So here's my attempt using backreferences. In this particular example, it should fail if the second letter is the same as the first letter. Also, the last letter should be different from all the rest.

String regex = "([a-z])([a-z&&[^\1]])\\2\\1([a-z&&[^\1\2]])";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher("tooth");

//This works as expected
assertTrue(m.matches());

m.reset("tooto");
//This should return false, but instead returns true
assertFalse(m.matches());

I have verified that it works on examples like "toot" if I remove the last group, i.e. the following, so I know the backreferences are working up to this point:

String regex = "([a-z])([a-z&&[^\1]])\\2\\1";

But if I add back the last group to the end of the pattern, it's like it doesn't recognize the backreferences inside the square brackets anymore.

Am I doing something wrong, or is this a bug?

Try this:

(?i)\b(([a-z])(?!\2)([a-z])\3\2(?!\3)[a-z]+)\b

Explanation

(?i)           # Match the remainder of the regex with the options: case insensitive (i)
\b             # Assert position at a word boundary
(              # Match the regular expression below and capture its match into backreference number 1
   (              # Match the regular expression below and capture its match into backreference number 2
      [a-z]          # Match a single character in the range between “a” and “z”
   )
   (?!            # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
      \2             # Match the same text as most recently matched by capturing group number 2
   )
   (              # Match the regular expression below and capture its match into backreference number 3
      [a-z]          # Match a single character in the range between “a” and “z”
   )
   \3             # Match the same text as most recently matched by capturing group number 3
   \2             # Match the same text as most recently matched by capturing group number 2
   (?!            # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
      \3             # Match the same text as most recently matched by capturing group number 3
   )
   [a-z]          # Match a single character in the range between “a” and “z”
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b             # Assert position at a word boundary

Code

try {
    Pattern regex = Pattern.compile("(?i)\\b(([a-z])(?!\\2)([a-z])\\3\\2(?!\\3)[a-z]+)\\b");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)
        }
    } 
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
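As a usage sketch, the same pattern can be wrapped in a small helper that collects every matching word from a piece of text. The names PatternWordFinder and findWords are hypothetical. The regex itself is copied unchanged from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternWordFinder {
    // The regex from the answer above, unchanged.
    static final Pattern TOOTH_LIKE =
        Pattern.compile("(?i)\\b(([a-z])(?!\\2)([a-z])\\3\\2(?!\\3)[a-z]+)\\b");

    // Hypothetical helper: returns every whole word in the text that the regex matches.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = TOOTH_LIKE.matcher(text);
        while (m.find()) {
            words.add(m.group(1)); // group 1 wraps the whole word
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("The tooth and teeth of a toot")); // prints [tooth, teeth]
    }
}
```

Because the pattern ends in [a-z]+ and only compares the fifth letter with the second captured letter, it also accepts longer words such as "toothpaste" and words such as "toott". Replace the tail with a single [a-z] and add a lookahead for the first group if an exact 12213 match is required.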

Hope this helps.

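If you print your regex you get a clue as to what is wrong: inside a Java string literal, \1 and \2 written with a single backslash are octal escapes, so the compiler turns them into the control characters U+0001 and U+0002 before the regex engine ever sees them. The character classes therefore exclude nothing useful, and the pattern doesn't work as expected. For example: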

m.reset("oooot");
System.out.println(m.matches());

also prints

true
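The effect of the single backslash can be checked directly. The sketch below lists the code points of two string literals, one written with a single backslash and one with a double backslash. EscapeDemo and describe are hypothetical names, not part of the question's code.

```java
class EscapeDemo {
    // Hypothetical helper: renders each char of a string as U+XXXX.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Single backslash: an octal escape, one control character.
        System.out.println(describe("\1"));  // prints U+0001
        // Double backslash: a literal backslash followed by '1',
        // which the regex engine reads as a backreference.
        System.out.println(describe("\\1")); // prints U+005C U+0031
    }
}
```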

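Also, even with the escaping fixed, a backreference cannot be used inside a character class, so the && intersection cannot exclude a previously captured letter. You will have to use a negative lookahead instead. This expression works for your example above: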

String regex = "([a-z])(?!\\1)([a-z])\\2\\1(?!(\\1|\\2))[a-z]";

The expression (?!\\1) is a negative lookahead: it checks that the next character isn't the one captured by the first group, without moving the regex cursor forward.
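The same lookahead idea can be generalized to any template word. The following is a rough sketch, not a definitive implementation: WordPatternRegex and buildRegex are hypothetical names, and it assumes lowercase a-z words. Each new letter becomes a capturing group preceded by one negative lookahead per earlier group, and each repeated letter becomes a backreference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WordPatternRegex {
    // Hypothetical helper: builds a regex that matches lowercase words
    // sharing the letter pattern of the given template word.
    static String buildRegex(String word) {
        StringBuilder sb = new StringBuilder();
        List<Character> seen = new ArrayList<>(); // distinct letters in order of first use
        for (char c : word.toCharArray()) {
            int idx = seen.indexOf(c);
            if (idx >= 0) {
                // Repeated letter: refer back to the group that captured it.
                sb.append('\\').append(idx + 1);
            } else {
                // New letter: it must differ from every earlier group.
                for (int g = 1; g <= seen.size(); g++) {
                    sb.append("(?!\\").append(g).append(')');
                }
                sb.append("([a-z])");
                seen.add(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = buildRegex("tooth");
        System.out.println(regex); // prints ([a-z])(?!\1)([a-z])\2\1(?!\1)(?!\2)([a-z])
        System.out.println(Pattern.matches(regex, "teeth")); // prints true
        System.out.println(Pattern.matches(regex, "tooto")); // prints false
    }
}
```

Building the string at run time also sidesteps the escaping problem, since the backslash is appended as a character instead of being written inside a longer literal.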
