
Python regular expression: match any one of several regular expressions

I have a string and three patterns that I want to match, using the Python re package. Specifically, if any one of the patterns is found, output "Dislikes"; otherwise, output "Likes". Brief info about the three patterns:

pattern 1: check whether every character in the string is an uppercase letter

pattern 2: check whether two consecutive characters are the same, for example, AA , BB ...

pattern 3: check whether the pattern XYXY exists; X and Y can be the same letter, and the letters in this pattern do not need to be next to each other.

When I write the patterns separately, the program runs as expected. But when I combine the 3 patterns using alternation | , the result is wrong. I have checked other Stack Overflow posts, for example, here and here . The solutions provided there do not work for me.

Here is the original code that works fine:

import sys
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")

    word = sys.stdin.readline()
    word = word.rstrip('\n')
    if pattern1.search(word) or pattern2.search(word) or pattern3.search(word):
        print("Dislikes")
    else:
        print("Likes")

If I combine the 3 patterns into one using the following code, something goes wrong:

import sys
import re

if __name__ == "__main__":

    pattern = r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2|([A-Z])\1|[^A-Z]+"

    word = sys.stdin.readline()

    word = word.rstrip('\n')
    if re.search(word, pattern):
        print("Dislikes")
    else:
       print("Likes")

If we call the 3 patterns p1 , p2 , and p3 , I also tried the following combination:

pattern = r"(p1|p2|p3)"
pattern = r"(p1)|(p2)|(p3)"

But they do not work as expected either. What is the correct way to combine them?

Test cases:

  • "Likes": ABC , ABCD , A , ABCBA
  • "Dislikes": ABBC (pattern2), THETXH (pattern3), ABACADA (pattern3), AbCD (pattern1)

Here is a single pattern that joins yours:

([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)

So, why does it work?

It consists of a simple (p1|p2|p3) pattern, where p1 , p2 and p3 are those you defined before:

[^A-Z]+
([A-Z])\1
([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2

It can be decomposed as:

(
  [^A-Z]+
 |([A-Z])\2
 |([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4
)

The problem you were encountering is the numbering of the groups.

First off, when you combine p2 and p3 , both refer to \1 , but \1 denotes a different group in each of the two patterns. Therefore, p3 should become ...\2...\3 , since p2 now contributes an additional group before it.

Furthermore, the groups referred to by \number are numbered in the order in which their parentheses are opened. As a consequence, the very first parenthesis, which opens the outer (...|...|...) , counts as the first group, and \1 refers to it. This is not what you want, and it also raises an error, because \1 then refers to a group that has not been closed yet and is therefore not defined.
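The open-group error described above can be reproduced with a shortened two-branch pattern. This is only a minimal sketch: the helper name compiles and the shortened patterns are illustrative and not part of the original answer.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \1 sits inside the outer group it refers to, so that group is still open.
bad = r"([^A-Z]+|([A-Z])\1)"
# Shifting the index to \2 points at the inner ([A-Z]) group instead.
good = r"([^A-Z]+|([A-Z])\2)"

print(compiles(bad))   # False
print(compiles(good))  # True
```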

Therefore, the indices should be shifted by one more, becoming \2 , \3 and \4 .

Such A|B regexes are usually nested into parentheses, but the outer ones could actually be dropped, and the indices shifted back by one:

[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3
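If renumbering by hand feels error-prone, one possible alternative is to use Python's named groups, so each backreference stays tied to its own group however the branches are ordered. The sketch below assumes the group names dup , x and y and the helper verdict , none of which appear in the original answer; it checks the test cases listed in the question.

```python
import re

# Named groups: (?P<name>...) defines a group, (?P=name) refers back to it,
# so joining the alternatives does not require shifting any indices.
COMBINED = re.compile(
    r"[^A-Z]+"                     # p1: a run of non-uppercase characters
    r"|(?P<dup>[A-Z])(?P=dup)"     # p2: the same letter twice in a row
    r"|(?P<x>[A-Z])[A-Z]*(?P<y>[A-Z])[A-Z]*(?P=x)[A-Z]*(?P=y)"  # p3: X..Y..X..Y
)

def verdict(word):
    """Return "Dislikes" if any of the three patterns is found."""
    return "Dislikes" if COMBINED.search(word) else "Likes"

for w in ["ABC", "ABCD", "A", "ABCBA", "ABBC", "THETXH", "ABACADA", "AbCD"]:
    print(w, verdict(w))
```

The first four words should print "Likes" and the last four "Dislikes", matching the expected results from the question.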

Here is a small demonstration of this pattern:

import sys
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")    
    pattern = re.compile(r"([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)")

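    while True:
        try:
            word = input("> ")
        except (EOFError, KeyboardInterrupt):
            break  # stop on Ctrl-D / Ctrl-C instead of looping forever
        # Print the result of each separate pattern, then the combined one.
        print(pattern1.search(word))
        print(pattern2.search(word))
        print(pattern3.search(word))
        print(pattern.search(word))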

Interactive session:

> ABC    # Matches no pattern
None
None
None
None

> ABCBA  # Matches no pattern
None
None
None
None

> ABBC   # Matches p2
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # p2 is matched
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # The combined pattern gives the same match

> ABACADA # Matches p3
None
None
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # p3 is matched
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # The combined pattern gives the same match
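Separately from the group numbering, the combined script in the question calls re.search(word, pattern) , whereas re.search expects the pattern first and the string second. With the arguments swapped, the word is compiled as the regex and searched for inside the pattern text. A minimal sketch of the difference, using the parenthesis-free pattern from this answer (the names PATTERN and classify are illustrative):

```python
import re

PATTERN = r"[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3"

def classify(word):
    # re.search(pattern, string): the pattern comes first.
    return "Dislikes" if re.search(PATTERN, word) else "Likes"

word = "ABBC"
# Swapped arguments: "ABBC" is treated as the regex and looked up
# inside the pattern text, which does not contain it.
print(re.search(word, PATTERN))  # None
print(classify(word))            # Dislikes
```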
