Python regular expression: match either one of several regular expressions

I have a string and three patterns that I want to match, using Python's re package. Specifically, if any one of the patterns is found, output "Dislikes"; otherwise, output "Likes". Brief information about the three patterns:

pattern 1: check whether every character in the string is an uppercase letter (the pattern matches when some character is not)

pattern 2: check whether two consecutive characters are the same, for example AA, BB, ...

pattern 3: check whether the pattern XYXY exists; X and Y can be the same, and the letters in this pattern do not need to be next to each other.

When I write the patterns separately, the program runs as expected. But when I combine the 3 patterns using alternation |, the result is wrong. I have checked related Stack Overflow posts (for example, here and here), but the solutions provided there do not work for me.

Here is the original code that works fine:

import sys
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")

    word = sys.stdin.readline()
    word = word.rstrip('\n')
    if pattern1.search(word) or pattern2.search(word) or pattern3.search(word):
        print("Dislikes")
    else:
        print("Likes")

If I combine the 3 patterns into one using the following code, something goes wrong:

import sys
import re

if __name__ == "__main__":

    pattern = r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2|([A-Z])\1|[^A-Z]+"

    word = sys.stdin.readline()

    word = word.rstrip('\n')
    if re.search(word, pattern):
        print("Dislikes")
    else:
       print("Likes")

If we call the 3 patterns p1, p2, and p3, I also tried the following combinations:

pattern = r"(p1|p2|p3)"
pattern = r"(p1)|(p2)|(p3)"

But they also do not work as expected. What is the correct way to combine them?

Test cases:

  • "Likes": ABC, ABCD, A, ABCBA
  • "Dislikes": ABBC (pattern 2), THETXH (pattern 3), ABACADA (pattern 3), AbCD (pattern 1)

Here is a single pattern that joins yours:

([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)

So, why does it work?

It consists of a simple (p1|p2|p3) pattern, where p1, p2 and p3 are those you defined before:

[^A-Z]+
([A-Z])\1
([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2

It can be decomposed as:

(
  [^A-Z]+
 |([A-Z])\2
 |([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4
)

The problem you were encountering is the numbering of the groups.

First off, when you combine p2 and p3, both refer to \1, but \1 designates a different group in each pattern. Therefore, in p2|p3 the backreferences of p3 should become ...\2...\3, since p2 contributes an additional group before them.

Furthermore, the groups referred to by \number are numbered in the order in which their opening parentheses appear. As a consequence, the very first parenthesis, which opens the outer (...|...|...), counts as the first group, and \1 refers to it. Of course, this is not what you want. In addition, it raises an error, because \1 then refers to a group that has not been closed yet and is thus not defined.

Therefore, the indices should be shifted by one more, becoming \2, \3 and \4.
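The numbering rule can be checked with a minimal sketch. The helper name `try_compile` is made up for illustration. It wraps p2 in an outer group, first leaving the backreference as `\1` (which Python rejects at compile time), then shifting it to `\2`:

```python
import re

def try_compile(pattern):
    """Return the compiled pattern, or None if Python rejects it."""
    try:
        return re.compile(pattern)
    except re.error as error:
        print(f"{pattern!r} rejected: {error}")
        return None

# \1 points at the outer group, which is still open at that point.
broken = try_compile(r"(([A-Z])\1)")

# Shifting the backreference to \2 targets the inner group instead.
fixed = try_compile(r"(([A-Z])\2)")

print(broken is None)                   # True
print(fixed.search("ABBC").group(0))    # BB
```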

Such A|B regexes are usually wrapped in parentheses, but the outer ones can actually be dropped, with the indices shifted back by one:

[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3
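As a quick check, the sketch below runs this un-parenthesized pattern against the test cases from the question. The names `COMBINED` and `verdict` are hypothetical. Note that `re.search` takes the pattern first and the string second; the combined attempt in the question calls `re.search(word, pattern)`, which searches the pattern text for the word rather than the reverse.

```python
import re

# Combined pattern without the outer parentheses: p1 | p2 | p3.
COMBINED = re.compile(r"[^A-Z]+|([A-Z])\1|([A-Z])[A-Z]*([A-Z])[A-Z]*\2[A-Z]*\3")

def verdict(word):
    """Return "Dislikes" if any of the three patterns is found, else "Likes"."""
    # The compiled pattern does the searching; the word is the string scanned.
    return "Dislikes" if COMBINED.search(word) else "Likes"

for word in ["ABC", "ABCD", "A", "ABCBA", "ABBC", "THETXH", "ABACADA", "AbCD"]:
    print(word, verdict(word))
```

The first four words should print "Likes" and the last four "Dislikes", matching the expectations listed in the question.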

Here is a small demonstration of this pattern:

import sys
import re

if __name__ == "__main__":
    pattern1 = re.compile(r"[^A-Z]+")
    pattern2 = re.compile(r"([A-Z])\1")
    pattern3 = re.compile(r"([A-Z])[A-Z]*([A-Z])[A-Z]*\1[A-Z]*\2")    
    pattern = re.compile(r"([^A-Z]+|([A-Z])\2|([A-Z])[A-Z]*([A-Z])[A-Z]*\3[A-Z]*\4)")

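    while True:
        try:
            word = input("> ")
        except EOFError:
            # End of input: leave the loop instead of reporting the error forever.
            break
        # Print each individual result, then the combined one, for comparison.
        print(pattern1.search(word))
        print(pattern2.search(word))
        print(pattern3.search(word))
        print(pattern.search(word))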

Interactive session:

> ABC    # Matches no pattern
None
None
None
None

> ABCBA  # Matches no pattern
None
None
None
None

> ABBC   # Matches p2
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # p2 is matched
None
<_sre.SRE_Match object; span=(1, 3), match='BB'> # The combined pattern gives the same match

> ABACADA # Matches p3
None
None
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # p3 is matched
<_sre.SRE_Match object; span=(0, 7), match='ABACADA'> # The combined pattern gives the same match
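If renumbering backreferences by hand feels error-prone, one alternative that the answer above does not use is Python's named groups, `(?P<name>...)` with `(?P=name)` as the backreference. The sketch below assumes that approach; the group names `pair`, `x` and `y` are made up for illustration. Because each sub-pattern refers to its own groups by name, the pieces can be joined with `|` in any order without touching their text:

```python
import re

# Each sub-pattern refers to its own groups by name, so its text does not
# depend on where it ends up inside the combined pattern.
P1 = r"[^A-Z]+"
P2 = r"(?P<pair>[A-Z])(?P=pair)"
P3 = r"(?P<x>[A-Z])[A-Z]*(?P<y>[A-Z])[A-Z]*(?P=x)[A-Z]*(?P=y)"

combined = re.compile("|".join([P1, P2, P3]))

print(combined.search("ABBC").group(0))     # BB
print(combined.search("THETXH").group(0))   # THETXH
print(combined.search("ABCBA"))             # None
```

The only constraint is that a group name may be used once per pattern, so each sub-pattern needs its own names.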
