
Python regex - matching character sequences using prior matched characters

I wish to match strings such as "zxxz" and "vbbv", where a character is followed by a pair of identical characters that differ from the first, and then by the first character again. I therefore do not wish to match strings such as "zzzz" and "vvvv".

I started with the following Python regex, which matches all of those examples:

(.)(.)\2\1

In an attempt to exclude the second set ("zzzz", "vvvv"), I tried this modification:

(.)([^\1])\2\1

My reasoning is that the second group can contain any single character, provided it is not the same as the one matched by the first group.

Unfortunately this does not seem to work, as it still matches "zzzz" and "vvvv".

According to the Python 2.7.12 documentation:

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

(My emphasis added.)

I find the last sentence ambiguous, or at least unclear, because it suggests to me that the numeric escape should resolve to a single excluded character in the set, but this does not seem to happen.

Additionally, the following regex does not seem to work as I would expect either:

(.)[^\1][^\1][\1]

This does not seem to match either "zzzz" or "zxxz".

You want a negative lookahead assertion, (?!...), on \1 inside the second capture group. Then it will work:

r'(.)((?!\1).)\2\1'
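To spell out what each piece of that pattern does, here is the same expression written with re.VERBOSE. This is only a sketch: the names ABBA and find_abba are made up for illustration and do not come from the question.

```python
import re

# Same pattern as above, annotated piece by piece.
# ABBA and find_abba are illustrative names, not part of the original post.
ABBA = re.compile(r"""
    (.)        # group 1: any single character
    ((?!\1).)  # group 2: one character, allowed only if it differs from group 1
    \2         # the group 2 character repeated
    \1         # the group 1 character again
""", re.VERBOSE)


def find_abba(text):
    """Return every non-overlapping match of the pattern found in text."""
    return [m.group(0) for m in ABBA.finditer(text)]
```

With this helper, find_abba('zxxz vbbv zzzz') should return ['zxxz', 'vbbv']. The lookahead consumes nothing; it only vetoes the position where the next character would equal group 1, and the following dot then does the actual matching.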

Testing your examples:

>>> import re
>>> re.match(r'(.)((?!\1).)\2\1', 'zxxz')
<_sre.SRE_Match object at 0x109b661c8>
>>> re.match(r'(.)((?!\1).)\2\1', 'vbbv')
<_sre.SRE_Match object at 0x109b663e8>
>>> re.match(r'(.)((?!\1).)\2\1', 'zzzz') is None
True
>>> re.match(r'(.)((?!\1).)\2\1', 'vvvv') is None
True
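
As for why the attempts in the question fail: as I read the documentation sentence you quoted, inside [ and ] the sequence \1 is not a backreference at all but an octal escape, that is, the literal character with code 1. So [^\1] means "any character except chr(1)", which 'z' satisfies. A small check of that reading, assuming the standard re module:

```python
import re

# Inside a character class, \1 is the octal escape for chr(1), not group 1.
assert re.match(r'[\1]', '\x01') is not None
assert re.match(r'[\1]', '1') is None

# So [^\1] only excludes chr(1), and 'zzzz' still matches the first attempt.
assert re.match(r'(.)([^\1])\2\1', 'zzzz') is not None

# The second attempt needs a literal chr(1) as its fourth character.
assert re.match(r'(.)[^\1][^\1][\1]', 'zxxz') is None
assert re.match(r'(.)[^\1][^\1][\1]', 'zxx\x01') is not None
```

That is also why a lookahead is needed: a character class is fixed when the pattern is compiled, so it cannot refer to text captured during matching.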
