
Standard Regex vs Python regex discrepancy

I am reading a book and they provide an example of how to match a given string with regular expressions. Here is their example:

b*(abb*)*(a|∊) - Strings of a's and b's with no consecutive a's.

Now I've tried converting it to Python like so:

>>> import re
>>> p = re.compile(r'b*(abb*)*(a|)') # OR
>>> p = re.compile(r'b*(abb*)*(a|\b)')

# BUT it still doesn't work
>>> p.match('aa')
<_sre.SRE_Match object at 0x7fd9ad028c68>

My question is two-fold:

  1. What is the equivalent of epsilon in Python to make the above example work?
  2. Can someone explain to me why the theoretical or standard way of writing regular expressions does not work in Python? Might it have something to do with longest vs shortest matching?

Clarification: For people asking what standard regex is - it is the formal language theory standard: http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory

Actually, the example works just fine, apart from one small detail. I would write:

>>> import re
>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print(m.group(0))
a
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print(m.group(0))
abbabbabababbabbbbba

Note that group 0 returns the part of the string matched by the regular expression.

As you can see, the expression matches a succession of a's and b's with no consecutive a's. If you do want to check the whole string, you need to change it slightly:

>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print(m)
None

The ^ and $ anchors force the match to span from the beginning to the end of the string.

Finally, you can combine both methods by using the first regular expression, but testing the length of the match at the end:

>>> m = re.match('b*(abb*)*a?', 'aa')
>>> len(m.group(0)) == len('aa')
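
The length comparison can also be expressed with `re.fullmatch`, which exists in Python 3.4 and later. The transcripts in this thread look like Python 2, so treat this as a sketch that assumes a newer interpreter; the helper name `is_whole_match` is made up for illustration.

```python
import re

# Unanchored textbook pattern: strings of a's and b's with no consecutive a's.
PATTERN = re.compile(r'b*(abb*)*a?')

def is_whole_match(s):
    """Return True only if the pattern consumes the entire string."""
    return PATTERN.fullmatch(s) is not None

print(is_whole_match('aa'))     # False
print(is_whole_match('abbab'))  # True
```

With `fullmatch` the pattern itself needs no ^ or $, which keeps it identical to the textbook form.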

Added: For the second part of the question, it seems to me there is no discrepancy between standard regexes and the Python implementation. Of course, the notation is slightly different, and the Python implementation offers some extensions (as most other packages do).

Thanks for the answers. I feel each answer had part of the solution. Here is what I was looking for.

  1. The ? symbol is just shorthand for (something|ε). Thus (a|ε) can be rewritten as a?. So the example becomes:

     b*(abb*)*a? 

    In Python we would write:

     p = re.compile(r'^b*(abb*)*a?$') 
  2. The reason a straight translation of textbook regular expression syntax into Python (i.e. copy and paste) does not work is that Python's match() succeeds as soon as the pattern matches some prefix of the string (if the $ symbol is absent), while a theoretical regular expression must match the entire string.
    So for example if we had a string:

     s = 'aa' 

    Our textbook regex b*(abb*)*a? would not match it because it has two consecutive a's. However if we copy it straight to Python:

     >>> p = re.compile(r'b*(abb*)*a?')
     >>> bool(p.match(s))
     True

    This is because our regex matches only the substring 'a' of our string 'aa'.
    In order to tell Python to match the whole string, we have to tell it where the beginning and the end of the string are, with the ^ and $ symbols respectively:

     >>> p = re.compile(r'^b*(abb*)*a?$')
     >>> bool(p.match(s))
     False

    Note that match() only matches at the beginning of the string, so it effectively assumes the ^ at the start. However, the search() function does not, and thus we keep the ^.
    So for example:

     >>> s = 'aa'
     >>> p = re.compile(r'b*(abb*)*a?$')
     >>> bool(p.match(s))   # Correct
     False
     >>> bool(p.search(s))  # Incorrect - search ignored the first 'a'
     True
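
One caveat about anchoring with $ is worth a small sketch: in Python, $ also matches just before a newline at the very end of the string, while \Z matches only at the true end. The variable names `dollar` and `strict` are chosen here for illustration, and whether the difference matters depends on where your input comes from.

```python
import re

s = 'ab\n'  # a valid string followed by a trailing newline

dollar = re.compile(r'^b*(abb*)*a?$')
strict = re.compile(r'^b*(abb*)*a?\Z')

print(bool(dollar.match(s)))  # True: $ also matches just before a final newline
print(bool(strict.match(s)))  # False: \Z matches only at the very end
```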


  • Use bool(p.match('aa')) to check whether the regexp matches or not

  • p = re.compile('b*(abb*)*a?$')

  • \b matches a word boundary: the position between \w and \W (word characters and non-word characters)
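
To see why \b is not a drop-in replacement for ε, here is a small comparison. It is only a sketch, uses `re.fullmatch` (Python 3.4+), and the names `with_empty` and `with_boundary` are illustrative. The empty string belongs to the textbook language, but \b needs at least one word character next to it, so it cannot match there.

```python
import re

with_empty = re.compile(r'b*(abb*)*(a|)')
with_boundary = re.compile(r'b*(abb*)*(a|\b)')

# The empty string: accepted by (a|), rejected by (a|\b).
print(with_empty.fullmatch('') is not None)     # True
print(with_boundary.fullmatch('') is not None)  # False

# On these non-empty inputs the two patterns agree.
print(with_boundary.fullmatch('ab') is not None)  # True
print(with_boundary.fullmatch('aa') is not None)  # False
```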


Regexps are quite standard in Python. Yet every language has its own flavour of them, so they are not 100% portable. There are minor differences which you're expected to look up before using regexps in any specific language.

Addition

\epsilon does not have a special symbol in Python. It is simply the empty string.

In your example, a|\epsilon is equivalent to (a|) or just a?, after which $ is required to match the end of the string.

I'm not exactly sure how match works in Python, but I think you might need to add ^....$ to your RE. RegExp matching usually matches sub-strings, and it finds the largest match; in the case of p.match('aa') that's "a" (probably the first one). ^...$ makes sure that you're matching the ENTIRE string, which I believe is what you want.

Theoretical/standard regexps assume that you're always matching the whole string, because you're using them to define a language of strings that match, not to find a substring in an input string.

You're matching because your regex matches any zero-width segment of any specimen text. You need to anchor your regex. Here's one way of doing it, using a zero-width lookahead assertion:

re.compile(r'^(a(?!a)|b)*$')
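
As a sanity check, a brute-force comparison over all short strings of a's and b's suggests that the lookahead pattern and the textbook pattern (matched against the whole string) accept the same language. This is a sketch, not a proof; the helper `all_strings` and the length bound of 8 are arbitrary choices made here.

```python
import re
from itertools import product

textbook = re.compile(r'b*(abb*)*a?')
lookahead = re.compile(r'^(a(?!a)|b)*$')

def all_strings(max_len):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product('ab', repeat=n):
            yield ''.join(chars)

# Collect every string on which the two patterns and the plain
# "no consecutive a's" test disagree.
mismatches = [
    s for s in all_strings(8)
    if not ((textbook.fullmatch(s) is not None)
            == (lookahead.match(s) is not None)
            == ('aa' not in s))
]
print(mismatches)  # []
```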

Your second re should be an appropriate replacement for epsilon, as best as I understand it, though I've never seen epsilon in a regex before.

For what it's worth, your pattern is matching 'a'. That is to say, it is matching:

  • zero or more "b"s (choosing zero)
  • zero or more "(abb*)"s (choosing zero)
  • one "a" or word ending (choosing an a).
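
The three choices listed above can be inspected on the match object itself. A minimal sketch using the second pattern from the question:

```python
import re

p = re.compile(r'b*(abb*)*(a|\b)')
m = p.match('aa')

print(m.span())    # (0, 1): only the first character was consumed
print(m.group(1))  # None: the (abb*) group repeated zero times
print(m.group(2))  # a: the last group chose the 'a' alternative
```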

As Jonathan Feinberg pointed out, if you want to ensure the whole string matches, you have to anchor the beginning ('^') and end ('$') of your regex. You should also use a raw string whenever constructing regexes in Python: r'my regex'. That will prevent a lot of backslash-escaping confusion.
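
A short illustration of why the raw string matters for the \b attempt in the question: without the r prefix, Python turns '\b' into a single backspace character before the regex engine ever sees it. The variable names `plain` and `raw` are illustrative.

```python
import re

plain = '\b'   # one character: ASCII backspace
raw = r'\b'    # two characters: backslash and 'b', the regex word boundary

print(len(plain), len(raw))                # 1 2
print(re.search(plain, 'a b'))             # None: no backspace in the text
print(re.search(raw, 'a b') is not None)   # True: a word boundary exists
```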

The problem with your expression is that it matches the empty string, meaning that if you do:

>>> p = re.compile('b*(abb*)*(a|)')
>>> p.match('c').group(0)
''

And since re.match only anchors the match at the start of the string, you have to tell it to match all the way to the end of the string. Just use $ for that:

>>> p = re.compile(r'b*(abb*)*(a|)$')
>>> print(p.match('c'))
None
>>> p.match('ababababab').group(0)
'ababababab'

P.S. You may have noticed that I used r'pattern' instead of 'pattern'; more on that here (first paragraphs).


 