简体   繁体   中英

Standard Regex vs python regex discrepancy

I am reading a book and they provide an example of how to match a given string with regular expressions. Here is their example:

b*(abb*)*(a|∊) - Strings of a's and b's with no consecutive a's.

Now I've tried converting it to python like so:

>> p = re.compile(r'b*(abb*)*(a|)') # OR
>> p = re.compile(r'b*(abb*)*(a|\b)')

# BUT it still doesn't work
>>> p.match('aa')
<_sre.SRE_Match object at 0x7fd9ad028c68>

My question is two-fold:

  1. What is the equivalent of epsilon in python to make the above example work?
  2. Can someone explain to me why theoretical or standard way of doing regular expressions does not work in python? Might it have something to do with the longest vs shortest matching?

Clarification: For people asking what standard regex is - it is the formal language theory standard: http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory

Actually, the example works just fine ... to a small details. I would write:

>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print m.group(0)
'a'
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print m.group(0)
abbabbabababbabbbbba

Note that the group 0 returns the part of the string matched by the regular expression.

As you can see, the expression matches a succession of a and b without repetition of a. If indeed, you want to check the whole string, you need to changed slightly:

>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print m
None

the ^ and $ force recognition of the beginning and end of the string.

At last, you can combine both methods by using the first regular expression, but testing at the end:

>>> len(m.group(0)) == len('aa')

Added: For the second part of the OT, it seems to me there is no discrepancy between the standard regex and the python implementation. Of course, the notation is slightly different, and the python implementation suggest some extensions (as most other packages).

Thanks for the answers. I feel each answer had part of the answer. Here is what I was looking for.

  1. ? symbol is just a shorthand for (something|ε) . Thus (a|ε) can be rewritten as a? . So the example becomes:

     b*(abb*)*a? 

    In python we would write:

     p = re.compile(r'^b*(abb*)*a?$') 
  2. The reason straight translation of regular regular expression syntax into python (ie copy and paste) does not work is because python matches the shortest substring (if the symbols $ or ^ are absent) while the theoretical regular expressions match longest initial substring .
    So for example if we had a string:

     s = 'aa' 

    Our textbook regex b*(abb*)*a? would not match it because it has two a's. However if we copy it straight to python:

     >> p = re.compile(r'b*(abb*)*a?') >> bool(p.match(s)) True 

    This is because our regex matches only the substring 'a' of our string 'aa'.
    In order to tell python to do a match on the whole string we have to tell it where the beginning and the end of the string is, with the ^ and $ symbols respectively:

     >> p = re.compile(r'^b*(abb*)*a?$') >> bool(p.match(s)) False 

    Note that python regex match() matches at the beginning of the string, so it automatically assumes the ^ at the start. However the search() function does not, and thus we keep the ^ .
    So for example:

     >> s = 'aa' >> p = re.compile(r'b*(abb*)*a?$') >> bool(p.match(s)) False # Correct >> bool(p.search(s)) True # Incorrect - search ignored the first 'a' 

1

  • Use bool(p.match('aa')) to check if the regexp matches or not

  • p = re.compile('b*(abb*)*a?$')

  • \\b matches border of string; place between \\w and \\W (word characters and non-word characters)

2

Regexp is quite standard in python. Yet every language has some flavour of them, they are not 100% portable. There are minor differences which you're expected to lookup prior to using regexp in any specific language.

Addition

\\epsilon does not have special symbol in python. It is an empty character set.

In your example a|\\epsilon is equivalent to (a|) or just a? . After which $ is obligatory to match end of string.

I'm not exactly sure how match works in python, but I think you might need to add ^....$ to your RE. RegExp matching usually matches sub-strings, and it finds the largest match, in the case of p.match('aa') that's "a" (probably the first one). ^...$ makes sure that you're matching the ENTIRE string, which is I believe what you want.

Theoretical/standard reg exps assume that you're always matching the whole string, because you're using it to define a language of strings that match, not find a substring in an input string.

You're matching because your regex matches any zero-width segment of any specimen text. You need to anchor your regex. Here's one way of doing it, using a zero-width lookahead assertion:

re.compile(r'^(a(?!a)|b)*$')

Your second re should be an appropriate replacement for epsilon, as best as I understand it, though I've never seen epsilon in a regex before.

For what it's worth, your pattern is matching 'a'. That is to say, it is matching:

  • zero or more " b "s (choosing zero)
  • zero or more " (abb*) "s (choosing zero)
  • one " a " or word ending (choosing an a).

As Jonathan Feinberg pointed out, if you want to ensure the whole string matches, you have to anchor the beginning ( '^' ) and end ( '$' ) of your regex. You should also use a raw string whenever constructing regexes in python: r'my regex'. That will prevent excessive backslash escaping confusion.

the problem with your expression is that it matches the empty string, meaning that if you do:

>>> p = re.compile('b*(abb*)*(a|)')
>>> p.match('c').group(0)
''

and since re.match attempts to match the start of the string, you have to tell it to match it until the end of the string. just use $ for that

>>> p = re.compile(r'b*(abb*)*(a|)$')
>>> print p.match('c')
None
>>> p.match('ababababab').group(0)
'ababababab'

ps- you may have noted that i used r'pattern' instead of 'pattern' more on that here (first paragraphs)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM