
Regular expression misses match at beginning of string

I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounded by any number of bs. This is the regex I wrote:

import re

pattern = """(?=            # inside lookahead for overlapping results
             (?:a|^)        # match at beginning of str or after a
             (b* (?:a) b*)  # one a between any number of bs
             (?:a|$))       # at end of str or before next a
          """
a_between_bs = re.compile(pattern, re.VERBOSE)

It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:

a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']

I don't understand what is happening. If I change the order of how a potential match could start, the results also change:

pattern = """(?=
             (?:^|a)        # a and ^ swapped
             (b* (?:a) b*)
             (?:a|$))
          """
a_between_bs = re.compile(pattern, re.VERBOSE)

a_between_bs.findall("abbabb")
# ['abb']

I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?

Edit:

I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphthong, preceded and followed by any number of consonants. This is my regular expression to extract them:

vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'

pattern = f"""(?=
          (?:[{vowels}]|^|{diphtongues})
          ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
          (?:[{vowels}]|$|{diphtongues})
          )
          """
syllables = re.compile(pattern, re.VERBOSE)

The tricky bit is that the diphthongs end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookbehind (?<=[{vowels}]|^|{diphtongues}), but re won't accept alternatives of different lengths inside a lookbehind (even removing the diphthongs doesn't work; apparently ^ is of a different length).

So this is the problematic case with the pattern above:

syllables.findall('æbə')
# ['bə'] 
# should be: ['æb', 'bə']

Edit 2: I've switched to the third-party regex module, which allows variable-width lookbehinds, and that solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:

I suggest fixing this with a double negation:

(?=         # inside lookahead for overlapping results
 (?<![^a])  # match at beginning of str or after a
 (b*ab*)    # one a between any number of bs
 (?![^a])   # at end of str or before next a
)

See the regex demo

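Note that I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the former is essential here: a lookbehind checks the preceding character without consuming it, so a match that begins with the very first a can start at position 0, while the match that follows that a starts at position 1.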
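To check the suggestion against the strings from the question, here is a minimal sketch that compiles the pattern above with re.VERBOSE, the same way the question does. The variable name is reused from the question; nothing else is assumed.

```python
import re

# Double-negation version: both boundaries are zero-width lookarounds,
# so no character outside the capturing group is ever consumed.
a_between_bs = re.compile(r"""(?=
    (?<![^a])   # preceded by start of string or by an a
    (b*ab*)     # one a between any number of bs
    (?![^a])    # followed by end of string or by an a
)""", re.VERBOSE)

print(a_between_bs.findall("abbabb"))   # ['abb', 'bbabb']
print(a_between_bs.findall("bbabbba"))  # ['bbabbb', 'bbba']
```

The leading abb is now reported, and the string that already worked gives the same result as before.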

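The (?:a|^) at the beginning of the outer lookahead tries its alternatives from left to right: first a, then the start of the string. When the input is abbabb, the a alternative succeeds at position 0 and consumes the first a, so the capturing group starts after it and yields bbabb, which is followed by the end of the string. The engine reports at most one match per starting position, so abb, which would also have to start at position 0 (through the ^ alternative), is never returned. The search then resumes at position 1 and finds nothing more: at the second a, the a alternative consumes it, and there is no further a left for the capturing group.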

Note that the order of the alternatives matters. If you change the group to (?:^|a), ^ succeeds at position 0 without consuming anything, b* matches the empty string, ab* grabs the first abb in abbabb, and since an a comes right after, you get abb as the match. Position 0 is now used up, so bbabb, which needs the a alternative at that same position, is the match that gets lost, and no later position produces a match either.
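To make the role of the starting position visible, here is a small sketch that uses re.finditer to print where each zero-width match begins and which span the capturing group covers. The helper name trace is made up for this illustration, and the patterns are the two from the question written on one line.

```python
import re

def trace(pattern, text):
    """Return (match start, span of group 1, text of group 1) for every match."""
    return [(m.start(), m.span(1), m.group(1)) for m in re.finditer(pattern, text)]

a_first = r"(?=(?:a|^)(b*ab*)(?:a|$))"      # original order of alternatives
caret_first = r"(?=(?:^|a)(b*ab*)(?:a|$))"  # swapped order

print(trace(a_first, "abbabb"))      # [(0, (1, 6), 'bbabb')]
print(trace(caret_first, "abbabb"))  # [(0, (0, 3), 'abb')]
```

Both candidate results are anchored at position 0, and only one of them can win there. Which one wins depends on which alternative is tried first.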

Remember that alternation "short-circuits": if "^" matches and the rest of the pattern then succeeds, the engine is not going to go back and check whether "a" would match too. The two alternatives also differ in what they consume. When the first group matches "a", that character is consumed and is no longer available to the next group, (b*(?:a)b*), so it cannot show up in the captured text. "^" is zero-width and consumes nothing, so when the first group matches "^", that first "a" is still there for the second group to match.
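For the full syllable problem from the question's edit, one possible way to stay within the standard re module is to split the leading condition into separate pieces: a plain ^, a one-character lookbehind for vowels, and a two-character lookbehind for the diphthongs. re requires every alternative inside one lookbehind to have the same width, which holds here because all listed diphthongs are two characters long. This is a sketch built on that assumption, not something taken from the answers above. The input 'æbə' and the variable names come from the question.

```python
import re

vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'

pattern = f"""(?=
    # start of string, or just after a vowel, or just after a diphthong;
    # each lookbehind has a single fixed width, as re requires
    (?:^|(?<=[{vowels}])|(?<={diphtongues}))
    ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
    # end of string or next vowel/diphthong; lookaheads may vary in width
    (?=[{vowels}]|$|{diphtongues})
)"""
syllables = re.compile(pattern, re.VERBOSE)

print(syllables.findall('æbə'))  # ['æb', 'bə']
```

Because the boundary checks no longer consume anything, a syllable that starts with the first vowel and the syllable after that vowel begin at different positions and are both reported. If diphthongs of other lengths were added, each length would need its own lookbehind.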


 