
Regular expression misses match at beginning of string

I have strings consisting of a's and b's. I want to extract all overlapping subsequences, where a subsequence is a single a surrounded by any number of b's. This is the regex I wrote:

import re

pattern = """(?=            # inside lookahead for overlapping results
             (?:a|^)        # match at beginning of str or after a
             (b* (?:a) b*)  # one a between any number of bs
             (?:a|$))       # at end of str or before next a
          """
a_between_bs = re.compile(pattern, re.VERBOSE)

It seems to work as expected, except when the very first character in the string is an a, in which case the subsequence starting with that a is missed:

a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']

I don't understand what is happening. If I change the order of the alternatives that a potential match can start with, the results also change:

pattern = """(?=
             (?:^|a)        # a and ^ swapped
             (b* (?:a) b*)
             (?:a|$))
          """
a_between_bs = re.compile(pattern, re.VERBOSE)

a_between_bs.findall("abbabb")
# ['abb']

I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?

Edit:

I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphthong, preceded and followed by any number of consonants. This is my regular expression to extract them:

vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'

pattern = f"""(?=
          (?:[{vowels}]|^|{diphtongues})
          ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
          (?:[{vowels}]|$|{diphtongues})
          )
          """
syllables = re.compile(pattern, re.VERBOSE)

The tricky bit is that the diphthongs end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group with a double negative, (?<![{consonants}]), doesn't work. I tried instead to replace that group with a positive lookbehind, (?<=[{vowels}]|^|{diphtongues}), but re won't accept alternatives of different lengths (even removing the diphthongs doesn't work, because ^ has a different length from a vowel).
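The fixed-width restriction mentioned above can be reproduced in isolation. This is a minimal sketch, assuming CPython's standard re module as it behaves at the time of writing; the helper name compiles_ok is made up for illustration.

```python
import re

def compiles_ok(pattern):
    """Return True if the stdlib re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of different widths inside a lookbehind are rejected:
# 'a' is one character wide, '^' is zero characters wide.
mixed_width = compiles_ok(r"(?<=a|^)b")

# Alternatives that all have the same width are accepted.
same_width = compiles_ok(r"(?<=aj|aw|ej)b")

print(mixed_width, same_width)
```

So a lookbehind over the two-character diphthongs alone is fine; it is mixing them with one-character vowels or the zero-width ^ that re refuses.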

So this is the problematic case with the pattern above:

syllables.findall('æbə')
# ['bə'] 
# should be: ['æb', 'bə']

Edit 2: I've switched to using the regex module, which allows variable-width lookbehinds, and that solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:

I suggest fixing this with a double negation:

(?=         # inside lookahead for overlapping results
 (?<![^a])  # match at beginning of str or after a
 (b*ab*)    # one a between any number of bs
 (?![^a])   # at end of str or before next a
)

See the regex demo.

Note that I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here, because a lookbehind checks the preceding character without consuming it.
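For reference, here is the double-negation pattern as a runnable sketch with the standard re module, applied to the two strings from the question. The variable name fixed is only illustrative.

```python
import re

fixed = re.compile(
    r"""(?=          # lookahead, so results may overlap
        (?<![^a])    # start of string, or preceded by an a that is not consumed
        (b*ab*)      # one a between any number of bs
        (?![^a])     # end of string, or followed by an a
    )""",
    re.VERBOSE,
)

print(fixed.findall("abbabb"))   # ['abb', 'bbabb']
print(fixed.findall("bbabbba"))  # ['bbabbb', 'bbba']
```

The leading a in "abbabb" is now reported, and the second string gives the same result as before.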

The (?:a|^) at the beginning of the outer lookahead pattern matches an a or the start of the string, whichever alternative succeeds first. If an a is at the start, it is matched, so when the input is abbabb you get bbabb, since that matches the capturing group pattern and the end of the string follows right after. The next iteration starts after the first a and cannot find any match, since the only a left in the string has no a after its b's.
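One way to see this is to record where each match attempt starts and where the captured group starts. The sketch below uses the question's original pattern written on one line; the names original and trace are illustrative.

```python
import re

# The question's original pattern, without the verbose layout.
original = re.compile(r"(?=(?:a|^)(b*ab*)(?:a|$))")

# m.start() is the position where the lookahead was tried;
# m.start(1) is where the captured subsequence begins.
trace = [(m.start(), m.start(1), m.group(1)) for m in original.finditer("abbabb")]

print(trace)  # [(0, 1, 'bbabb')]
```

The only match is found at position 0, but the capture starts at position 1: the a at index 0 was taken by (?:a|^) and so could not become part of the captured group.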

Note that the order of the alternatives matters. If you change the group to (?:^|a), the match starts at the start of the string, b* matches the empty string, ab* grabs the first abb in abbabb, and since an a follows right after, you get abb as a match. There is no way to match anything after the first a.
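The effect of the ordering can be checked on the leading group alone. A small sketch, with illustrative variable names:

```python
import re

text = "abbabb"

# With a listed first, the a at index 0 is consumed: the match ends at 1.
a_first = re.match(r"(?:a|^)", text).end()

# With ^ listed first, the zero-width anchor succeeds first: the match ends
# at 0, so the a at index 0 is still available to the capturing group.
anchor_first = re.match(r"(?:^|a)", text).end()

print(a_first, anchor_first)  # 1 0
```

Both alternatives could match at position 0, and the engine takes the first one that works, which is why swapping them changes what is left for (b*ab*).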

Remember that Python "short-circuits" alternation: if the group matches "^", it is not going to continue looking to see whether it matches "a" too. Matching "a" "consumes" that character, so in cases where the group matches "a", the "a" is no longer available for the next group to match. Because the (?:) syntax is non-capturing, that "a" is "lost" and cannot be captured by the next group, (b*(?:a)b*), whereas when "^" is matched by the first group, nothing is consumed and the first "a" can match in the second group.
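Since the trouble comes from the leading group consuming a character, one possible way to handle the syllable pattern from the question's edit with the standard re module is to split the variable-width lookbehind into separate fixed-width pieces joined by an ordinary alternation. This is only a sketch: it relies on every listed diphthong being exactly two characters long, and it also turns the trailing group into a lookahead, which is an editorial choice rather than something from the answers above.

```python
import re

vowels = "æɑəɛiɪɔuʊʌ"
diphtongues = "|".join(("aj", "aw", "ej", "oj", "ow"))
consonants = "θwlmvhpɡŋszbkʃɹdnʒjtðf"

pattern = f"""(?=
    (?: ^                     # start of string, zero width
      | (?<=[{vowels}])       # or after a vowel, width 1
      | (?<={diphtongues})    # or after a diphthong, width 2 for all listed
    )
    ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
    (?=[{vowels}]|$|{diphtongues})
)"""
syllables = re.compile(pattern, re.VERBOSE)

print(syllables.findall("æbə"))  # ['æb', 'bə']
```

Each lookbehind has a single fixed width, so re accepts it, and nothing before the capturing group is consumed.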
