Regular expression misses match at beginning of string
I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounded by any number of bs. This is the regex I wrote:
import re
pattern = """(?= # inside lookahead for overlapping results
(?:a|^) # match at beginning of str or after a
(b* (?:a) b*) # one a between any number of bs
(?:a|$)) # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order in which a potential match can start, the results also change:
pattern = """(?=
(?:^|a) # a and ^ swapped
(b* (?:a) b*)
(?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphthong, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
(?:[{vowels}]|^|{diphtongues})
([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
(?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphthongs end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookbehind (?<=[{vowels}]|^|{diphtongues}), but the regex engine won't accept different lengths (even removing the diphthongs doesn't work, apparently ^ is of a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2: I've switched to using the regex module, which allows variable-width lookbehinds, which solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:
I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
See the regex demo.
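As a quick check, the double-negation pattern above can be dropped into the question's Python code. This is only a sketch that reuses the question's variable name a_between_bs and its two test strings; nothing else is assumed.

```python
import re

# Double-negation version: the lookarounds inspect the neighbouring
# characters without consuming them, so a leading a stays available.
pattern = r"""(?=       # outer lookahead allows overlapping results
    (?<![^a])           # at start of string, or preceded by a
    (b*ab*)             # one a between any number of bs
    (?![^a])            # at end of string, or followed by a
)"""
a_between_bs = re.compile(pattern, re.VERBOSE)

first = a_between_bs.findall("bbabbba")
second = a_between_bs.findall("abbabb")
print(first)   # ['bbabbb', 'bbba']
print(second)  # ['abb', 'bbabb']
```

The second call now returns the subsequence starting at the first a, which the original pattern missed.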
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern tries a first and falls back to the start of the string only if that fails. When the input is abbabb, the a at the start is matched and consumed by this group, and you get bbabb, since it matches the capturing group pattern and there is an end-of-string position right after. Once the lookahead has succeeded at a position, the engine moves on to the next position instead of trying the ^ alternative there. The following attempts cannot find any match: the only a left in the string would itself be consumed by (?:a|^), and there is no further a after the bs for the capturing group to match.
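The scan described above can be reproduced by trying the original pattern at each index in turn. This is a rough sketch: Pattern.match with a pos argument stands in for what findall does internally, and the names original, text and attempts are made up for the illustration.

```python
import re

# The question's original pattern, written on one line.
original = re.compile(r"(?=(?:a|^)(b*ab*)(?:a|$))")
text = "abbabb"

# Try the pattern at every index, roughly the way findall scans.
# ^ still only matches at the real start of the string, even with pos > 0.
attempts = []
for pos in range(len(text) + 1):
    m = original.match(text, pos)
    attempts.append((pos, m.group(1) if m else None))

for pos, captured in attempts:
    print(pos, captured)
```

Only index 0 yields a capture (bbabb); every later index, including index 3 where the second a sits, gives None.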
Note that the order of the alternatives matters. If you change it to (?:^|a), the match starts at the start of the string, b* matches the empty string, ab* grabs the first abb in abbabb, and since there is an a right after, you get abb as a match. There is no way to match anything after the first a: at the second a, ^ fails, so the group consumes that a and no further a follows the bs.
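To see the effect of the ordering side by side, one can record where the capture group starts for each variant. This is an illustrative sketch; the dictionary keys and the names orders and results are invented for the example.

```python
import re

text = "abbabb"
orders = {
    "a first": r"(?=(?:a|^)(b*ab*)(?:a|$))",
    "^ first": r"(?=(?:^|a)(b*ab*)(?:a|$))",
}

# For each ordering, record where the capture group starts and what it holds.
results = {
    name: [(m.start(1), m.group(1)) for m in re.finditer(pat, text)]
    for name, pat in orders.items()
}
print(results)
# {'a first': [(1, 'bbabb')], '^ first': [(0, 'abb')]}
```

In both cases the lookahead succeeds at index 0, but the capture starts at index 1 when a is tried first (the leading a was consumed) and at index 0 when ^ is tried first.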
Remember that the alternation "short-circuits": if it matches "^", it's not going to continue looking to see if it matches "a" too. Matching "a" "consumes" that character, so in cases where the first group matches "a", that "a" is not available for the next group to match, and because the (?:) syntax is non-capturing, that "a" is "lost" and cannot be captured by the next grouping (b*(?:a)b*). When "^" is matched by the first grouping instead, nothing is consumed, and that first "a" matches in the second grouping.
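The same consumption problem affects the syllable pattern from the question's edit. One possible way to stay within the standard re module, offered here as an untested-in-production sketch rather than a definitive answer, is to split the leading group into separate lookbehinds that are each fixed-width, and to keep ^ outside of them. The variable names are the question's own; the restructured leading group is the assumption.

```python
import re

vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'

# Each lookbehind has a single fixed width, which re accepts;
# none of them consumes the preceding vowel or diphthong.
pattern = f"""(?=
    (?:^                      # at start of string, or
       |(?<=[{vowels}])       # preceded by a vowel (width 1), or
       |(?<={diphtongues})    # preceded by a diphthong (width 2)
    )
    ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
    (?:[{vowels}]|$|{diphtongues})
)"""
syllables = re.compile(pattern, re.VERBOSE)

print(syllables.findall('æbə'))  # ['æb', 'bə']
```

Because the diphthong lookbehind only succeeds right after a complete diphthong, a syllable cannot start at the trailing j or w of one.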