Python regex: greedy pattern returning multiple empty matches

This pattern is meant simply to grab everything in a string up until the first potential sentence boundary in the data:

[^\.?!\r\n]*

Output:

>>> import re
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print(matches)
['Australians go hard', '', '', '', '']

From the Python documentation:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
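As a small illustration of the return shapes described in that passage, here is a minimal sketch. The sample string "a1 b2" and the variable names are made up for the example and are not from the question.

```python
import re

text = "a1 b2"  # hypothetical sample input

# No group: findall returns the whole matches.
whole = re.findall(r"[a-z]\d", text)

# One group: findall returns only what the group captured.
single = re.findall(r"([a-z])\d", text)

# Two groups: findall returns a list of tuples, one tuple per match.
pairs = re.findall(r"([a-z])(\d)", text)

print(whole)   # ['a1', 'b2']
print(single)  # ['a', 'b']
print(pairs)   # [('a', '1'), ('b', '2')]
```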

Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern is producing an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string; I just don't see how it would be doing that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.

Adding the ^ anchor has this effect:

>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print(matches)
['Australians go hard']

Since this eliminates the empty string matches, it would seem to indicate that said empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation with respect to the matches being returned in the order found (matches before the leading "A" should have been first) and, again, exactly four empty matches baffles me.

The * quantifier allows the pattern to capture a substring of length zero. In your original code version (without the ^ anchor in front), the additional matches are:

  • the zero-length string between the end of hard and the first !
  • the zero-length string between the first and second !
  • the zero-length string between the second and third !
  • the zero-length string between the third ! and the end of the text
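One way to check where each match lands is to use finditer instead of findall and print the start and end offsets. This is only a sketch; the variable names text and spans are mine, while the pattern and input string are the ones from the question.

```python
import re

pattern = re.compile(r"([^\.?!\r\n]*)")
text = "Australians go hard!!!"

# Collect (start, end, captured text) for every match, in the order found.
spans = [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]

for start, end, value in spans:
    print(start, end, repr(value))
# 0 19 'Australians go hard'
# 19 19 ''
# 20 20 ''
# 21 21 ''
# 22 22 ''
```

The empty matches sit at offsets 19, 20, 21 and 22, that is, in front of each "!" and at the very end of the string, not before the leading "A". At each of those positions the character class cannot consume "!", so the pattern succeeds with zero characters and the scan moves one position to the right.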

You can slice/dice this further if you like here.

Adding that ^ anchor to the front now ensures that only a single substring can match the pattern, since the beginning of the input text occurs exactly once.
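A quick sketch of the anchored case, again with finditer so the offsets are visible. The names anchored, anchored_spans and first are made up for the example. The re.match call is an alternative I am suggesting rather than something from the question: it only ever tries to match at the start of the string, so it gives the same single result without needing ^.

```python
import re

text = "Australians go hard!!!"

# With ^, a match can only begin at offset 0 (re.MULTILINE is not set here).
anchored = re.compile(r"^([^\.?!\r\n]*)")
anchored_spans = [m.span() for m in anchored.finditer(text)]
print(anchored_spans)  # [(0, 19)]

# Alternative: re.match is implicitly anchored at the start of the string.
first = re.match(r"[^\.?!\r\n]*", text).group(0)
print(first)  # Australians go hard
```

If only the first chunk is needed, asking for a single match this way avoids scanning the rest of the string for further (empty) matches at all.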
