Confusing behavior of regular expressions in Python

Question

I'm trying to match a specific pattern using the re module in Python. I want to match a full sentence (more precisely, sequences of alphanumeric strings separated by spaces and/or punctuation).

E.g.:

"This is a regular sentence."
"this is also valid"
"so is This ONE"

I've tried various combinations of regular expressions, but I am unable to grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).

I tried:

"((\\w+)(\\s?))*"
To the best of my knowledge, this should greedily match one or more alphanumeric characters followed by at most one whitespace character, and then greedily repeat that whole pattern. That is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.) The result I get for the first sample string above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE. In that case, is the space alone the preceding RE for '?', or is '\\w+ ' the preceding RE? And what is the preceding RE for the '*' operator? The output I get with this is ['sentence'].
Others, such as "(\\w+\\s?)+)" and "((\\w*)(\\s??))", etc., are basically variations on the same idea: a sentence is a set of alphanumerics followed by a single/finite number of whitespace characters, and this pattern is repeated over and over.

Can someone tell me where I went wrong, and why the above expressions do not work the way I expected them to?

P.S. I eventually got "[ \\w]+" to work for me, but with it I cannot limit the number of consecutive whitespace characters.

Answer 1

Your reasoning about the regex is correct; your problem comes from using capturing groups with *. Here's an alternative:

>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']

In this case it might make more sense for you to use \b in order to match word boundaries.

>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']

Alternatively, you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:

>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
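
To make the point about capturing groups concrete, here is a minimal sketch (the variable names are only illustrative). A capturing group under * keeps only its last repetition, and re.findall returns the groups rather than the whole match whenever the pattern contains any. Using a non-capturing group (?:...) is one way to get the whole sentence back from findall; this is an assumption about what the asker wants, not the only fix.

```python
import re

s = "This is a regular sentence."

# A repeated capturing group remembers only its final repetition.
m = re.match(r"(\w+\s?)*", s)
whole = m.group(0)   # the full text consumed by the pattern
last = m.group(1)    # only the last repetition of the group

# With a non-capturing group, findall returns whole matches instead of groups.
sentences = re.findall(r"(?:\w+\s?)+", s)

print(whole)      # This is a regular sentence
print(last)       # sentence
print(sentences)  # ['This is a regular sentence']
```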

Answer 2

Here's an awesome regular expression tutorial website:

http://regexone.com/

Here's a regular expression that will match the examples given:

([a-zA-Z0-9,\. ]+)
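
As a quick check, the pattern can be run against the three sample strings from the question. This sketch assumes the whole string should match, so it uses re.fullmatch (available in Python 3.4+); the names pattern, samples, and results are just for illustration.

```python
import re

pattern = re.compile(r"([a-zA-Z0-9,\. ]+)")

samples = [
    "This is a regular sentence.",
    "this is also valid",
    "so is This ONE",
]

# fullmatch succeeds only if the entire string fits the character class.
results = [pattern.fullmatch(s) is not None for s in samples]
print(results)  # [True, True, True]
```

Note that this character class accepts any number of consecutive spaces, so it does not address the limit mentioned in the question's P.S.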

Answer 3

Why do you want to limit the number of consecutive whitespace characters? A sentence can contain any number of words (sequences of alphanumeric characters) and spaces in a row. What defines a sentence is rather that it is a stretch of text ending with a punctuation mark, or with anything else that is not an alphanumeric or whitespace character.

([a-zA-Z0-9\s])*

The above regex matches a sentence as a series of zero or more alphanumeric or whitespace characters. You can refine it to the following, though:

([a-zA-Z0-9])([a-zA-Z0-9\s])*

This simply states that the above sequence must begin with an alphanumeric character.

Hope this is what you were looking for.
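
The difference between the two patterns shows up on input with leading whitespace. The following is a small sketch; the names loose, strict, and text are hypothetical, and the test string is a padded version of one of the question's samples.

```python
import re

loose = re.compile(r"([a-zA-Z0-9\s])*")
strict = re.compile(r"([a-zA-Z0-9])([a-zA-Z0-9\s])*")

text = "  so is This ONE"

# The loose pattern happily starts on whitespace.
loose_match = loose.match(text).group(0)

# The strict pattern cannot start on a space, so match() at position 0 fails...
strict_at_start = strict.match(text)

# ...while search() skips ahead to the first alphanumeric character.
strict_found = strict.search(text).group(0)

print(repr(loose_match))   # '  so is This ONE'
print(strict_at_start)     # None
print(repr(strict_found))  # 'so is This ONE'
```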

Answer 4

Maybe this will help:

import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    # Print every sentence-like match together with its index.
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))

if __name__ == '__main__':
    main()

I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space carries the '+' metacharacter, so two or more spaces in a row count as an end of sentence.
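
For reference, here is a sketch of what the pattern yields on the sample text when the matches are collected into a list. Each match still includes its terminator, so this version calls strip() to drop trailing newlines and spaces; that cleanup step and the name sentences are additions for illustration, not part of the script above.

```python
import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

# group(0) ends with the terminator (period, newline, or run of spaces);
# strip() removes surrounding whitespace but keeps a trailing period.
sentences = [m.group(0).strip() for m in re_sentence.finditer(source)]

for i, sentence in enumerate(sentences):
    print("%d:%s" % (i, sentence))
```

The line containing two consecutive spaces is split into two separate entries, which is the effect of the third alternative.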

4 answers

Answer 1: 4, accepted, 2012-07-06 23:35:01
Answer 2: 3, 2012-07-06 23:34:46
Answer 3: 0, 2012-07-06 23:39:35
Answer 4: 0, 2012-07-07 15:42:23
