Confusing Behaviour of regex in Python
I'm trying to match a specific pattern using the re module in Python. I wish to match a full sentence (more precisely, alphanumeric string sequences separated by spaces and/or punctuation).
E.g.
I've tried out various combinations of regular expressions, but I am unable to grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I've tried:
"((\\w+)(\\s?))*"
To the best of my knowledge, this should greedily match one or more alphanumerics followed by either one or no white-space character, and then greedily repeat that entire pattern. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.) The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE. In that case, is the space alone the preceding RE for '?', or is '\\w+ ' the preceding RE? And what will be the preceding RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\\w+\\s?)+)"; "((\\w*)(\\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of white spaces, and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S. I eventually got "[ \\w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.
Your reasoning about the regex is correct; your problem comes from using capturing groups with *. When a pattern contains groups, re.findall returns the groups rather than the whole match, and a repeated group keeps only its last iteration. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense to use \b to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively, you can match the entire sentence via re.match and call group(0) on the returned match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
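To make the point about capturing groups concrete, here is a minimal sketch comparing a capturing and a non-capturing version of the same pattern with re.findall. The variable names are only for illustration, and + is used instead of * so that no empty matches clutter the result.

```python
import re

s = "This is a regular sentence."

# One capturing group: findall returns that group, and a repeated
# group only remembers the text from its final iteration.
last_only = re.findall(r'(\w+\s?)+', s)
print(last_only)   # ['sentence']

# Non-capturing group (?:...): findall returns the whole match instead.
whole = re.findall(r'(?:\w+\s?)+', s)
print(whole)       # ['This is a regular sentence']
```

Because \s? allows at most one white-space character after each word, the non-capturing version also limits consecutive white space to a single character.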
Here's an awesome Regular Expression tutorial website: http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
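A quick way to check this character class is re.fullmatch, which succeeds only if the whole string fits the pattern. This is a rough sketch; the sample strings are borrowed from another answer in this thread, and the final counter-example is made up for illustration.

```python
import re

sentence_re = re.compile(r'([a-zA-Z0-9,\. ]+)')

samples = ["This is a regular sentence.", "this is also valid", "so is This ONE"]
for text in samples:
    # fullmatch returns a match object only when the entire string matches
    print(repr(text), bool(sentence_re.fullmatch(text)))

# A character outside the class (here '?') makes the full match fail.
print(sentence_re.fullmatch("Is this valid?"))   # None
```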
Why do you want to limit the number of consecutive white-space characters? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; what defines a sentence is that it is the stretch of text ending at a punctuation mark, or rather at anything that is not an alphanumeric or white-space character.
([a-zA-Z0-9\s])*
The above regex matches a sentence as a run of alphanumeric or white-space characters repeated zero or more times. You can refine it to the following, though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
This simply states that the above sequence must begin with an alphanumeric character.
Hope this is what you were looking for.
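As a rough illustration of the refined pattern, the sketch below runs it through re.match and reads the whole match with group(0). The second input string is an assumption chosen to show the effect of requiring a leading alphanumeric.

```python
import re

pattern = re.compile(r'([a-zA-Z0-9])([a-zA-Z0-9\s])*')

# The match stops at the first character outside the class (the period).
m = pattern.match("This is a regular sentence.")
print(m.group(0))   # This is a regular sentence

# re.match anchors at the start, so leading spaces prevent any match.
print(pattern.match("  leading spaces"))   # None
```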
Maybe this will help:
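import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

# A sentence starts at a character that is not a space, newline, or period and
# runs lazily up to a terminator: a period, a newline, or two or more spaces.
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    i = 0
    for s in re_sentence.finditer(source):
        # group(0) is the whole match, including the terminator
        print("%d:%s" % (i, s.group(0)))
        i += 1

if __name__ == '__main__':
    main()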
I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space has the '+' meta-character, so two or more spaces in a row count as an end-of-sentence.
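The same end-of-sentence rule can be wrapped in a small helper that returns the sentences without their terminators. This is only a sketch: split_sentences and SENTENCE_RE are hypothetical names, and the group is made non-capturing since only the whole match is used.

```python
import re

# A sentence ends at a period, a newline, or two or more spaces in a row.
SENTENCE_RE = re.compile(r'[^ \n.].*?(?:\.|\n|  +)')

def split_sentences(text):
    """Return each matched sentence with its trailing terminator stripped."""
    return [m.group(0).rstrip(' .\n') for m in SENTENCE_RE.finditer(text)]

result = split_sentences("so is This ONE\nhow about this one  followed by this one\n")
print(result)
# ['so is This ONE', 'how about this one', 'followed by this one']
```

Note that a final sentence with no terminator after it is not matched by this pattern, so the input should end with a period or newline.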