Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in Python. I wish to match a full sentence (more precisely, alphanumeric string sequences separated by spaces and/or punctuation).

E.g.:

  • "This is a regular sentence."
  • "this is also valid"
  • "so is This ONE"

I've tried various combinations of regular expressions, but I am unable to grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).

I've tried:

  • "((\w+)(\s?))*"

    To the best of my knowledge, this should greedily match one or more alphanumerics followed by either one or no whitespace character, and then greedily repeat that entire pattern. That is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.) The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].

  • "(\w+ ?)*"

    I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, and ? each match some number of (greedy) repetitions of the preceding RE. In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the preceding RE for the '*' operator? The output I get with this is ['sentence'].

  • Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of whitespace characters, and that this pattern is repeated over and over.

Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?


PS: I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive whitespace characters.

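Your reasoning about the regex is correct; your problem comes from using capturing groups with *. When a pattern contains groups, re.findall returns the groups rather than the whole match, and a repeated group keeps only the text from its last repetition. Here's an alternative: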

>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']

In this case it might make more sense for you to use \b in order to match word boundaries.

>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']

Alternatively, you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:

>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
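To make the point about capturing groups concrete, here is a small sketch (not part of the original answer) that compares the asker's pattern with a non-capturing variant. The variable names are only illustrative, and swapping the outer * for + is my own choice to avoid empty matches; (?:...) is the re module's standard non-capturing group syntax.

```python
import re

s = "This is a regular sentence."

# With capturing groups, findall returns the groups, and a repeated group
# only keeps the text from its last repetition.
with_groups = re.findall(r"((\w+)(\s?))*", s)
print(with_groups[0])

# With a non-capturing group, findall returns the whole match instead.
# The outer + (rather than *) means the pattern cannot match an empty string.
whole = re.findall(r"(?:\w+\s?)+", s)
print(whole)
```

The first print shows only the last word because each repetition of the outer group overwrites what the previous one captured.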

Here's an awesome Regular Expression tutorial website:

http://regexone.com/

Here's a Regular Expression that will match the examples given:

([a-zA-Z0-9,\. ]+)
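As a quick check, the pattern above can be run against the three sample strings from the question. This is only a sketch: using re.fullmatch (available in Python 3.4 and later) to require that the whole string matches is my assumption about how the pattern would be applied, and the names examples and results are illustrative.

```python
import re

examples = [
    "This is a regular sentence.",
    "this is also valid",
    "so is This ONE",
]

pattern = re.compile(r"([a-zA-Z0-9,\. ]+)")

# fullmatch succeeds only if the pattern covers the entire string.
results = [bool(pattern.fullmatch(text)) for text in examples]
print(results)
```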

Why do you want to limit the number of consecutive whitespace characters? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; what defines a sentence is that it is the stretch of text that ends with a punctuation mark, or rather with anything that is not in the above sequence (which includes whitespace).

([a-zA-Z0-9\s])*

The above regex will match a sentence made up of alphanumeric characters or whitespace, repeated zero or more times. You can refine it to the following, though:

([a-zA-Z0-9])([a-zA-Z0-9\s])*

This simply states that the above sequence must begin with an alphanumeric character.

Hope this is what you were looking for.
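The difference between the two patterns shows up on whitespace-only and empty input. The following is a rough sketch, assuming the patterns are applied with re.fullmatch; the names loose, strict, and results are illustrative and not from the answer.

```python
import re

loose = re.compile(r"([a-zA-Z0-9\s])*")
strict = re.compile(r"([a-zA-Z0-9])([a-zA-Z0-9\s])*")

# For each input, record whether the loose and the strict pattern
# match the entire string.
results = {
    text: (bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
    for text in ["so is This ONE", "   ", ""]
}

for text, (loose_ok, strict_ok) in results.items():
    print(repr(text), loose_ok, strict_ok)
```

The loose pattern accepts all three inputs, while the strict one rejects the blank and empty strings because it needs a leading alphanumeric character.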

Maybe this will help:

import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    # Number each match and print it; a match keeps its trailing terminator.
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))

if __name__ == '__main__':
    main()

I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space is followed by the '+' metacharacter, so that two or more spaces in a row will be an end-of-sentence.
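If a list of sentences is more convenient than printed output, the same idea can be wrapped in a helper. This is a minimal sketch: split_sentences is a hypothetical name, stripping the trailing newline or spaces is my addition, and the third alternative is written as " {2,}", which is equivalent to two spaces followed by '+' but easier to read.

```python
import re

# Same end-of-sentence conditions as above: a period, a newline,
# or two or more spaces in a row (written here as " {2,}").
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| {2,})')

def split_sentences(text):
    """Return each matched chunk with surrounding whitespace stripped."""
    return [m.group(0).strip() for m in re_sentence.finditer(text)]

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

sentences = split_sentences(source)
for sentence in sentences:
    print(sentence)
```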
