Using non-greedy qualifiers in regular expression '*?'

I have a Python 2.7 based program that takes input from a user, either as a text file or directly at the command prompt. I then have to verify the input to make sure it is in the correct format (and reformat it in code if needed).

Ideal input from the user is a list of words separated by commas, like "blade, coffeen, cardiac". But I want to add a check that also allows inputs in formats such as "blade, coffeen, , cardiac", "blade,,cofeen,cardiac", "blade coffeen cardiac", etc.

This means I probably have to use a non-greedy qualifier like *?. Though I know basic regular expressions, this particular one, *?, is a bit unclear to me. Right now I am using input_string.split(',') followed by input_string.split(' ') to account for the various scenarios, but it feels unpythonic and messy. I also wonder whether there are scenarios I haven't thought of that will break the code in production.

Looking on the internet, https://docs.python.org/2/howto/regex.html does a pretty good job of explaining how .*? works, and I think that using [, ]*? could solve my problem.

My question is:
1. Can I use [, ]*? in my case, to account for the three possible input scenarios that I described above?
As I mentioned, I am calling string.split(',') after the validation anyway, followed by string.split(' '), but a regular expression check would make it cleaner. In any case, I would love to understand exactly how [, ]*? will behave if used.

Yes, you can use [, ]*? if it is part of a larger regular expression. You may as well use re.split() in your case and avoid having to call str.split() separately.

However, you don't need the ? greediness modifier, as you are explicitly checking for words separated by spaces and commas. The word characters themselves will never match the [, ]* set. In fact, greedy matching will improve the performance of the match here, as the regular expression will pick up all spaces and commas in one go, rather than taking just one at a time and then having to check whether the rest of your pattern matches right after the first space or comma, then the next space or comma, etc.

So, the following works, but removing the ? makes the expression work better:

>>> import re
>>> test = "blade, coffeen,    , cardiac"
>>> re.search('blade[, ]*?coffeen[, ]*?cardiac', test)
<_sre.SRE_Match object at 0x100758c60>
>>> re.search('blade[, ]*coffeen[, ]*cardiac', test)
<_sre.SRE_Match object at 0x1026101d0>

You'll notice the problem when you try to use re.split() with only [, ]*? as the pattern:

>>> import re
>>> test = "blade, coffeen,    , cardiac"
>>> re.split('[, ]*?', test)
['blade, coffeen,    , cardiac']

When splitting by [, ]*?, even a zero-width string (an empty string) matches the expression, and re.split() won't split on empty strings alone. Because the pattern is non-greedy, a zero-width string satisfies the test, and the regex engine won't go looking for more.
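The zero-width behaviour can be checked directly with re.match(). This is a minimal sketch; the separators string is made up for illustration and is not part of the original question.

```python
import re

# Hypothetical sample: a run made up only of commas and spaces.
separators = ", ,  "

# Lazy star: it is satisfied with zero characters, so it matches the empty string.
lazy = re.match(r'[, ]*?', separators)
print(repr(lazy.group()))

# Greedy star: it consumes the whole run of commas and spaces.
greedy = re.match(r'[, ]*', separators)
print(repr(greedy.group()))
```

The first call prints an empty string and the second prints the full run, which is why the lazy pattern on its own gives re.split() nothing to split on.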

You could change it to +? (match one or more, but as few as possible):

>>> re.split('[, ]+?', test)
['blade', '', 'coffeen', '', '', '', '', '', '', 'cardiac']

Now you get a whole series of empty strings in between, because each space and comma matches separately, and the empty strings are what lies between those individual separators.
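To see where those empty strings come from, it can help to list what each pattern matches with re.findall(). This is only an illustrative sketch that reuses the test string from the sessions above:

```python
import re

test = "blade, coffeen,    , cardiac"

# Lazy plus: every match stops after a single separator character.
lazy_runs = re.findall(r'[, ]+?', test)
print(lazy_runs)

# Greedy plus: every match is a whole run of separators between two words.
greedy_runs = re.findall(r'[, ]+', test)
print(len(greedy_runs))  # 2, one run per gap between the three words
```

Each single-character match from the lazy pattern becomes its own split point, so re.split() returns an empty string between every pair of adjacent separators.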

Only when I remove the non-greedy modifier does it correctly split your options into a list, because now the whole run of whitespace and commas between the words matches and is used to split on:

>>> re.split('[, ]*', test)
['blade', 'coffeen', 'cardiac']

So don't fear greediness when matching a very specific subset of characters where the boundaries can't be confused or over-matched.
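Putting this together for the three input styles in the question, one possible helper is sketched below. The name parse_words is hypothetical, and the sketch swaps in [, ]+ rather than [, ]*: the sessions above are from Python 2.7, but from Python 3.7 onward re.split() also splits on empty matches, so a pattern that can never match the empty string should behave the same on both.

```python
import re

def parse_words(raw):
    """Split raw user input on runs of commas and/or spaces (hypothetical helper)."""
    # `+` requires at least one separator character, so the pattern
    # can never match an empty string.
    parts = re.split(r'[, ]+', raw.strip())
    # Drop empty strings left behind by leading or trailing separators.
    return [part for part in parts if part]

for raw in ("blade, coffeen, , cardiac",
            "blade,,cofeen,cardiac",
            "blade coffeen cardiac"):
    print(parse_words(raw))
```

All three inputs come back as a list of three words. Note that this only splits on commas and plain spaces; if tabs or newlines can appear (for example in the text-file input), the character class would need to be widened, perhaps to [,\s]+.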
