Using the non-greedy qualifier '*?' in a regular expression

I have a Python 2.7 based program that takes input from a user, either as a text file or directly on the command prompt. I then have to verify that the input is in the correct format (and reformat it in code if needed).

Ideal input from the user is a list of words separated by commas, like "blade, coffeen, cardiac". But I want to add a check that also allows input in the formats "blade, coffeen, , cardiac", "blade,,cofeen,cardiac", "blade coffeen cardiac", etc.

This means I probably have to use a non-greedy qualifier like *?. Though I know basic regular expressions, this particular one is a bit unclear to me. Right now I am using input_string.split(',') followed by an input_string.split(' ') to account for the various scenarios, but it feels unpythonic and messy. I also wonder whether there are scenarios I haven't thought of that will break the code in production.

Looking on the internet, this link https://docs.python.org/2/howto/regex.html does a pretty good job of explaining how .*? works, and I think if I use [, ]*? , that can solve my problem.

My question is:
1. Can I use [, ]*? in my case, to account for the three possible scenarios for inputs that I described above?
As I mentioned, I am using string.split(',') after the validation anyway, followed by a string.split(' '), but a regular expression check would make it cleaner. In any case, I would love to understand exactly how [, ]*? will behave if used.

Yes, you can use [, ]*? if it is part of a larger regular expression. And you may as well use re.split() in your case and avoid having to use str.split() separately.

However, you don't need the ? non-greedy modifier, as you are explicitly checking for words separated by spaces and commas. The word characters will never match the [, ]* set themselves, so there is nothing to over-match. In fact, greedy matching will improve performance here: the regular expression picks up all the spaces and commas in one go, rather than taking one at a time and checking after each space or comma whether the rest of your pattern matches.

So, the following works, but removing the ? makes the expression work better:

>>> import re
>>> test = "blade, coffeen,    , cardiac"
>>> re.search('blade[, ]*?coffeen[, ]*?cardiac', test)
<_sre.SRE_Match object at 0x100758c60>
>>> re.search('blade[, ]*coffeen[, ]*cardiac', test)
<_sre.SRE_Match object at 0x1026101d0>
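
The match objects above don't show what was actually matched. As a quick check (a minimal sketch; the variable names are only for illustration), you can compare the spans of the two matches and confirm that the lazy and greedy versions cover exactly the same text:

```python
import re

test = "blade, coffeen,    , cardiac"

lazy = re.search(r'blade[, ]*?coffeen[, ]*?cardiac', test)
greedy = re.search(r'blade[, ]*coffeen[, ]*cardiac', test)

# Both patterns match the same stretch of the input, because the
# letters in the words can never be consumed by the [, ] set.
same_span = lazy.span() == greedy.span()
print(same_span)
```

So the ? changes how the engine gets to the match, not which text ends up matched.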

You'll notice the problem when you try to use re.split() and only use [, ]*? as the pattern:

>>> import re
>>> test = "blade, coffeen,    , cardiac"
>>> re.split('[, ]*?', test)
['blade, coffeen,    , cardiac']

When splitting by [, ]*?, even a zero-width string (an empty string) matches the expression, and re.split() won't split on empty strings alone. Because the pattern is non-greedy, a zero-width match already satisfies it, and the regex engine won't go looking for more.

You could modify it to use +? (match one or more, but as few as possible):

>>> re.split('[, ]+?', test)
['blade', '', 'coffeen', '', '', '', '', '', '', 'cardiac']

Now you get a whole series of empty strings in between, because each individual space or comma between the words is treated as a separate split point.
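
To see why, it can help to match each quantifier on its own against a run of separators. The following is a small sketch (the sample string and variable names are made up for illustration) showing how much text each variant consumes when nothing follows it in the pattern:

```python
import re

separators = ", ,"

# Lazy *? takes as little as possible, which is zero characters.
lazy_star = re.match(r'[, ]*?', separators).group(0)
# Lazy +? must take at least one character, and stops right there.
lazy_plus = re.match(r'[, ]+?', separators).group(0)
# Greedy * takes the whole run of separators in one go.
greedy_star = re.match(r'[, ]*', separators).group(0)

print(repr(lazy_star))    # ''
print(repr(lazy_plus))    # ','
print(repr(greedy_star))  # ', ,'
```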

Only when I remove the non-greedy modifier does it correctly split your input into a list, because now each whole run of spaces and commas between the words matches as a single separator:

>>> re.split('[, ]*', test)
['blade', 'coffeen', 'cardiac']

So don't fear greediness when you are matching a very specific subset of characters where the boundaries can't be confused or over-matched.
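
Putting this together, a helper along the following lines could replace the two str.split() calls. This is only a sketch: parse_words is a hypothetical name, and it uses [, ]+ rather than [, ]* on the assumption that you may want the same result outside Python 2.7, since from Python 3.7 on re.split() also splits on empty matches and [, ]* would then behave differently.

```python
import re

def parse_words(raw):
    """Split user input on any run of commas and/or spaces."""
    # [, ]+ (one or more) can never match the empty string, so the
    # result does not depend on how re.split() treats empty matches.
    parts = re.split(r'[, ]+', raw.strip())
    # Drop empty strings left by leading or trailing separators.
    return [part for part in parts if part]

samples = [
    "blade, coffeen, , cardiac",
    "blade,,cofeen,cardiac",
    "blade coffeen cardiac",
]
for raw in samples:
    print(parse_words(raw))
```

Each of the three input styles from the question comes back as a plain list of three words, and input such as ",blade," or an empty string yields no empty entries.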
