Python - parsing user input using a verbose regex

Question

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see below the code.

My test "sentence" = How I'm 15.5% wholesome-looking USA we RADAR () [] {} you -- are, ... you?

import re

text = input("please type somewhat coherently: ")

pattern = r'''(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                       # any sequence of word characters
    # |[\d+(\.\d+)?%]           # percentages, 82%
    |[][\{\}.,;"'?():-_`]       # these are separate tokens
    '''

parsed = re.findall(pattern, text)
print(parsed)

My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'USA', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']

I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when uncommented it does absolutely nothing. I searched for resources to help, but they have not.

Thank you for your time.

Answer 1

If you just want the percentage to match as a single token, be aware that the regex engine scans both the input string and the pattern from left to right. In an alternation, the first (leftmost) alternative that matches at the current position is chosen, and the remaining alternatives are not even tried. Here, \w+ consumes "15" before the percentage alternative is ever reached.
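
The effect of alternation order can be checked in isolation. The following is a minimal sketch with a cut-down two-alternative pattern; the variable names are illustrative and not part of the original code.

```python
import re

text = "15.5%"

# Word alternative first: \w+ consumes "15" before the percentage
# alternative gets a chance at that position.
word_first = re.findall(r"\w+|\d+(?:\.\d+)?%", text)

# Percentage alternative first: it is tried before \w+ at each position.
percent_first = re.findall(r"\d+(?:\.\d+)?%|\w+", text)

print(word_first)     # ['15', '5']
print(percent_first)  # ['15.5%']
```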

Thus, you need to pull the alternative \d+(?:\.\d+)?% up, and the capturing group should be turned into a non-capturing one, or re.findall will return the contents of the group instead of the whole match:

(?x)              # set flag to allow verbose regexps
(?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%           # percentages, 82%  <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
|[-.(]+                     # double hyphen, ellipsis, and open parenthesis
|\S\w*                       # any other non-space character, plus trailing word characters
|[][{}.,;"'?():_`-]       # these are separate tokens
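
As a quick check, the corrected pattern can be run against the test sentence from the question. This is a sketch, not a definitive tokenizer: the names `sentence`, `tokens`, `sample`, `with_group` and `without_group` are assumptions introduced here, and the last lines illustrate why the group has to be non-capturing when re.findall is used.

```python
import re

pattern = r'''(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\d+(?:\.\d+)?%             # percentages, e.g. 82% or 15.5%
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                      # any other non-space character, plus word characters
    |[][{}.,;"'?():_`-]         # these are separate tokens
    '''

sentence = ("How I'm 15.5% wholesome-looking USA we RADAR "
            "() [] {} you -- are, ... you?")
tokens = re.findall(pattern, sentence)
print(tokens)

# With a capturing group, findall returns the group, not the whole match.
sample = "15.5% and 82%"
with_group = re.findall(r"\d+(\.\d+)?%", sample)       # ['.5', '']
without_group = re.findall(r"\d+(?:\.\d+)?%", sample)  # ['15.5%', '82%']
print(with_group, without_group)
```

The token list is the same as the output in the question, except that '15.5%' now appears as a single token.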

Also, please note that I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped inside a character class, and the - was forming an unintended range from a colon (decimal code 58) to an underscore (decimal code 95), which also matches ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.
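
The accidental range is easy to confirm with a small probe string. The sketch below uses made-up variable names. Its last lines illustrate a related point the answer leaves implicit: the commented-out line in the question wrapped the whole percentage expression in square brackets, which turns it into a character class that matches one character at a time.

```python
import re

probe = "A;Z^a:_-"

# ':-_' is read as a range from ':' (58) to '_' (95), so it also
# matches uppercase letters and several punctuation characters.
accidental_range = re.findall(r"[:-_]", probe)

# With '-' placed last, it is a literal hyphen.
literal_hyphen = re.findall(r"[:_-]", probe)

print(accidental_range)  # ['A', ';', 'Z', '^', ':', '_']
print(literal_hyphen)    # [':', '_', '-']

# Inside [...] the quantifiers and parentheses lose their meaning;
# the class matches a single digit, '+', '(', '.', ')', '?' or '%'.
bracketed = re.findall(r"[\d+(\.\d+)?%]", "15.5%")
print(bracketed)         # ['1', '5', '.', '5', '%']
```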

1 answer

solution1
1 ACCEPTED 2015-08-30 21:40:04
