Python - parsing user input using a verbose regex

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see below the code.

My test "sentence" = How I'm 15.5% wholesome-looking USA we RADAR () [] {} you -- are, ... you?

import re

text = input("please type somewhat coherently: ")

pattern = r'''(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                       # any sequence of word characters
    # |[\d+(\.\d+)?%]           # percentages, 82%
    |[][\{\}.,;"'?():-_`]       # these are separate tokens
    '''

parsed = re.findall(pattern, text)
print(parsed)

My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'USA', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']

I am looking to have '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when uncommented it does absolutely nothing. I searched for resources to help, but none of them did.

Thank you for your time.

If you just want the percentage to match as a whole entity, be aware that the regex engine analyzes the input string and the pattern from left to right. In an alternation, the leftmost alternative that matches at the current position is chosen, and the rest are not even tested.
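As a small illustration of that left-to-right behaviour, here is a sketch with two stripped-down patterns (not the full tokenizer from the question) that differ only in the order of their alternatives. The variable names are made up for the example.

```python
import re

sample = "15.5%"

# Word alternative first: \w+ grabs "15" before the percentage branch is tried.
word_first = re.findall(r"\w+|\d+(?:\.\d+)?%", sample)

# Percentage alternative first: the whole "15.5%" is consumed as one token.
percent_first = re.findall(r"\d+(?:\.\d+)?%|\w+", sample)

print(word_first)     # ['15', '5']
print(percent_first)  # ['15.5%']
```

In the first pattern the percentage branch is never reached at position 0, because \w+ already succeeds there.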

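Thus, you need to pull the percentage alternative \d+(?:\.\d+)?% up, above the \w+ branch. Drop the surrounding square brackets too, since they turn the expression into a character class that matches a single character, and make the capturing group non-capturing, or findall will return the group contents instead of the whole match: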

(?x)                        # set flag to allow verbose regexps
(?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%             # percentages, 82%  <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
|[-.(]+                     # double hyphen, ellipsis, and open parenthesis
|\S\w*                      # any other non-space character plus following word characters
|[][{}.,;"'?():_`-]         # these are separate tokens

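For a quick check, the reordered pattern can be run against the test sentence from the question. This is only a sketch: the sentence is hard-coded instead of being read with input(), and the last two lines use a separate, simplified pattern to show what a capturing group does to findall.

```python
import re

pattern = r'''(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\d+(?:\.\d+)?%             # percentages, e.g. 82% or 15.5%
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                      # any other non-space character plus following word characters
    |[][{}.,;"'?():_`-]         # these are separate tokens
    '''

# Test sentence from the question, hard-coded so the example runs unattended.
text = "How I'm 15.5% wholesome-looking USA we RADAR () [] {} you -- are, ... you?"

parsed = re.findall(pattern, text)
print(parsed)

# With a capturing group, findall returns the group contents, not the whole match.
captured = re.findall(r"\d+(\.\d+)?%", "15.5% and 82%")
print(captured)  # ['.5', '']
```

The first print shows the same tokens as the output in the question, except that '15', '.', '5', '%' now come out as the single token '15.5%'.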

Also, please note that I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped inside a character class, and - was forming an unintended range from a colon (decimal code 58) to an underscore (decimal code 95), which also matches ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ], and ^.
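The effect of the hyphen position can be checked directly. The sketch below uses two minimal character classes (not the full token class) and lists which ASCII characters each one accepts.

```python
import re

# Hyphen between ':' and '_' forms a range covering code points 58 to 95.
range_class = re.compile(r"[:-_]")
in_range = [chr(c) for c in range(128) if range_class.fullmatch(chr(c))]

# Hyphen at the end of the class is a literal, so only three characters match.
literal_class = re.compile(r"[:_-]")
literals = [chr(c) for c in range(128) if literal_class.fullmatch(chr(c))]

print("".join(in_range))
print(literals)  # ['-', ':', '_']
```

Note that the range version does not match a literal hyphen at all, which is probably the opposite of what was intended.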
