
Regular expression for simple math expressions

As an exercise I was trying to come up with a regex to evaluate simple algebra like

q = '23 * 345 - 123+65'

From here I want to get '23', '*', '345', '-', '123', '+', '65'.

Now, I've searched similar questions, and other people have solved this. But what I really want to know is why my solution doesn't work.

Here's the best I got:

regexparse = r'(\d+\s*(\*|\/|\+|\-)\s*)+(\d+\s*)'

Explanation

  • (\d+\s*(\*|\/|\+|\-)\s*)+
    • One or more digits \d+, optionally followed by whitespace \s*, then exactly one of the operators (\*|\/|\+|\-), then optional whitespace \s*. The whole group must appear at least once: ( ... )+
  • (\d+\s*)
    • One or more digits, optionally followed by whitespace

However, when I run the code

m = re.match(regexparse, q)
print(m.group(0))  # the whole match
print(m.group(1))
print(m.group(2))
print(m.group(3))

I get

23 * 345 - 123+65
123+
+
65

So it's like the first block is matching the least amount possible of chars. Why?

This is your regex:

(\d+\s*(\*|\/|\+|\-)\s*)+(\d+\s*)

(\d+\s*(\*|\/|\+|\-)\s*) will match the first part of your expression: 23 * and store * in the second group.

Then the + makes it repeat, but because repeating capture groups retain only their last match, it will discard 23 * and * and instead match 345 - and - in the second group.

The + works again on the next repeat to discard the last capture and instead capture 123+ in the first group and + in the second.

Next, + cannot repeat any more, so it stops, and (\d+\s*) starts matching to get 65.
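To watch those repetitions one at a time, here is a minimal sketch that runs the repeated unit on its own with re.finditer. The names unit and steps are only illustrative, and for this input each finditer hit happens to line up with one repetition of the group in the original regex.

```python
import re

q = '23 * 345 - 123+65'

# The repeated unit from the question, without the trailing "+" quantifier.
unit = r'(\d+\s*(\*|\/|\+|\-)\s*)'

# One (group 1, group 2) pair per hit. In the original regex each
# repetition overwrites the pair captured by the previous one.
steps = [(m.group(1), m.group(2)) for m in re.finditer(unit, q)]
for outer, operator in steps:
    print(repr(outer), repr(operator))
```

Only the last pair printed is what group(1) and group(2) report in the question's output.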


The fact that a repeated capture group stores only its last capture is by design, and most regex engines behave this way, AFAIK. (.NET is a notable exception: it exposes every capture of a group through Group.Captures.)


Further elaboration:

There's a difference between matching repeatedly and capturing repeatedly. Try (\d)+ on 12345 and you will see that only 5 is captured. That's because each pair of parentheses is assigned one group number. The first pair is group 1, and if group 1 captures many times, you can only keep one capture, which is the last. This is how regex works, unfortunately, as per the docs:

If a group matches multiple times, only the last match is accessible
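The (\d)+ experiment mentioned above can be run directly. This is a small sketch; the variable names whole and last are just for illustration.

```python
import re

m = re.match(r'(\d)+', '12345')

whole = m.group(0)  # the entire match: the group repeated five times
last = m.group(1)   # group 1 keeps only its final repetition

print(whole)  # 12345
print(last)   # 5
```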


If you want to get your desired output, you might use re.findall and match with \d+|[+/*-]:

import re
q = '23 * 345 - 123+65'
regexparse = r'\d+|[+/*-]'
elem = re.findall(regexparse, q)
print(elem)
#=> ['23', '*', '345', '-', '123', '+', '65']
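Note that re.findall simply skips anything it cannot match, so it does not check that the string is a well-formed expression. If that check matters, one possible approach is to validate first with a non-capturing version of the question's pattern and only then tokenize. The sketch below assumes Python 3.4+ (for re.fullmatch); the function name tokenize and the two pattern names are hypothetical.

```python
import re

# Non-capturing version of the question's pattern, used only for validation.
VALID = re.compile(r'(?:\d+\s*[-+*/]\s*)+\d+\s*')
# Token pattern from the answer above.
TOKEN = re.compile(r'\d+|[-+*/]')

def tokenize(expr):
    """Return the numbers and operators in expr, or raise ValueError."""
    if VALID.fullmatch(expr) is None:
        raise ValueError('not a simple expression: %r' % expr)
    return TOKEN.findall(expr)

print(tokenize('23 * 345 - 123+65'))
```

With this sketch, an input such as '23 * * 4' raises ValueError instead of silently returning tokens.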

I can only speak of regex in general, as I don't know Python, but your problem is that in

(\d+\s*[\*/+-]\s*)+(\d+\s*)

This portion

(\d+\s*[\*/+-]\s*)+

is repeated, and once matching is finished you only see the capture from the final repetition.

Simply try this.

import re
q = '23 * 345 - 123+65'
regexparse = r'(\d+)|[-+*/]'
for i in re.finditer(regexparse, q):
    print(i.group(0))

output:

23
*
345
-
123
+
65

Your regex is confusing. Better to use re.split() for this purpose:

import re
q = '23 * 345 - 123+65'
print(re.split(r'\s*([-+/*])\s*', q))

Outputs:

['23', '*', '345', '-', '123', '+', '65']
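The parentheses around the operator class matter here: re.split returns the text of capturing groups along with the pieces, so the operators are kept. A short comparison (the variable names are illustrative):

```python
import re

q = '23 * 345 - 123+65'

# Capturing group around the operator: the operators appear in the result.
with_group = re.split(r'\s*([-+/*])\s*', q)
# No capturing group: the operators are treated as separators and dropped.
without_group = re.split(r'\s*[-+/*]\s*', q)

print(with_group)     # ['23', '*', '345', '-', '123', '+', '65']
print(without_group)  # ['23', '345', '123', '65']
```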
