As an exercise I was trying to come up with a regex to evaluate simple algebra like
q = '23 * 345 - 123+65'
From here I want to get '23', '*', '345', '-', '123', '+', '65'.
Now, I've searched similar questions, and other people have solved this. But what I really want to know is why my solution doesn't work.
Here's the best I got:
regexparse = r'(\d+\s*(\*|\/|\+|\-)\s*)+(\d+\s*)'
However, when I run the code
m = re.match(regexparse, q)
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
I get
23 * 345 - 123+65
123+
+
65
So it's like the first block is matching the least amount possible of chars. Why?
This is your regex:
(\d+\s*(\*|\/|\+|\-)\s*)+(\d+\s*)
(\d+\s*(\*|\/|\+|\-)\s*) will match the first part of your expression, 23 *, and store * in the second group.
Then the + makes it repeat, but because repeating capture groups retain only their last match, it will discard 23 * and * and instead match 345 - in the first group and - in the second group.
The + works again on the next repeat to discard the last capture and instead capture 123+ in the first group and + in the second.
Next, + cannot repeat any more, so it stops, and (\d+\s*) starts matching to get 65.
The fact that repeating capture groups store only the last capture is how regex works by design and is like this in all regex engines AFAIK.
Further elaboration:
There's a difference between matching repeatedly and capturing repeatedly. Try (\d)+ on 12345 and you will see that only 5 is captured. That's because each pair of parentheses is assigned one capture group number. The first pair is group 1, and if group 1 captures many times, it can keep only one value, and that is the last. This is how regex works, unfortunately, as per the docs:
If a group matches multiple times, only the last match is accessible
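To see the difference between matching and capturing for yourself, here is a minimal sketch using the (\d)+ example from above. The variable names m and m2 are just for illustration.

```python
import re

# The group is repeated five times, but only its last capture survives.
m = re.match(r'(\d)+', '12345')
print(m.group(0))  # whole match: 12345
print(m.group(1))  # last capture of group 1: 5

# Moving the quantifier inside the group captures the whole run instead.
m2 = re.match(r'(\d+)', '12345')
print(m2.group(1))  # 12345
```

The overall match still covers every digit. It is only the numbered group that keeps a single value.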
If you want to get your desired output, you might use re.findall and match with \d+|[+/*-]:
import re
q = '23 * 345 - 123+65'
regexparse = r'\d+|[+/*-]'
elem = re.findall(regexparse, q)
print(elem)
#=> ['23', '*', '345', '-', '123', '+', '65']
I can only speak of regex in general, as I don't know Python, but your problem is that in
(\d+\s*[\*/+-]\s*)+(\d+\s*)
this portion
(\d+\s*[\*/+-]\s*)+
is being repeated, and when it's completely done evaluating, you only see the final repetition.
Simply try this.
import re
q = '23 * 345 - 123+65'
regexparse = r'(\d+)|[-+*/]'
for i in re.finditer(regexparse, q):
    print(i.group(0))
output:
23
*
345
-
123
+
65
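One thing to watch with this pattern: because it contains a capture group, re.findall would return the group's contents rather than the whole match, so the operators would come back as empty strings. That is presumably why group(0) is used with re.finditer here. A small sketch of the difference (the names via_findall and via_finditer are made up for this example):

```python
import re

q = '23 * 345 - 123+65'
regexparse = r'(\d+)|[-+*/]'

# With one capture group, findall returns what group 1 captured for each
# match; matches of the operator branch leave group 1 empty.
via_findall = re.findall(regexparse, q)
print(via_findall)   # ['23', '', '345', '', '123', '', '65']

# group(0) is always the full text of the match, whichever branch matched.
via_finditer = [m.group(0) for m in re.finditer(regexparse, q)]
print(via_finditer)  # ['23', '*', '345', '-', '123', '+', '65']
```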
Your regex is confusing. Better to use re.split() for this purpose:
import re
q = '23 * 345 - 123+65'
print(re.split(r'\s*([-+/*])\s*', q))
Outputs:
['23', '*', '345', '-', '123', '+', '65']
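The parentheses around the operator class matter here: re.split keeps the text of any capture group in the result, and drops the separators when there is no group. A quick sketch to compare the two (with_group and without_group are illustrative names):

```python
import re

q = '23 * 345 - 123+65'

# Capture group around the operator: separators are kept in the result.
with_group = re.split(r'\s*([-+/*])\s*', q)
print(with_group)     # ['23', '*', '345', '-', '123', '+', '65']

# No capture group: the operators are consumed and dropped.
without_group = re.split(r'\s*[-+/*]\s*', q)
print(without_group)  # ['23', '345', '123', '65']
```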