Parse arithmetic string with regular expression

I need to parse an arithmetic string that contains only multiplication ( * ) and addition ( + ), e.g. 300+10*51+20+2*21 , using regular expressions.

I have the working code below:

import re


input_str = '300+10*51+20+2*21'

#input_str = '1*2+3*4'


prod_re = re.compile(r"(\d+)\*(\d+)")
sum_re = re.compile(r"(\d+)\+?")

result = 0
index = 0
while (index <= len(input_str)-1):
    #-----
    prod_match = prod_re.match(input_str, index)
    if prod_match:
        # print 'find prod', prod_match.groups()
        result += int(prod_match.group(1))*int(prod_match.group(2))
        index += len(prod_match.group(0))+1
        continue
    #-----
    sum_match = sum_re.match(input_str, index)
    if sum_match:
        # print 'find sum', sum_match.groups()
        result += int(sum_match.group(1))
        index += len(sum_match.group(0))
        continue
    #-----
    if (not prod_match) and (not sum_match):
        print('None match, check input string')
        break


print(result)

I am wondering if there is a way to avoid creating the variable index above?

The algorithm does not seem correct. An input of 1*2+3*4 does not yield a correct result. It seems wrong that after resolving one multiplication you go on to resolve an addition, when in some cases you would first have to resolve more multiplications before doing any additions.

With some changes to the regular expressions and loops, you can achieve what you want as follows:

import re

input_str = '3+1*2+3*4'

# match terms, which may include multiplications
sum_re = re.compile(r"(\d+(?:\*\d+)*)(?:\+|$)")
# match factors, which can only be numbers 
prod_re = re.compile(r"\d+")

result = 0
# find terms
for sum_match in sum_re.findall(input_str):
    # for each term, determine its value by applying the multiplications
    product = 1
    for prod_match in prod_re.findall(sum_match):
        product *= int(prod_match)
    # add the term's value to the result
    result += product

print(result)
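The same idea can be wrapped in a reusable function, which also shows that no manual index variable is needed. This is only a sketch, assuming Python 3: the names evaluate, EXPR_RE and TERM_RE are made up for illustration, and the fullmatch validation step is an addition that is not part of the code above.

```python
import re

# Hypothetical names, not taken from the original post.
EXPR_RE = re.compile(r"\d+(?:[*+]\d+)*")  # whole input: numbers joined by * or +
TERM_RE = re.compile(r"\d+(?:\*\d+)*")    # one term: a number times zero or more numbers


def evaluate(expr):
    """Evaluate a string made only of non-negative integers, * and +."""
    # Assumption: reject anything that is not a well-formed expression,
    # instead of silently skipping the parts that do not match.
    if not EXPR_RE.fullmatch(expr):
        raise ValueError("not a valid expression: %r" % expr)
    total = 0
    # findall walks through the terms, so no index bookkeeping is needed.
    for term in TERM_RE.findall(expr):
        product = 1
        for factor in term.split("*"):
            product *= int(factor)
        total += product
    return total


print(evaluate("300+10*51+20+2*21"))  # 872
print(evaluate("3+1*2+3*4"))          # 17
print(evaluate("1*2*3"))              # 6
```

Validating first means malformed input such as 1++2 raises a ValueError rather than producing a partial sum.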

Explanation

This regular expression:

(\d+(?:\*\d+)*)(?:\+|$)

... matches an integer followed by zero or more multiplications:

(?:\*\d+)*

The (?: makes it a non-capture group. Without the ?: , the method findall would assign this part of the match to a separate list element, which we don't want.

\*\d+ is: a literal asterisk followed by digits.

The final (?:\+|$) is again a non-capture group; it requires either a literal + to follow, or the end of the input ( $ ).
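To make the effect of the non-capture group concrete, here is a small comparison, assuming Python 3. The variable names are chosen for illustration only.

```python
import re

expr = "3+1*2+3*4"

# Inner group is non-capturing: findall returns one string per term.
terms = re.findall(r"(\d+(?:\*\d+)*)(?:\+|$)", expr)
print(terms)  # ['3', '1*2', '3*4']

# Inner group is capturing: findall returns a tuple per match, and the
# second element only holds the last repetition of the inner group.
pairs = re.findall(r"(\d+(\*\d+)*)(?:\+|$)", expr)
print(pairs)  # [('3', ''), ('1*2', '*2'), ('3*4', '*4')]
```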

The solution to your problem should be a term, optionally preceded by a sign, followed by a list of further terms, each introduced by an additive operator, as in

[+-]?({term}([+-]{term})*)

in which each term is one factor, followed by a possibly empty list of pairs made of a multiplicative operator and another factor, like this:

{factor}([*/]{factor})*

where factor is a sequence of digits [0-9]+ , so substituting leads to:

[+-]?([0-9]+([*/][0-9]+)*([+-][0-9]+([*/][0-9]+)*)*)
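The substitution can be reproduced mechanically, which is a handy way to check the expanded pattern. A minimal sketch, assuming Python 3.6+ for f-strings; the variable names factor, term and expr are illustrative.

```python
import re

# Build the pattern bottom-up, mirroring the {factor} / {term} substitution.
factor = r"[0-9]+"
term = rf"{factor}([*/]{factor})*"
expr = rf"[+-]?({term}([+-]{term})*)"

print(expr)
# [+-]?([0-9]+([*/][0-9]+)*([+-][0-9]+([*/][0-9]+)*)*)

# The assembled pattern accepts well-formed input and rejects malformed input.
print(bool(re.fullmatch(expr, "-300+10*51/3")))  # True
print(bool(re.fullmatch(expr, "300+*51")))       # False
```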

This is one possible regexp, and it reflects the precedence structure between the operators. But it doesn't allow you to extract the individual elements, as is easily demonstrated: the regexp has only 4 groups (4 left parentheses), so you can only capture four pieces (the whole expression without its leading sign, the last factor of the first term, the last term, and the last factor of the last term). If you surround more subexpressions with parentheses you can get more, but keep in mind that the number of groups in a regexp is finite, while the arithmetic expression can be arbitrarily long.
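The limitation is easy to see by inspecting the groups of a match. A short sketch, assuming Python 3; the sample input is made up.

```python
import re

pattern = re.compile(r"[+-]?([0-9]+([*/][0-9]+)*([+-][0-9]+([*/][0-9]+)*)*)")

m = pattern.fullmatch("1*2*3+4*5+6*7*8")
# A repeated group only remembers its last repetition, so the middle
# term 4*5 and the factors *2 and *7 are not recoverable from the groups.
# Note that the outermost group spans the whole unsigned expression.
print(m.groups())
# ('1*2*3+4*5+6*7*8', '*3', '+6*7*8', '*8')
```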

That said (you cannot separate all the pieces using the regexp's group structure alone), a different approach can be taken: the leading sign is optional, and it is followed by any number of numbers, separated by either multiplicative or additive operators:

[+-]?[0-9]+([*/+-][0-9]+)*

will also do the job (it matches the same set of expressions). Moreover, since only a single operator can be interspersed between sequences of one or more digits, the regexp can be simplified to:

[-+]?[0-9]([*/+-]?[0-9])*

or, with the shorthand notation commonly used nowadays, to:

[-+]?\d([*/+-]?\d)* 
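As a sanity check, the structured pattern and the simplified one can be compared on a few inputs. This is a sketch, assuming Python 3; the sample strings are made up, and agreement on samples is of course not a proof of equivalence. One caveat: for str patterns in Python 3, \d also matches non-ASCII decimal digits unless the re.ASCII flag is given, so the flag is passed here to keep it equivalent to [0-9].

```python
import re

structured = re.compile(r"[+-]?([0-9]+([*/][0-9]+)*([+-][0-9]+([*/][0-9]+)*)*)")
simplified = re.compile(r"[-+]?\d([*/+-]?\d)*", re.ASCII)

samples = ["300+10*51+20+2*21", "-1*2", "12", "1++2", "*3", "4+", ""]

# For each sample, record whether each pattern accepts the whole string.
verdicts = {
    s: (bool(structured.fullmatch(s)), bool(simplified.fullmatch(s)))
    for s in samples
}
for s, (a, b) in verdicts.items():
    print(repr(s), a, b)
```

Both patterns accept the first three samples and reject the last four.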
