Pyparsing - Finding Nested Polynomials

Question

I'm searching through some algebra and trying to match all expressions of the form:

(subexp)^C

Where C is an integer and subexp can be one of two things:

a) It can be another expression of the form (subexp)^C.

b) It can be an expression of the form var1 op var2 op var3 ... op varn,

where a var is letters followed by digits (such as l2, cd3, hello53, etc.) and an op is one of -, *, or +. This second option can also have terms grouped by parentheses.

(The real input contains no whitespace; I've added some in places above only for clarity.)

So, as an example:

(a12 + c33 + d34)^2
(a12 * c33 +-d34)^2
((a12 * c33)^5 + c3)^2

etc.

Expressions of this form will be embedded in a line of text. I need to find all instances of (subexp)^C, and replace them with pow(subexp, C). In short, I'm trying to convert some computer algebra to functioning C code.

I was originally doing this with a regexp, but realized the language isn't regular. For non-nested cases, the regexp is:

line = re.sub(r'\(([a-zA-Z_]+[0-9]+\+?-?\*?)+\)\^[0-9]+', replace_with_pow, line)

Here line is the line with the embedded polynomials, and replace_with_pow is the function that does replacement.

Unfortunately, once expressions can be nested, the language is context-free rather than regular, so it needs a CFG instead of a regexp.
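To make the limitation concrete, here is a small sketch that runs the flat-case pattern on one flat and one nested input. The body of replace_with_pow is not shown in the question, so the version below is a hypothetical stand-in that only illustrates the idea.

```python
import re

# The flat-case pattern from the question, unchanged.
FLAT_POWER = re.compile(r'\(([a-zA-Z_]+[0-9]+\+?-?\*?)+\)\^[0-9]+')

def replace_with_pow(match):
    # Hypothetical stand-in: the question does not show this function.
    # Split "(body)^C" at the last ")^" and drop the leading "(".
    body, _, exponent = match.group(0).rpartition(')^')
    return 'pow(%s, %s)' % (body[1:], exponent)

flat = FLAT_POWER.sub(replace_with_pow, '(a12+c33+d34)^2')
nested = FLAT_POWER.sub(replace_with_pow, '((a12*c33)^5+c3)^2')

print(flat)    # pow(a12+c33+d34, 2)
print(nested)  # (pow(a12*c33, 5)+c3)^2
```

Only the innermost group is rewritten in the nested case. Running the substitution a second time does not help, because the outer group now contains pow(...), which the pattern cannot match.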

I have looked into pyparsing but have found the examples to be, well, difficult to parse, and the documentation lacking. That seems to be the recommended library though.

Can anyone provide an example script showing how to find nested expressions and replace them? (It can be for a simplified problem if you'd like; I can build off it.)

EDIT: Update: with pyparsing, I can now parse all nested expressions that have the form { stuff ( ...)^C stuff }, using the following code:

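import pyparsing

# closing matches any run of ')', '^' and digits, so both ")" and ")^5" close a group
closing = pyparsing.Word( ")^" + pyparsing.nums )
thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-' | '*' | ','
parens     = pyparsing.nestedExpr( '(', closing, content=thecontent)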

thecontent2 = thecontent | parens
parens2 = pyparsing.nestedExpr('{', '}', content=thecontent2)

res = parens2.parseString("{sdf(a + (a + b)^5 + c)asdf}")

This brings me to two questions:

a) When I have matched my ^5, the parser consumes it. How can I extract the ^5?

b) Is there a quick / easy way to do replacement with pyparsing?

Answer 1

The first step in solving almost any parsing problem is coming up with a precise definition of the syntax to be parsed. If that syntax is context-free, then a context-free grammar is an excellent way of describing it, usually much better than informal descriptions or catalogs of examples.

In this question, your example

a12 * c33 +-d34

does not fit the description

var op var op var ... op var

because it has two ops side by side. And in the example

((a12 * c33)^5 + c3)^2

the subexp (a12 * c33)^5 + c3 is not var op var either; rather it is (subexp) op var. (I know, your text mentions that "terms can be grouped by parentheses", but it fails to mention that the "term" can actually be a subexp since it might be an exponentiation, as in your example.)

A more precise grammar might be (if I'm guessing correctly), written in "yacc" syntax:

val : IDENTIFIER
    | '(' expr ')'
term: val
    | val '^' INTEGER
    | '-' val
prod: term
    | term '*' prod
expr: prod
    | expr '+' prod
    | expr '-' prod

The above does not allow --var, var^I1^I2, or even -var^I. I have no idea from your description whether you would like those to work, but it would be easy to modify. Also, I would have thought that numeric literals would be acceptable, and not just variables, but again that's not mentioned in your problem description (and it would just need to be added to val).

You might not actually need to parse so precisely, since you seem to be planning to just generate C code and thus dealing with operator precedence is unnecessary. On the other hand, you may someday wish to do algebraic transforms, in which case the more precise parse is necessary, and in any case it doesn't cost much.

Once you have the grammar, you could use ply (for example) to turn it into an executable parser.

(Here's a ply file I threw together. But it's not very good style :( )

from ply import lex, yacc
class Lexer(object):
  tokens = ('INTEGER', 'IDENTIFIER')
  literals = '+ - * ( ) ^'.split()
  t_ignore = ' \t\n'
  t_INTEGER = r'[1-9][0-9]*'
  t_IDENTIFIER = r'[a-z]+[0-9]+'

  def t_error(self, t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

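  def build(self, **kwargs):
    # Return the lexer so that Lexer().build() yields a usable object
    # (the original returned None and relied on ply's global default lexer).
    self.lexer = lex.lex(module=self, **kwargs)
    return self.lexer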

# Parser starts here

tokens = Lexer.tokens
start = 'expr'

def p_unit(p):
  '''val  : IDENTIFIER 
     term : val      
     prod : term    
     expr : prod   
  '''
  p[0] = p[1]

def p_paren(p):
  '''val  : '(' expr ')' '''
  p[0] = p[2]

def p_binary(p):
  '''expr : expr '+' prod
     expr : expr '-' prod
     prod : term '*' prod
  '''
  p[0] = '(%s%s%s)' %(p[1], p[2], p[3])

def p_pow(p):
  '''term : val '^' INTEGER'''
  p[0] = 'pow(%s, %s)' % (p[1], p[3])

def p_unary(p):
  '''term : '-' val'''
  p[0] = '(-%s)' % p[2]

parser = yacc.yacc()
lexer = Lexer().build()
def parse(text):
  return parser.parse(text, lexer=lexer)

if __name__ == '__main__':
  from sys import argv
  for text in argv[1:]:
    print(text + ' => ' + parse(text))

Quick test:

$ python exp.py '(a12 + c33 + d34)^2' '(a12 * c33 +-d34)^2'
(a12 + c33 + d34)^2 => pow(((a12+c33)+d34), 2)
(a12 * c33 +-d34)^2 => pow(((a12*c33)+(-d34)), 2)
$ python exp.py '((a12 * c33)^5 + c3)^2'
((a12 * c33)^5 + c3)^2 => pow((pow((a12*c33), 5)+c3), 2)
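If ply is not available, the same grammar can also be sketched as a hand-written recursive-descent parser using only the standard library. This swaps in a different technique from the ply file above. The names tokenize, Parser and convert are made up for illustration, and the left-recursive expr rule is rewritten as a loop.

```python
import re

# Terminals of the grammar: IDENTIFIER, INTEGER, and the single-character literals.
TOKEN_RE = re.compile(r'\s*(?:([a-z]+[0-9]+)|([1-9][0-9]*)|([-+*()^]))')

def tokenize(text):
    """Split text into (kind, value) pairs; raise on anything unexpected."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError('illegal input at position %d' % pos)
        ident, integer, literal = m.groups()
        if ident:
            tokens.append(('IDENTIFIER', ident))
        elif integer:
            tokens.append(('INTEGER', integer))
        else:
            tokens.append((literal, literal))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        if self.peek() != kind:
            raise SyntaxError('expected %s at token %d' % (kind, self.pos))
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def val(self):   # val : IDENTIFIER | '(' expr ')'
        if self.peek() == '(':
            self.take('(')
            inner = self.expr()
            self.take(')')
            return inner
        return self.take('IDENTIFIER')

    def term(self):  # term : val | val '^' INTEGER | '-' val
        if self.peek() == '-':
            self.take('-')
            return '(-%s)' % self.val()
        base = self.val()
        if self.peek() == '^':
            self.take('^')
            return 'pow(%s, %s)' % (base, self.take('INTEGER'))
        return base

    def prod(self):  # prod : term | term '*' prod
        left = self.term()
        if self.peek() == '*':
            self.take('*')
            return '(%s*%s)' % (left, self.prod())
        return left

    def expr(self):  # expr : prod | expr '+' prod | expr '-' prod
        result = self.prod()
        while self.peek() in ('+', '-'):
            op = self.take(self.peek())
            result = '(%s%s%s)' % (result, op, self.prod())
        return result

def convert(text):
    parser = Parser(tokenize(text))
    result = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError('unexpected trailing input')
    return result

print(convert('((a12 * c33)^5 + c3)^2'))  # pow((pow((a12*c33), 5)+c3), 2)
```

Like the grammar, this sketch raises SyntaxError on --var, var^I1^I2 and -var^I.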

Answer 2

nestedExpr is not really the way to go here; that helper in pyparsing is meant for cases where the items inside the nested punctuation are not well defined. In your case, you are better off defining your own nesting using pyparsing Forward() expressions. See below:

from pyparsing import *

# define arithmetic items
identifier = Word(alphas, alphanums+'_')
integer = Word(nums)
real = Regex(r'\d+\.\d*')
oper = oneOf("* + - /")

# define arithOperand as a Forward, since it could contain nested power expression
arithOperand = Forward()
arithExpr = arithOperand + ZeroOrMore(oper + arithOperand)
groupedArithExpr = '(' + arithExpr + ')'

# define expression for x^y, where x could be any kind of arithmetic term
powerOp = Literal('^')
powerExpr = (groupedArithExpr|real|integer|identifier) + powerOp + integer
powerExpr.setParseAction(lambda tokens: 'pow(%s,%s)' % (tokens[0], tokens[2]))

# now define the possible expressions for arithOperand, including a powerExpr
arithOperand <<= powerExpr | real | integer | identifier | groupedArithExpr

# convert parsed list of strings to a single string
groupedArithExpr.setParseAction(''.join)

# show how transform string will apply parse actions as transforms
print(arithOperand.transformString("x = (4*(1 + 3^2) * a)^10"))
print()

prints x = pow((4*(1+pow(3,2))*a),10)

arithExpr.runTests("""\
    (a12 + c33 + d34)^2
    (a12 * c33 +-d34)^2
    (a12 * (c33 + c3))^2
    (a12 * (c33 + c3)^4)^2
    ((a12 * c33 + 12)^5 + c3)^2""")

prints

(a12 + c33 + d34)^2
['pow((a12+c33+d34),2)']

(a12 * c33 +-d34)^2
       ^
Expected ")" (at char 11), (line:1, col:12)

(a12 * (c33 + c3))^2
['pow((a12*(c33+c3)),2)']

(a12 * (c33 + c3)^4)^2
['pow((a12*pow((c33+c3),4)),2)']

((a12 * c33 + 12)^5 + c3)^2
['pow((pow((a12*c33+12),5)+c3),2)']

Note the use of transformString above - this will search your source code for matches and splice the modified code back in where the match was found.
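For comparison, the search-and-splice idea can be approximated without pyparsing by scanning for ")^C" and walking back to the balancing "(". This is only a simplified stand-in and not how transformString works internally. The names matching_open and replace_powers are hypothetical, and the sketch assumes balanced parentheses and no chained exponents such as (a)^2^3.

```python
import re

# A closing parenthesis immediately followed by ^ and an integer exponent.
POWER_SUFFIX = re.compile(r'\)\^([0-9]+)')

def matching_open(text, close_index):
    """Return the index of the '(' that balances the ')' at close_index."""
    depth = 0
    for i in range(close_index, -1, -1):
        if text[i] == ')':
            depth += 1
        elif text[i] == '(':
            depth -= 1
            if depth == 0:
                return i
    raise ValueError('unbalanced parentheses')

def replace_powers(line):
    """Rewrite every (subexp)^C in line as pow(subexp,C).

    The leftmost ")^C" always closes a group with no unconverted power
    groups inside it, so inner groups are rewritten before outer ones.
    Assumption: no chained exponents like (a)^2^3 appear in the input.
    """
    while True:
        m = POWER_SUFFIX.search(line)
        if m is None:
            return line
        start = matching_open(line, m.start())
        inner = line[start + 1:m.start()]
        line = line[:start] + 'pow(%s,%s)' % (inner, m.group(1)) + line[m.end():]

print(replace_powers('y = ((a12*c33)^5+c3)^2 + (a12+c33+d34)^2;'))
# y = pow(pow(a12*c33,5)+c3,2) + pow(a12+c33+d34,2);
```

Unlike the grammar-based approaches, this does no validation of what sits inside the parentheses, so it is best suited to input that is already known to be well formed.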

solution1
3 2015-11-01 00:05:45

solution2
2 ACCEPTED 2015-11-01 20:56:07
