
How to recursively iterparse an lxml tree without entering a node twice?

The recursive function is parseMML. I want it to parse a MathML expression into a Python one. The simple example mmlinput should produce the fraction 3/5, but instead it produces:

['(', '(', '3', ')', '/', '(', '5', ')', '(', '3', ')', '(', '5', ')', ')']

Instead of:

['(', '(', '3', ')', '/', '(', '5', ')', ')']

This happens because I don't know how to skip the elements that the recursive call has already processed. Any ideas about how to skip them?

Thanks

mmlinput='''<?xml version="1.0"?> <math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"> <mrow> <mfrac> <mrow> <mn>3</mn> </mrow> <mrow> <mn>5</mn> </mrow> </mfrac> </mrow> </math>'''


def parseMML(mmlinput):
    from lxml import etree
    from StringIO import *
    from lxml import objectify
    exppy=[]
    events = ("start", "end")
    context = etree.iterparse(StringIO(mmlinput),events=events)
    for action, elem in context:
        if (action=='start') and (elem.tag=='mrow'):
            exppy+='('
        if (action=='end') and (elem.tag=='mrow'):
            exppy+=')'
        if (action=='start') and (elem.tag=='mfrac'):
            mmlaux=etree.tostring(elem[0])
            exppy+=parseMML(mmlaux)
            exppy+='/'
            mmlaux=etree.tostring(elem[1])
            exppy+=parseMML(mmlaux)
        if action=='start' and elem.tag=='mn': #this is a number
            exppy+=elem.text
    return (exppy)

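The problem is that the subtrees inside the mfrac tag are parsed twice: once by the recursive call, and once more when the outer iterparse loop reaches the same elements. A quick fix is to track the nesting level inside mfrac and skip all events while it is non-zero: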

mmlinput = "<math> <mrow> <mfrac> <mrow> <mn>3</mn> </mrow> <mrow> <mn>5</mn> </mrow> </mfrac> </mrow> </math>"

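from io import BytesIO
from lxml import etree

def parseMML(mmlinput):
    # iterparse reads from a file-like object; etree.tostring() returns bytes,
    # so accept both text and bytes and wrap the bytes in BytesIO
    if not isinstance(mmlinput, bytes):
        mmlinput = mmlinput.encode('utf-8')
    exppy=[]
    events = ("start", "end")
    level = 0  # > 0 while inside an mfrac that was already handled recursively
    context = etree.iterparse(BytesIO(mmlinput),events=events)
    for action, elem in context:
        if (action=='start') and (elem.tag=='mfrac'):
            if level == 0:
                # outermost mfrac only: a nested mfrac is covered by the recursion.
                # Note: this relies on the children already being present at the
                # 'start' event, which iterparse only guarantees at 'end'.
                mmlaux=etree.tostring(elem[0])
                exppy+=parseMML(mmlaux)
                exppy+='/'
                mmlaux=etree.tostring(elem[1])
                exppy+=parseMML(mmlaux)
            level += 1
        if (action=='end') and (elem.tag=='mfrac'):
            level -= 1
        if level:
            continue  # skip everything inside the mfrac; it was handled above
        if (action=='start') and (elem.tag=='mrow'):
            exppy+='('
        if (action=='end') and (elem.tag=='mrow'):
            exppy+=')'
        if action=='start' and elem.tag=='mn': #this is a number
            exppy+=elem.text
    return exppy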
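
The double visit can also be avoided altogether by parsing the document once and recursing over the element tree, rather than re-serializing subtrees and feeding them back into iterparse. The following is only a minimal sketch of that idea, not the code from the question: the names to_tokens and local_name are made up for illustration, it assumes that only mrow, mfrac and mn need handling, and it falls back to the standard library's xml.etree.ElementTree if lxml is not installed. Because it strips the {namespace} prefix from elem.tag, it should also accept namespaced MathML.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib ElementTree is close enough for this sketch
    import xml.etree.ElementTree as etree


def local_name(elem):
    """Return the tag without its '{namespace}' prefix (hypothetical helper)."""
    tag = elem.tag
    if not isinstance(tag, str):  # lxml comments / processing instructions
        return ''
    return tag.rpartition('}')[2]


def to_tokens(elem):
    """Recursively turn an element into a flat list of tokens (hypothetical helper)."""
    name = local_name(elem)
    children = [child for child in elem if local_name(child)]
    if name == 'mn':  # a number: keep its text as one token
        return [(elem.text or '').strip()]
    if name == 'mfrac':
        # assumes exactly two children: numerator and denominator
        numerator, denominator = children
        return to_tokens(numerator) + ['/'] + to_tokens(denominator)
    tokens = []
    for child in children:
        tokens.extend(to_tokens(child))
    if name == 'mrow':
        return ['('] + tokens + [')']
    return tokens  # e.g. the outer <math> element adds nothing itself


MML = ('<math xmlns="http://www.w3.org/1998/Math/MathML">'
       '<mrow><mfrac><mrow><mn>3</mn></mrow><mrow><mn>5</mn></mrow></mfrac></mrow>'
       '</math>')

tokens = to_tokens(etree.fromstring(MML))
print(tokens)  # ['(', '(', '3', ')', '/', '(', '5', ')', ')']
```

Each element is visited exactly once here, so no level counter is needed, and nested fractions are handled by the same recursion.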

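Note: I had to remove the namespace to make this work, because elem.tag returns the fully qualified tag name (for example {http://www.w3.org/1998/Math/MathML}mrow) when a namespace is declared. Also, you're using += to add strings to a list. That happens to work for single-character strings, but += on a list works like calling extend, so a longer string is split into its characters: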

>>> lst = []
>>> lst += 'spam'
>>> lst
['s', 'p', 'a', 'm']
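
To add a string as a single item, append it, or wrap it in a list before using +=. A short illustration (the variable names chars and whole are only for this example):

```python
chars = []
chars += '35'            # a str is iterable, so each character becomes an item
print(chars)             # ['3', '5']

whole = []
whole.append('35')       # append adds the string as a single item
whole += ['/']           # += is fine when the right-hand side is a list
whole.append('7')
print(whole)             # ['35', '/', '7']
```

This matters for the parser above as soon as an mn element holds a number with more than one digit.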
