
Parsing XML file to Python list of dictionaries

This is an example of the XML file content I have to work with:

<states>
    <state name="foo">
        <and>
            <eq><text value="bar" /></eq>
            <or>
                <eqnull><text value="bar2" /></eqnull>
                <eqnull><text value="bar3" /></eqnull>
            </or>
        </and>
    </state>
</states>

This structure is unpredictable; it can change completely from one state to the next. For example, it can look like this:

<states>
    <state name="foo">
        <and>
            <or>
                <eq><text value="bar" /></eq>
                <eq><text value="bar2" /></eq>
            </or>
            <eqnull><selectedText value="bar3" number="1" /></eqnull>
        </and>
    </state>
</states>

Regardless of how unpredictable this structure is, I want to parse it into a Python list of dictionaries that looks like this (for the first XML example):

[{'and': {'eq': {'text': {'value': 'bar'}}}},
{'and': {'or': [{'eqnull': {'text': {'value': 'bar2'}}}, 
                {'eqnull': {'text': {'value': 'bar3'}}},]}}]

I tried using ElementTree and getting the content of the state structure as a dictionary using:

xmltodict.parse

and then recursively stripping this dictionary (key by key) down to my list of dictionaries. This solution is very hard for me to implement (unfortunately I'm not a Python developer...) and I am wondering if there is some easier way to do this.

I have another solution in mind: iterate through each node in the XML structure, dynamically build dictionaries and, finally, the list of dictionaries. But there is one problem: I do not know when, e.g., an eq node ends. If there were some way to recognize the closing </eq> tag, I think it would be manageable.

Or maybe there is some other way in Python that I do not know about...

Here is an example of how you could do it by recursively adding the content of each node:

import re


def findMarkup(xml_text, mainlist):
    # we look for the next tag in the string
    markup = re.search(r'<([^>]*)>', xml_text)
    if markup:
        markup_content = markup.group(1)
        begin = markup.end()
        name = markup_content.split(' ')[0]
        # we check if the markup ends itself, e.g. <text value="bar" />
        if markup_content.find('/') != -1:
            end = begin
            rest = begin
        else:
            # first closing tag with the same name (nested tags that
            # share a name are not supported)
            closing = '</{0}>'.format(name)
            end = xml_text.find(closing)
            rest = end + len(closing)
        if begin + 1 < end:
            # the node has children, its content is theirs
            inner_value = []
            findMarkup(xml_text[begin:end], inner_value)
        else:
            # the content of the current node is its attributes
            inner_value = getAttr(markup_content)
        # we add the content of the current node
        mainlist.append({name: inner_value})

        # we iterate on the rest of the string for same-level markups;
        # this stays inside the "if" because "rest" only exists when a
        # tag was found
        findMarkup(xml_text[rest:], mainlist)


def getAttr(markup_content):
    # collect every name="value" pair (word characters only) of a tag
    attr_list = re.finditer(r'(\w*)="(\w*)"', markup_content)
    attr_dict = dict()
    for attr in attr_list:
        attr_dict[attr.group(1)] = attr.group(2)
    return attr_dict

It gave me something like this (looking only inside the state content, because state itself would otherwise also be counted as a node):

[{'and': [{'eq': [{'text': {'value': 'bar'}}]}, {'or': [{'eqnull': [{'text': {'value': 'bar2'}}]}, {'eqnull': [{'text': {'value': 'bar3'}}]}]}]}]

It's not exactly the format you wanted, but you can still get at the information, I guess. Just create an empty list, put the XML content in a string, and call findMarkup(xml_in_string, empty_list) once; the list will be filled in place.
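
If the extra one-element lists get in the way, a small post-processing pass can bring the result closer to the format asked for in the question. This is only a sketch: collapse is a hypothetical helper name, not part of any library, and it assumes the nested list/dict shape shown above. Note that the outer one-element list is collapsed too, so the top-level result here is a dict.

```python
def collapse(node):
    """Recursively replace one-element lists by their single element."""
    if isinstance(node, list):
        items = [collapse(item) for item in node]
        return items[0] if len(items) == 1 else items
    if isinstance(node, dict):
        return {key: collapse(value) for key, value in node.items()}
    # plain attribute values (strings) are returned unchanged
    return node


# the structure produced for the first XML example
parsed = [{'and': [{'eq': [{'text': {'value': 'bar'}}]},
                   {'or': [{'eqnull': [{'text': {'value': 'bar2'}}]},
                           {'eqnull': [{'text': {'value': 'bar3'}}]}]}]}]

result = collapse(parsed)
print(result)
```

Lists with several children (like the content of or) are kept as lists, so sibling order is preserved.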

Note that I don't really know your end purpose, so a simple copy-paste may not be enough; you may want to refine the part where I create inner_value. Also, this code assumes that the file is perfectly well-formed, so you should add exception handling if required.
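
Since the question mentions ElementTree, here is a different technique as a rough sketch: let the standard library's xml.etree.ElementTree parse the file and walk the resulting tree recursively, so there is no need to detect where a node ends. The names element_to_dict and states are my own, and I am assuming (as in the examples) that leaf nodes carry only attributes and that parent nodes carry only children; attributes on a node that has children are dropped here.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Turn an element into {tag: attributes} or {tag: [children...]}."""
    children = list(elem)
    if not children:
        # leaf node: its content is its attributes
        return {elem.tag: dict(elem.attrib)}
    # inner node: its content is the list of its converted children
    return {elem.tag: [element_to_dict(child) for child in children]}


xml_text = """
<states>
    <state name="foo">
        <and>
            <eq><text value="bar" /></eq>
            <or>
                <eqnull><text value="bar2" /></eqnull>
                <eqnull><text value="bar3" /></eqnull>
            </or>
        </and>
    </state>
</states>
"""

root = ET.fromstring(xml_text)
# one entry per <state>, keyed by its name attribute
states = {state.get('name'): [element_to_dict(child) for child in state]
          for state in root.iter('state')}
print(states)
```

For the first example this should give the same nested shape as the regex version, one list per state, and the parser raises xml.etree.ElementTree.ParseError on malformed input instead of silently producing a wrong result.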
