简体   繁体   中英

parsing hierarchical data from XML to pandas?

I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.

A sample of the XML looks like:

<?xml version="1.0"?>
-<Hilltop>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
-<Data>
-<E><O>-687.500</O><I1>95.837</I1></E>
-<E><O>-680.100</O><I1>95.773</I1></E>
</Data>
</Section>

-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
-<Data>
-<E><O>0.000</O><I1>91.138</I1></E>
-<E><O>5.600</O><I1>91.206</I1></E>
-<E><O>11.900</O><I1>91.012</I1></E>
</Data>
</Section>

<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
<Data>
<E><O>-99.100</O><I1>98.045</I1></E>
<E><O>-89.900</O><I1>98.334</I1></E>
<E><O>-85.900</O><I1>97.327</I1></E>
<E><O>-75.400</O><I1>94.177</I1></E>
<E><O>-53.100</O><I1>95.012</I1></E>
</Data>
</Section>

<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
<Data>
<E><O>0.000</O><I1>95.001</I1></E>
<E><O>4.600</O><I1>94.936</I1></E>
<E><O>14.800</O><I1>94.069</I1></E>
</Data>
</Section>
</Hilltop>

I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

def main():
    """ main """
    parsed_xml = et.parse(###my path and file###)
    dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():
        Survey = node.attrib.get('SurveyTime')
        XSnum = node.attrib.get('Section SiteName')
        STA = node.find('O')
        ELEV = node.find('I1')
        CODE = node.find('Code')

        df_xml = df_xml.append(
            pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
                       getvalueofnode(STA), getvalueofnode(ELEV), 
                       getvalueofnode(CODE)], index=dfcols),
            ignore_index=True)

    print(df_xml)

main()

It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows... screen capture of present output df

I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.

Using the above XML data as an example, I'm looking for the data like: screen capture of desired output

Any assistance getting this sorted is greatly appreciated. Cheers!


Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.

While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).

I've re-grouped and am at least able to extract data with:

import xml.etree.ElementTree as ET
import numpy as np

path = "somepath"
file = "somefile.xml"

tree = ET.parse(path + file)
root = tree.getroot()

So far, so good. I can get a list of each discrete survey and convert to an array:

surveys =[]
for d in root.findall("./Section/SurveyTime"):
    date = d.text[:-9]
    surveys.append(date)
surv_arr = np.array(surveys)

The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.

station = []
for s in root.findall("./Section/Data/E/O"):
    sta = float(s.text)
    station.append(sta)
sta_arr = np.array(station)

But I need the "sta" values for each survey grouped in their own array (eg so "sta_arr" is a 1 xn array where n is the number of surveys). So I try a nested for loop:

station = []

for section in root.findall('./Section'):
    source_items = section.find('DataSource').attrib
    date = section.find('SurveyTime').text[:-9]
    for s in section.findall("./Section/Data/E/O"):
        sta = float(s.text)
        station.append(sta)

It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):

station = []

for section in root.findall('./Section'):
    source_items = section.find('DataSource').attrib
    date = section.find('SurveyTime').text[:-9]
    for d in date:
        sta = d.find('.Section/Data/E/O')
        station.append(sta)

The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.

Any thoughts?

xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.

import xmltodict
import pandas as pd
xs, survey, station, elevation  = [], [], [], []
with open(path + file) as extract:
    orddict = xmltodict.parse(extract.read())
num_surveys = len(orddict['Hilltop']['Section'])
for i in range(0,num_surveys):
    XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
    surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
    sta_temp = []
    elev_temp = []
    data = orddict['Hilltop']['Section'][i]['Data']['E']
    for d in data:
        sta = float(d['O'])
        elev = float(d['I1'])
        sta_temp.append(sta)
        elev_temp.append(elev)
    xs.append(XS_id)
    survey.append(surv)
    station.append(sta_temp)
    elevation.append(elev_temp)
df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})

Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).

Enjoy!!!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM