简体   繁体   中英

Dynamic search through xml attributes using lxml and xpath in python

I am working to move nexted xml data into a hierarchical data frame. I was able to get all of the data out of the xml thanks to help on SO. However, now, I am working to clean up the data that I extract and shape it before output because I will be doing this thousands of times.

UPDATED: THIS IS WHAT I EVENTUALLY WANT OUT. I cannot seem to fetch just the Time and value for channel dynamically. The channel names will change for each file.

When channel = txt1[0] (for this file, txt1[0]="blah" ) through when channel = txt1[8] (for this file, txt1[8]="lir" )

    channel      Time                    value
0     blah     2013-05-01 00:00:00    258
1     blah     2013-05-01 00:01:00    259
...
n-2   lir      2013-05-01 23:57:00    58
n-1   lir      2013-05-01 23:58:00    37
n     lir      2013-05-01 23:59:00    32

Here is how my xml file is fetched and structured:

import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?   security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah"> 
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>

Yesterday, I asked here on SO and solved the problem of getting the time and value values into a data frame: Parsing xml to pandas data frame throws memory error

dTime=[]
dvalue=[]
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of Time but Time only has one attrib [@Time]
    for attrib in df.attrib:
    dTime.append(df.attrib[attrib])
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
    dvalue.append(subfield.text)
pef=DataFrame({'Time':dTime,'values':dvalue})

pef

&ltclass 'pandas.core.frame.DataFrame'&gt
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time     12960  non-null values
value    12960  non-null values
dtypes: object(2) 

pef[:5]

    Time                    value
0    2013-05-01 00:00:00    258
1    2013-05-01 00:01:00    259
2    2013-05-01 00:02:00    258
3    2013-05-01 00:03:00    257
4    2013-05-01 00:04:00    257

Now, I am working to this data out for each of the channels (structure is channel -> Time -> value ) separately, so that I can insert the channel as a column of the data set.

So, I decided to get the channel names dynamically, and search through the data.For this file, there are nine separate valid channel names, but it is not the same for all of the files (number or names).

txt1 = root.xpath('//channel/@name') #this prints all channel names!
len(txt1)
Out[67]: 9
print txt1
['blah', 'b', 'c', 'd', 'vd', 'ef', 'fg', 'kc', 'lir']

I thought I could dynamically fetch the data (using the earlier solution but adding @name=txt1[0] ) and eventually doing a for i = 0 to len(txt1), ... to go through all of them. But I get an empty data frame:

dTime=[]
dchannel = txt1[0] # can hardcode, but need to be able to get all
dvalue=[]
for df in root.xpath('//channel[@name=txt1[0]]/Time'):
    #CODE NEEDED: to get dchannel to dynamically = channel[@name]
    ## Iterate over attributes of time for specific channel
    for attrib in df.attrib:
    dTime.append(df.attrib[attrib])
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
    dvalue.append(subfield.text)
perf=DataFrame({'Channel': dchannel,'Time':dTime,'values':dvalue})

perf

Int64Index([], dtype=int64)
Empty DataFrame

If I hard code the desired attribute, like for df in root.xpath('/*/*/*/channel[@name="blah"]/Time'): it will print it for one attribute, but I cannot get it to work referencing txt1[] .

I tried with reference to {0}..., txt1[] but then it spits out a tuple for the dchannel attribute (because it is getting all of txt1 instead of getting the txt1 attribute name that is the parent of the time node.

I looked over the XPath documentation, and I have been through the lxml tutorial, and I cannot figure out why my dynamic search does not work. Do I need to fall back to .findall() ? How can I use this dynamic search to get the data for each value in txt1 ?

There is probably a more pythonic way to approach this such as setting up a function that gets the attribute [@name] of the parent, the attribute [@Time] of the child, and then the text of the grandchild value , but I have not figured out how to do that yet.

Okay, I solved this - but solution is still ugly.

I'm glad to have figured it out to get the output I want. If anyone has a cleaner method, I would LOVE to see it. Thanks.

dTime=[]
dchannel = []
dvalue=[]
for df in root.xpath('//channel/Time'):
    dchannel.append(df.getparent().attrib['name'])
    ## Iterate over attributes of time for specific channel
    for attrib in df.attrib:
    dTime.append(df.attrib[attrib])
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
    dvalue.append(subfield.text)
perf=DataFrame({'Channel': dchannel,'Time':dTime,'values':dvalue})

perf[:2]
   Channel     Time                    value
0    blah        2013-05-01 00:00:00    258
1    blah        2013-05-01 00:01:00    259
2    blah        2013-05-01 00:02:00    258

perf[12957:12960]
   Channel     Time                    value
12957   lir      2013-05-01 00:00:00    67
12958   lir      2013-05-01 00:01:00    67
12959   lir      2013-05-01 00:02:00    66

YAY

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM