Dynamic search through xml attributes using lxml and xpath in python

I am working to move nested XML data into a hierarchical data frame. I was able to get all of the data out of the XML thanks to help on SO. However, I am now working to clean up the extracted data and shape it before output, because I will be doing this thousands of times.

UPDATED: THIS IS WHAT I EVENTUALLY WANT OUT. I cannot seem to fetch just the Time and value for each channel dynamically. The channel names will change for each file.

From channel = txt1[0] (for this file, txt1[0]="blah") through channel = txt1[8] (for this file, txt1[8]="lir"):

    channel      Time                    value
0     blah     2013-05-01 00:00:00    258
1     blah     2013-05-01 00:01:00    259
...
n-2   lir      2013-05-01 23:57:00    58
n-1   lir      2013-05-01 23:58:00    37
n     lir      2013-05-01 23:59:00    32

Here is how my XML file is fetched and structured:

import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?   security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
# Output (excerpt):
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah"> 
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>

Yesterday, I asked here on SO and solved the problem of getting the Time and value values into a data frame: "Parsing xml to pandas data frame throws memory error"

from pandas import DataFrame

dTime = []
dvalue = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of Time but Time only has one attrib [@Time]
    for attrib in df.attrib:
        dTime.append(df.attrib[attrib])
    ## value is a child of Time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        dvalue.append(subfield.text)
pef = DataFrame({'Time': dTime, 'value': dvalue})

pef

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time     12960  non-null values
value    12960  non-null values
dtypes: object(2) 

pef[:5]

    Time                    value
0    2013-05-01 00:00:00    258
1    2013-05-01 00:01:00    259
2    2013-05-01 00:02:00    258
3    2013-05-01 00:03:00    257
4    2013-05-01 00:04:00    257

Now, I am working to get this data out for each of the channels separately (the structure is channel -> Time -> value), so that I can insert the channel as a column of the data set.

So, I decided to get the channel names dynamically and search through the data. For this file, there are nine separate valid channel names, but neither the number nor the names are the same across files.

txt1 = root.xpath('//channel/@name') #this prints all channel names!
len(txt1)
Out[67]: 9
print txt1
['blah', 'b', 'c', 'd', 'vd', 'ef', 'fg', 'kc', 'lir']

I thought I could fetch the data dynamically (using the earlier solution but adding @name=txt1[0]) and eventually loop with for i = 0 to len(txt1), ... to go through all of them. But I get an empty data frame:

dTime = []
dchannel = txt1[0] # can hardcode, but need to be able to get all
dvalue = []
for df in root.xpath('//channel[@name=txt1[0]]/Time'):
    #CODE NEEDED: to get dchannel to dynamically = channel[@name]
    ## Iterate over attributes of Time for specific channel
    for attrib in df.attrib:
        dTime.append(df.attrib[attrib])
    ## value is a child of Time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        dvalue.append(subfield.text)
perf = DataFrame({'Channel': dchannel, 'Time': dTime, 'value': dvalue})

perf

Int64Index([], dtype=int64)
Empty DataFrame

If I hard-code the desired attribute, as in for df in root.xpath('/*/*/*/channel[@name="blah"]/Time'):, it returns the data for that one channel, but I cannot get it to work when referencing txt1[].

I tried string formatting with {0}..., txt1[], but then it spits out a tuple for the dchannel attribute (because it is getting all of txt1 instead of the one name attribute of the channel that is the parent of the Time node).

I looked over the XPath documentation and went through the lxml tutorial, and I cannot figure out why my dynamic search does not work. Do I need to fall back to .findall()? How can I use this dynamic search to get the data for each value in txt1?

There is probably a more Pythonic way to approach this, such as setting up a function that gets the attribute [@name] of the parent, the attribute [@Time] of the child, and then the text of the grandchild value, but I have not figured out how to do that yet.
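A likely reason the dynamic search comes back empty: inside the quoted XPath string, txt1[0] is never evaluated by Python. XPath reads it as "a child element named txt1", which does not exist, so the predicate matches nothing. One way around this, sketched below, is lxml's XPath variables: write $name in the expression and pass the value as a keyword argument. The sample XML is a small stand-in that mirrors the structure above; the second channel's id (97926) and the variable names rows and literal are made up for illustration.

```python
from lxml import etree

# Stand-in document mirroring the structure shown in the question.
# The channel id 97926 is invented for this example.
SAMPLE = b"""<root>
  <data>
    <channel channel="97925" name="blah">
      <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
      <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
    </channel>
    <channel channel="97926" name="lir">
      <Time Time="2013-05-01 00:00:00"><value>67</value></Time>
    </channel>
  </data>
</root>"""

root = etree.fromstring(SAMPLE)
names = root.xpath('//channel/@name')  # channel names, in document order

rows = []
for name in names:
    # $name is an XPath variable; lxml binds it from the keyword argument,
    # so no string formatting or quoting is needed.
    for time_el in root.xpath('//channel[@name=$name]/Time', name=name):
        rows.append((name, time_el.get('Time'), time_el.findtext('value')))

# For comparison: here txt1[0] is parsed as XPath, not as Python,
# so the predicate looks for a <txt1> child element and matches nothing.
literal = root.xpath('//channel[@name=txt1[0]]/Time')
```

With this sample, rows holds one (channel, Time, value) tuple per Time element, and literal is an empty list, which matches the empty data frame above. Building the expression with str.format and a single name (not the whole txt1 list) should also work, but the variable form avoids quoting problems if a channel name ever contains a quote character.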

Okay, I solved this, but the solution is still ugly.

I'm glad to have figured out how to get the output I want. If anyone has a cleaner method, I would LOVE to see it. Thanks.

dTime = []
dchannel = []
dvalue = []
for df in root.xpath('//channel/Time'):
    ## The channel name is an attribute of the parent of each Time node
    dchannel.append(df.getparent().attrib['name'])
    ## Iterate over attributes of Time for specific channel
    for attrib in df.attrib:
        dTime.append(df.attrib[attrib])
    ## value is a child of Time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        dvalue.append(subfield.text)
perf = DataFrame({'Channel': dchannel, 'Time': dTime, 'value': dvalue})

perf[:3]
   Channel     Time                    value
0    blah        2013-05-01 00:00:00    258
1    blah        2013-05-01 00:01:00    259
2    blah        2013-05-01 00:02:00    258

perf[12957:12960]
   Channel     Time                    value
12957   lir      2013-05-01 00:00:00    67
12958   lir      2013-05-01 00:01:00    67
12959   lir      2013-05-01 00:02:00    66

YAY
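Since this has to run on thousands of files, the same idea can be wrapped in a function. The following is only a sketch of one possible cleanup, not a tested drop-in: the function name channels_to_frame and the sample XML (including the channel id 97926) are invented for illustration. It builds one record per Time element and reads the channel name from the parent, as in the solution above.

```python
from lxml import etree
import pandas as pd


def channels_to_frame(xml_bytes):
    """Return a DataFrame with one row per <Time> element.

    Columns: Channel (parent's name attribute), Time (the Time
    attribute), value (text of the <value> child, left as a string).
    """
    root = etree.fromstring(xml_bytes)
    records = [
        {
            'Channel': time_el.getparent().get('name'),
            'Time': time_el.get('Time'),
            'value': time_el.findtext('value'),
        }
        for time_el in root.iterfind('.//channel/Time')
    ]
    # Passing columns= fixes the column order and keeps the columns
    # present even when the file contains no Time elements.
    return pd.DataFrame(records, columns=['Channel', 'Time', 'value'])


# Stand-in document; the channel id 97926 is invented for this example.
SAMPLE = b"""<root>
  <data>
    <channel channel="97925" name="blah">
      <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
      <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
    </channel>
    <channel channel="97926" name="lir">
      <Time Time="2013-05-01 00:00:00"><value>67</value></Time>
    </channel>
  </data>
</root>"""

frame = channels_to_frame(SAMPLE)
```

With the requests call from the question, the input would presumably be r.content (bytes) rather than r.text, since lxml refuses unicode strings that carry an XML encoding declaration. The values stay as strings here; converting Time and value to proper dtypes can be done afterwards on the frame.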
