
Creating a pandas data frame from XML data

I am dealing with an XML data file that contains the tracking data of players during a football match. Here is a snippet from the top of the XML data file:

<?xml version="1.0" encoding="utf-8"?>
<Tracking update="2017-01-23T14:41:26">
  <Match id="2019285" dateMatch="2016-09-13T18:45:00" matchNumber="13">
    <Competition id="20159" name="UEFA Champions League 2016/2017" />
    <Stadium id="85265" name="Estádio do SL Benfica" pitchLength="10500" pitchWidth="6800" />
    <Phases>
      <Phase start="2016-09-13T18:45:35.245" end="2016-09-13T19:31:49.09" leftTeamID="50157" />
      <Phase start="2016-09-13T19:47:39.336" end="2016-09-13T20:37:10.591" leftTeamID="50147" />
    </Phases>
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
          <Obj type="0" id="250086090" x="1194" y="1425" sampling="0" />
          <Obj type="0" id="250080473" x="37" y="2875" sampling="0" />
          <Obj type="0" id="250054760" x="329" y="833" sampling="0" />
          <Obj type="1" id="98593" x="-978" y="654" sampling="0" />
          <Obj type="0" id="250075765" x="1724" y="392" sampling="0" />
          <Obj type="1" id="53733" x="-4702" y="45" sampling="0" />
          <Obj type="0" id="250101112" x="54" y="1436" sampling="0" />
          <Obj type="1" id="250017920" x="-46" y="-2562" sampling="0" />
          <Obj type="1" id="105588" x="-1449" y="209" sampling="0" />
          <Obj type="1" id="250003757" x="-2395" y="-308" sampling="0" />
          <Obj type="1" id="101473" x="-690" y="-644" sampling="0" />
          <Obj type="0" id="250075775" x="2069" y="-895" sampling="0" />
          <Obj type="1" id="103695" x="-1654" y="-2022" sampling="0" />
          <Obj type="0" id="250073809" x="4712" y="-16" sampling="0" />
          <Obj type="1" id="63733" x="-2393" y="1145" sampling="0" />
          <Obj type="0" id="250015755" x="-42" y="31" sampling="0" />
          <Obj type="0" id="250055905" x="1437" y="-2791" sampling="0" />
          <Obj type="0" id="250042422" x="1169" y="-1250" sampling="0" />
        </Objs>
      </Frame>
      <Frame utc="2016-09-13T18:45:35.319" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2558" z="0" sampling="0" />
          <Obj type="0" id="105823" x="938" y="113" sampling="0" />
          <Obj type="0" id="250086090" x="1198" y="1426" sampling="0" />
          <Obj type="0" id="250080473" x="36" y="2874" sampling="0" />
          <Obj type="0" id="250054760" x="330" y="833" sampling="0" />
          <Obj type="1" id="98593" x="-980" y="654" sampling="0" />
          <Obj type="0" id="250075765" x="1727" y="393" sampling="0" />
          <Obj type="1" id="53733" x="-4712" y="44" sampling="0" />
          <Obj type="0" id="250101112" x="54" y="1435" sampling="0" />
          <Obj type="1" id="250017920" x="-46" y="-2558" sampling="0" />
          <Obj type="1" id="105588" x="-1449" y="209" sampling="0" />
          <Obj type="1" id="250003757" x="-2396" y="-310" sampling="0" />
          <Obj type="1" id="101473" x="-692" y="-645" sampling="0" />
          <Obj type="0" id="250075775" x="2071" y="-896" sampling="0" />
          <Obj type="1" id="103695" x="-1655" y="-2016" sampling="0" />
          <Obj type="0" id="250073809" x="4712" y="-17" sampling="0" />
          <Obj type="1" id="63733" x="-2395" y="1145" sampling="0" />
          <Obj type="0" id="250015755" x="-42" y="29" sampling="0" />
          <Obj type="0" id="250055905" x="1435" y="-2793" sampling="0" />
          <Obj type="0" id="250042422" x="1169" y="-1250" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>

From my understanding, this is how the file breaks down:

  • The root element is Tracking.
  • Match is the child of Tracking.
  • Competition, Stadium, Phases and Frames are the children of Match.
  • Phase is the child of Phases.
  • Frame is the child of Frames.
  • There are many Frame children within Frames. In fact, there is a Frame child for every 45 milliseconds of the entire football game. Within each Frame child are the positions of every player, the referees and the ball. The actual file continues for thousands and thousands of lines of data; this snippet shows only the first two frames.
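The hierarchy described above can be checked programmatically. The following is a minimal sketch, assuming a trimmed-down copy of the XML held in a string; the names `SAMPLE`, `outline` and `tree_lines` are made up for illustration and do not appear in the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample with the same nesting as the file in the question
# (hypothetical: only one phase, one frame and two objects are kept).
SAMPLE = """
<Tracking update="2017-01-23T14:41:26">
  <Match id="2019285" dateMatch="2016-09-13T18:45:00" matchNumber="13">
    <Competition id="20159" name="UEFA Champions League 2016/2017" />
    <Phases>
      <Phase start="2016-09-13T18:45:35.245" end="2016-09-13T19:31:49.09" leftTeamID="50157" />
    </Phases>
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>
"""


def outline(element, depth=0):
    """Return one indented line per element: its tag and attribute count."""
    lines = [f"{'  ' * depth}{element.tag} ({len(element.attrib)} attributes)"]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The outline makes it easy to see which elements carry attributes themselves (Match, Phase, Frame, Obj) and which are only containers (Phases, Frames, Objs).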

I am trying to run the following code to see all the data in the Match child:

for x in myroot[0]:
        print(x.tag,x.attrib,x.text)

This is the output:

Competition {'id': '20159', 'name': 'UEFA Champions League 2016/2017'} None
Stadium {'id': '85265', 'name': 'Estádio do SL Benfica', 'pitchLength': '10500', 'pitchWidth': '6800'} None
Phases {} 

Frames {} 

As you can see, the output shows two empty dictionaries for Phases and Frames. How would I get the data from these children?
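A likely explanation for the empty dictionaries is that Phases and Frames are container elements with no attributes of their own, and the loop only visits the direct children of Match. The data sits one or more levels deeper. The sketch below, which assumes a shortened copy of the XML in a string (`SAMPLE`, `phases` and `frames` are illustrative names), uses `findall` with a path to reach the nested elements.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the file: two phases, two frames,
# two objects per frame.
SAMPLE = """
<Tracking update="2017-01-23T14:41:26">
  <Match id="2019285" dateMatch="2016-09-13T18:45:00" matchNumber="13">
    <Phases>
      <Phase start="2016-09-13T18:45:35.245" end="2016-09-13T19:31:49.09" leftTeamID="50157" />
      <Phase start="2016-09-13T19:47:39.336" end="2016-09-13T20:37:10.591" leftTeamID="50147" />
    </Phases>
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
        </Objs>
      </Frame>
      <Frame utc="2016-09-13T18:45:35.319" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2558" z="0" sampling="0" />
          <Obj type="0" id="105823" x="938" y="113" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>
"""

root = ET.fromstring(SAMPLE)
match = root[0]  # the Match element

# The container itself has no attributes, which is why {} was printed.
container_attribs = match.find("Phases").attrib

# Each Phase child carries its data as attributes.
phases = [phase.attrib for phase in match.findall("Phases/Phase")]

# Map each frame timestamp to the list of object attribute dictionaries.
frames = {
    frame.get("utc"): [obj.attrib for obj in frame.findall("Objs/Obj")]
    for frame in match.findall("Frames/Frame")
}

for phase in phases:
    print(phase)
for utc, objs in frames.items():
    print(utc, len(objs), "objects")
```

All attribute values come back as strings, so they need converting before any numeric work.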

Furthermore, my next challenge is getting this data into a pandas data frame. How would I go about doing this?

I would want the pandas data frame to look something like this (an example with two frames, but I would want it for every frame):

Expected output (image not included)

I used the xml.etree module to iterate through the XML and pull out the relevant data. The comments in the code below explain the process. Have a look at it and experiment with the code; hopefully it fits your use case.

import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

d = defaultdict(list)
# If you are reading from a file, use:
#     root = ET.parse('filename.xml').getroot()
# Here the XML is held in a string named `data`, hence:
root = ET.fromstring(data)
# The required data is in the Frame elements
for ent in root.findall('./Match//Frame'):
    # the utc attribute is the timestamp of the frame
    frame = ent.attrib['utc']
    for entry in ent.findall('Objs/Obj'):
        # append each object's attributes to the relevant timestamp
        d[frame].append(entry.attrib)

df = (pd.concat((pd.DataFrame(value)  # create a dataframe from the values
                 .assign(Frame=key)  # add the timestamp as a column
                 .filter(['id', 'Frame', 'x', 'y', 'z'])  # keep only the required columns
                 for key, value in d.items()),
                axis=1)  # concatenate along the columns axis
      )

df.head()

          id                    Frame     x      y    z         id                    Frame     x      y    z
0          0  2016-09-13T18:45:35.272   -46  -2562    0          0  2016-09-13T18:45:35.319   -46  -2558    0
1     105823  2016-09-13T18:45:35.272   939    113  NaN     105823  2016-09-13T18:45:35.319   938    113  NaN
2  250086090  2016-09-13T18:45:35.272  1194   1425  NaN  250086090  2016-09-13T18:45:35.319  1198   1426  NaN
3  250080473  2016-09-13T18:45:35.272    37   2875  NaN  250080473  2016-09-13T18:45:35.319    36   2874  NaN
4  250054760  2016-09-13T18:45:35.272   329    833  NaN  250054760  2016-09-13T18:45:35.319   330    833  NaN
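The result above places each frame side by side, which gets very wide for a full match. As an alternative, not part of the original answer, here is a sketch that builds a long-format frame with one row per object per frame and converts the string attributes to numbers. It assumes a shortened XML string, and the names `SAMPLE`, `records`, `long_df` and `wide` are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened, hypothetical copy of the file: two frames, two objects each.
SAMPLE = """
<Tracking update="2017-01-23T14:41:26">
  <Match id="2019285" dateMatch="2016-09-13T18:45:00" matchNumber="13">
    <Frames>
      <Frame utc="2016-09-13T18:45:35.272" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2562" z="0" sampling="0" />
          <Obj type="0" id="105823" x="939" y="113" sampling="0" />
        </Objs>
      </Frame>
      <Frame utc="2016-09-13T18:45:35.319" isBallInPlay="0">
        <Objs>
          <Obj type="7" id="0" x="-46" y="-2558" z="0" sampling="0" />
          <Obj type="0" id="105823" x="938" y="113" sampling="0" />
        </Objs>
      </Frame>
    </Frames>
  </Match>
</Tracking>
"""

root = ET.fromstring(SAMPLE)

# One dictionary per object, tagged with the timestamp of its frame.
records = []
for frame in root.iterfind("./Match/Frames/Frame"):
    for obj in frame.iterfind("Objs/Obj"):
        records.append({"utc": frame.get("utc"), **obj.attrib})

long_df = pd.DataFrame.from_records(records)

# Attribute values arrive as strings; convert them to proper types.
long_df["utc"] = pd.to_datetime(long_df["utc"])
num_cols = ["type", "id", "x", "y", "z"]
long_df[num_cols] = long_df[num_cols].apply(pd.to_numeric)

# Optional: one row per frame, one x and one y column per object id.
wide = long_df.pivot(index="utc", columns="id", values=["x", "y"])

print(long_df)
print(wide)
```

Only the ball (type 7 in the snippet) has a z attribute, so z is NaN for the other rows. The long format tends to be easier to filter and group, and the pivot can be applied afterwards if a frame-per-row layout is needed.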
