Reading hierarchical (tree-like) XML into a pandas DataFrame while preserving the hierarchy

Question

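I have an XML document with a hierarchical, tree-like structure; see the example below.

The document contains several <Message> tags (I copied only one of them for convenience).

Each <Message> has some data of its own associated with it (id, status, priority).

In addition, each <Message> can contain one or more <Street> children, which again have some associated data (<name>, <length>).

Moreover, each <Street> can have one or more <Link> children, which also have their own associated data (<id>, <direction>).

Example XML document: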

<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'>
   <status>Active</status>
   <priority>Low</priority>
   <Area>
    <Streets>
     <Street>
      <name>King Street</name>
      <length>Short</length>
       <Link>
        <id>75838745</id>
        <direction>North</direction>
       </Link>
       <Link>
        <id>168745</id>
        <direction>South</direction>
       </Link>
       <Link>
        <id>975416</id>
        <direction>North</direction>
       </Link>
     </Street>
     <Street>
      <name>Queen Street</name>
      <length>Long</length>
       <Link>
        <id>366248</id>
         <direction>West</direction>
       </Link>
       <Link>
        <id>745812</id>
         <direction>East</direction>
       </Link>
     </Street>
    </Streets>
   </Area>
</Message>
</Messages>
</Root>

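Parsing the XML with Python and storing the relevant data in variables is not the problem: I can use, for example, the lxml library and either read the whole document and then run some XPath expressions to get the relevant fields, or read it incrementally with the iterparse method.

However, I would like to put the data into a pandas DataFrame while preserving the hierarchy in it. The goal is to query for single messages (e.g. by a boolean expression such as: if status == Active then get the Message with all its streets and its streets' links) and get all the data that belongs to that message (its streets and its streets' links). What is the best way to do this?

I tried different approaches but ran into problems with all of them.

If I create one DataFrame row for each XML line that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an index with lots of NaN in it (which is generally discouraged), because the MessageID row does not yet know its child streets and links. Furthermore, I do not know how to select a sub-dataset by a boolean condition, rather than getting only a few single rows without their children.

When doing a GroupBy on [MessageID, StreetName, LinkID], I do not know how to get back a (probably MultiIndexed) DataFrame from the pandas GroupBy object, since there is nothing to aggregate here (no mean/std/sum/whatever; the values should stay unchanged).

Any suggestions on how to handle this efficiently?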

Answer 1

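I finally managed to solve the problem described above, and this is how.

I extended the XML document given above to contain two messages instead of one. This is how it looks as a valid Python string (it could of course also be loaded from a file):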

xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \
   <status>Active</status> \
   <priority>Low</priority> \
   <Area> \
    <Streets> \
     <Street> \
      <name>King Street</name> \
      <length>Short</length> \
       <Link> \
        <id>75838745</id> \
        <direction>North</direction> \
       </Link> \
       <Link> \
        <id>168745</id> \
        <direction>South</direction> \
       </Link> \
       <Link> \
        <id>975416</id> \
        <direction>North</direction> \
       </Link> \
     </Street> \
     <Street> \
      <name>Queen Street</name> \
      <length>Long</length> \
       <Link> \
        <id>366248</id> \
         <direction>West</direction> \
       </Link> \
       <Link> \
        <id>745812</id> \
         <direction>East</direction> \
       </Link> \
     </Street> \
    </Streets> \
   </Area> \
</Message> \
<Message id='54321'> \
   <status>Inactive</status> \
   <priority>High</priority> \
   <Area> \
    <Streets> \
     <Street> \
      <name>Princess Street</name> \
      <length>Mid</length> \
       <Link> \
        <id>744154</id> \
        <direction>West</direction> \
       </Link> \
       <Link> \
        <id>632214</id> \
        <direction>South</direction> \
       </Link> \
       <Link> \
        <id>654785</id> \
        <direction>East</direction> \
       </Link> \
     </Street> \
     <Street> \
      <name>Prince Street</name> \
      <length>Very Long</length> \
       <Link> \
        <id>1022444</id> \
         <direction>North</direction> \
       </Link> \
       <Link> \
        <id>4474558</id> \
         <direction>South</direction> \
       </Link> \
     </Street> \
    </Streets> \
   </Area> \
</Message> \
</Messages> \
</Root>'''

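To parse the hierarchical XML structure into a flat pandas DataFrame, I used Python's ElementTree iterparse method, which provides a SAX-like interface for stepping through an XML document element by element and firing events when specific XML tags start or end.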
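As a rough illustration of the event order that iterparse produces, here is a minimal sketch on a tiny stand-in document (not the full example above). Note that an element's text is only guaranteed to be available at its 'end' event, which is why the parsing code reads element.text on 'end'.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in document, used only to show the order of events
doc = BytesIO(b"<Street><name>King Street</name><Link><id>1</id></Link></Street>")

# Collect (event, tag) pairs in the order iterparse emits them
events = [(event, element.tag)
          for event, element in ET.iterparse(doc, events=("start", "end"))]

for event, tag in events:
    print(event, tag)
```

Each element produces one 'start' event and one 'end' event, and a parent's 'end' event comes only after all of its children have ended.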

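For each parsed XML element, the information it carries is stored in a dictionary. Three dictionaries are used, one for each set of data that belongs together (message, street, link) and that will later be stored in its own DataFrame row. Once all the information for one such row has been collected, the dictionary is appended to a list that stores all rows in the appropriate order.

This is what the XML parsing looks like (see the inline comments for further explanation):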

# imports
import xml.etree.ElementTree as ET
import pandas as pd

# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))

# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}

# initialize list that stores the single dataframe rows
listOfRows = []

# walk through the XML document event by event; iterparse emits an event whenever a tag starts or ends
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):

    ##########
    # get all information on the current message and store in the appropriate dictionary
    ##########

    # get current message's id attribute
    if event == 'start' and element.tag == 'Message':
        messageRow = {} # re-initialize the dictionary for the current row
        messageRow['messageId'] = element.get('id')

    # get current message's status
    if event == 'end' and element.tag == 'status':
        messageRow['status'] = element.text

    # get current message's priority
    if event == 'end' and element.tag == 'priority':
        messageRow['priority'] = element.text

    # when no more information on the current message is expected, append it to the list of rows
    if event == 'end' and element.tag == 'priority':
        listOfRows.append(messageRow)

    ##########
    # get all information on the current street and store in row dictionary
    ##########

    if event == 'end' and element.tag == 'name':
        streetRow = {} # re-initialize the dictionary for the current street row
        streetRow['streetName'] = element.text

    if event == 'end' and element.tag == 'length':
        streetRow['streetLength'] = element.text

    # when no more information on the current street is expected, append it to the list of rows
    if event == 'end' and element.tag == 'length':

        # link the street to the message it belongs to, then append
        streetRow['messageId'] = messageRow['messageId']
        listOfRows.append(streetRow)

    ##########
    # get all information on the current link and store in row dictionary
    ##########

    if event == 'end' and element.tag == 'id':
        linkRow = {} # re-initialize the dictionary for the current link row
        linkRow['linkId'] = element.text

    if event == 'end' and element.tag == 'direction':
        linkRow['direction'] = element.text

    # when no more information on the current link is expected, append it to the list of rows
    if event == 'end' and element.tag == 'direction':

        # link the link to the message it belongs to, then append
        linkRow['messageId'] = messageRow['messageId']
        listOfRows.append(linkRow)
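One caveat: the XML in the question declares xmlns="someNamespace" on <Root>, while the string used in this answer does not. With a default namespace, ElementTree reports tags as '{someNamespace}Message', so the plain comparisons above would never match. A minimal sketch of one way to handle this, assuming a hypothetical helper local_name (not part of ElementTree) that strips the namespace prefix:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

def local_name(tag):
    """Hypothetical helper: drop a leading '{namespace}' from a tag name."""
    return tag.rsplit('}', 1)[-1]

# Shortened stand-in for the namespaced document from the question
namespaced = BytesIO(b'''<Root xmlns="someNamespace">
<Messages><Message id="12345"><status>Active</status></Message></Messages>
</Root>''')

seen = []
for event, element in ET.iterparse(namespaced, events=('start', 'end')):
    # compare against the local name instead of the raw, namespaced tag
    if event == 'start' and local_name(element.tag) == 'Message':
        seen.append((element.tag, element.get('id')))

print(seen)
```

The id attribute itself carries no namespace, so element.get('id') works unchanged.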

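listOfRows is now a list of dictionaries, where each dictionary stores the information to be put into one DataFrame row. Creating a DataFrame with this list as the data source can be done with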

# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)

and gives the "raw" DataFrame:

   messageId    status priority       streetName streetLength    linkId  \
0      12345    Active      Low              NaN          NaN       NaN   
1      12345       NaN      NaN      King Street        Short       NaN   
2      12345       NaN      NaN              NaN          NaN  75838745   
3      12345       NaN      NaN              NaN          NaN    168745   
4      12345       NaN      NaN              NaN          NaN    975416   
5      12345       NaN      NaN     Queen Street         Long       NaN   
6      12345       NaN      NaN              NaN          NaN    366248   
7      12345       NaN      NaN              NaN          NaN    745812   
8      54321  Inactive     High              NaN          NaN       NaN   
9      54321       NaN      NaN  Princess Street          Mid       NaN   
10     54321       NaN      NaN              NaN          NaN    744154   
11     54321       NaN      NaN              NaN          NaN    632214   
12     54321       NaN      NaN              NaN          NaN    654785   
13     54321       NaN      NaN    Prince Street    Very Long       NaN   
14     54321       NaN      NaN              NaN          NaN   1022444   
15     54321       NaN      NaN              NaN          NaN   4474558   

   direction  
0        NaN  
1        NaN  
2      North  
3      South  
4      North  
5        NaN  
6       West  
7       East  
8        NaN  
9        NaN  
10      West  
11     South  
12      East  
13       NaN  
14     North  
15     South
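The many NaN entries in the output above come from the fact that each dictionary carries only the keys of its own row type; pandas fills every missing key with NaN. A minimal sketch with three hand-written rows (a shortened stand-in for listOfRows, using the plain DataFrame constructor):

```python
import pandas as pd

# One row of each type, each with only its own keys plus messageId
rows = [
    {'messageId': '12345', 'status': 'Active'},
    {'messageId': '12345', 'streetName': 'King Street'},
    {'messageId': '12345', 'linkId': '75838745'},
]

demo = pd.DataFrame(rows, columns=['messageId', 'status', 'streetName', 'linkId'])
print(demo)

# Number of non-missing cells per row: every row fills only its own columns
print(demo.notna().sum(axis=1).tolist())
```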

On this DataFrame we can now set the columns of interest (messageId, streetName, linkId) as a MultiIndex:

# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)

This gives:

                                      status priority streetLength direction
messageId streetName      linkId                                            
12345     NaN             NaN         Active      Low          NaN       NaN
          King Street     NaN            NaN      NaN        Short       NaN
          NaN             75838745       NaN      NaN          NaN     North
                          168745         NaN      NaN          NaN     South
                          975416         NaN      NaN          NaN     North
          Queen Street    NaN            NaN      NaN         Long       NaN
          NaN             366248         NaN      NaN          NaN      West
                          745812         NaN      NaN          NaN      East
54321     NaN             NaN       Inactive     High          NaN       NaN
          Princess Street NaN            NaN      NaN          Mid       NaN
          NaN             744154         NaN      NaN          NaN      West
                          632214         NaN      NaN          NaN     South
                          654785         NaN      NaN          NaN      East
          Prince Street   NaN            NaN      NaN    Very Long       NaN
          NaN             1022444        NaN      NaN          NaN     North
                          4474558        NaN      NaN          NaN     South

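Even though NaN values in an index should generally be avoided, I don't see any problem with them for this use case.

Finally, to get the desired effect of accessing a single message by its messageId (including all of its "child" streets and links), the MultiIndexed DataFrame has to be grouped by the outermost index level: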

# group by the most outer index
groups = df.groupby(level='messageId')

Now you can loop over all messages (and do whatever you like with them) using

# iterate over all groups
for key, group in groups:
    print('key: ' + key)
    print('group:')
    print(group)
    print('\n')

which returns

key: 12345
group:
                                 status priority streetLength direction
messageId streetName   linkId                                          
12345     NaN          NaN       Active      Low          NaN       NaN
          King Street  NaN          NaN      NaN        Short       NaN
          NaN          75838745     NaN      NaN          NaN     North
                       168745       NaN      NaN          NaN     South
                       975416       NaN      NaN          NaN     North
          Queen Street NaN          NaN      NaN         Long       NaN
          NaN          366248       NaN      NaN          NaN      West
                       745812       NaN      NaN          NaN      East


key: 54321
group:
                                     status priority streetLength direction
messageId streetName      linkId                                           
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South

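Alternatively, you can access a specific message by its messageId, which returns the row containing the messageId together with all of its streets and links: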

# get groups by key
print('specific group only:')
print(groups.get_group('54321'))

giving

specific group only:
                                     status priority streetLength direction
messageId streetName      linkId                                           
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South
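The question also asked for selecting whole messages by a boolean condition such as status == Active. The answer does not show this, so the following is only a sketch of one possible approach, using a shortened stand-in for the raw (not yet indexed) DataFrame: first collect the ids of the matching message rows, then keep every row that carries one of those ids.

```python
import pandas as pd

# Shortened stand-in for the "raw" DataFrame built from listOfRows
rows = [
    {'messageId': '12345', 'status': 'Active'},
    {'messageId': '12345', 'streetName': 'King Street'},
    {'messageId': '12345', 'linkId': '75838745'},
    {'messageId': '54321', 'status': 'Inactive'},
    {'messageId': '54321', 'streetName': 'Princess Street'},
    {'messageId': '54321', 'linkId': '744154'},
]
raw = pd.DataFrame(rows, columns=['messageId', 'status', 'streetName', 'linkId'])

# ids of messages whose own row has status == 'Active' (NaN compares as False)
active_ids = raw.loc[raw['status'] == 'Active', 'messageId'].unique()

# keep every row (message, street, link) that belongs to one of those ids
active_rows = raw[raw['messageId'].isin(active_ids)]
print(active_rows)
```

On the MultiIndexed DataFrame, the same filter could presumably be written against the index level instead, e.g. with df.index.get_level_values('messageId').isin(active_ids).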

Hope this helps somebody.

Solution 1
4, accepted, 2015-01-05 17:33:45