Reading hierarchical (tree-like) XML into a pandas DataFrame, preserving the hierarchy

Question

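I have an XML document that contains a hierarchical, tree-like structure; see the example below.

The document contains several <Message> tags (for convenience I copied only one of them).

Each <Message> has some associated data of its own (id, status, priority).

In addition, each <Message> can contain one or more <Street> children, which again have some associated data (<name>, <length>).

Moreover, each <Street> can have one or more <Link> children, which also have their own associated data (<id>, <direction>).

Example XML document: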

<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'>
   <status>Active</status>
   <priority>Low</priority>
   <Area>
    <Streets>
     <Street>
      <name>King Street</name>
      <length>Short</length>
       <Link>
        <id>75838745</id>
        <direction>North</direction>
       </Link>
       <Link>
        <id>168745</id>
        <direction>South</direction>
       </Link>
       <Link>
        <id>975416</id>
        <direction>North</direction>
       </Link>
     </Street>
     <Street>
      <name>Queen Street</name>
      <length>Long</length>
       <Link>
        <id>366248</id>
         <direction>West</direction>
       </Link>
       <Link>
        <id>745812</id>
         <direction>East</direction>
       </Link>
     </Street>
    </Streets>
   </Area>
</Message>
</Messages>
</Root>

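Parsing the XML with Python and storing the relevant data in variables is not the problem: I can use e.g. the lxml library and either read the whole document and then run some XPath expressions to get the relevant fields, or read it incrementally with the iterparse method.

However, I would like to put the data into a pandas DataFrame while preserving its hierarchical structure. The goal is to query single messages (e.g. with a Boolean expression such as if status == Active then get the Message with all its streets and its streets' links) and to get all the data that belongs to that specific message (its streets and its streets' links). What is the best way to do this?

I tried different approaches but ran into problems with all of them.

If I create one DataFrame row for each XML line that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an index with lots of NaN in it (which is generally discouraged), because a MessageID row does not yet know its child streets and links. Besides, I would not know how to select a sub-dataset by a Boolean condition, rather than getting only some single rows without their children.

When doing a GroupBy on [MessageID, StreetName, LinkID], I do not know how to get back a (probably MultiIndexed) DataFrame from the pandas GroupBy object, since there is nothing to aggregate here (no mean/std/sum/whatever; the values should stay unchanged).

Any suggestions on how to handle this efficiently?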

Answer 1

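I finally managed to solve the problem described above; here is how.

I extended the XML document given above to include two messages instead of one. This is what it looks like as a valid Python string (it could of course also be loaded from a file):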

xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \
   <status>Active</status> \
   <priority>Low</priority> \
   <Area> \
    <Streets> \
     <Street> \
      <name>King Street</name> \
      <length>Short</length> \
       <Link> \
        <id>75838745</id> \
        <direction>North</direction> \
       </Link> \
       <Link> \
        <id>168745</id> \
        <direction>South</direction> \
       </Link> \
       <Link> \
        <id>975416</id> \
        <direction>North</direction> \
       </Link> \
     </Street> \
     <Street> \
      <name>Queen Street</name> \
      <length>Long</length> \
       <Link> \
        <id>366248</id> \
         <direction>West</direction> \
       </Link> \
       <Link> \
        <id>745812</id> \
         <direction>East</direction> \
       </Link> \
     </Street> \
    </Streets> \
   </Area> \
</Message> \
<Message id='54321'> \
   <status>Inactive</status> \
   <priority>High</priority> \
   <Area> \
    <Streets> \
     <Street> \
      <name>Princess Street</name> \
      <length>Mid</length> \
       <Link> \
        <id>744154</id> \
        <direction>West</direction> \
       </Link> \
       <Link> \
        <id>632214</id> \
        <direction>South</direction> \
       </Link> \
       <Link> \
        <id>654785</id> \
        <direction>East</direction> \
       </Link> \
     </Street> \
     <Street> \
      <name>Prince Street</name> \
      <length>Very Long</length> \
       <Link> \
        <id>1022444</id> \
         <direction>North</direction> \
       </Link> \
       <Link> \
        <id>4474558</id> \
         <direction>South</direction> \
       </Link> \
     </Street> \
    </Streets> \
   </Area> \
</Message> \
</Messages> \
</Root>'''

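To parse the hierarchical XML structure into a flat pandas DataFrame, I used Python's ElementTree iterparse method. It provides a SAX-like interface that walks through the XML document element by element and fires an event whenever a tag starts or ends.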
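To make the event order concrete, here is a minimal sketch. The one-message `sample` document is made up for illustration and is not part of the data used in this answer.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# hypothetical, tiny document used only to show the event order
sample = b"<Message id='1'><status>Active</status></Message>"

# collect (event, tag) pairs in the order iterparse emits them
events = [
    (event, element.tag)
    for event, element in ET.iterparse(BytesIO(sample), events=('start', 'end'))
]
print(events)
```

A parent's 'start' event arrives before any of its children, and its 'end' event arrives after all of them. An element's text is only guaranteed to be complete at its 'end' event, which is why the parsing code reads attributes on 'start' and text on 'end'.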

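For each parsed XML element, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that belongs together (message, street, link) and that will later be stored in its own DataFrame row. Once all the information for one such row has been collected, the dictionary is appended to a list that stores all rows in the appropriate order.

This is what the XML parsing looks like (see the inline comments for further explanation):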

# imports
import xml.etree.ElementTree as ET
import pandas as pd

# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))

# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}

# initialize list that stores the single dataframe rows
listOfRows = []

# walk the XML element by element; iterparse emits an event whenever a tag starts or ends
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):

    ##########
    # get all information on the current message and store in the appropriate dictionary
    ##########

    # get current message's id attribute
    if event == 'start' and element.tag == 'Message':
        messageRow = {} # re-initialize the dictionary for the current row
        messageRow['messageId'] = element.get('id')

    # get current message's status
    if event == 'end' and element.tag == 'status':
        messageRow['status'] = element.text

    # get current message's priority
    if event == 'end' and element.tag == 'priority':
        messageRow['priority'] = element.text

    # when no more information on the current message is expected, append it to the list of rows
    if event == 'end' and element.tag == 'priority':
        listOfRows.append(messageRow)

    ##########
    # get all information on the current street and store in row dictionary
    ##########

    if event == 'end' and element.tag == 'name':
        streetRow = {} # re-initialize the dictionary for the current street row
        streetRow['streetName'] = element.text

    if event == 'end' and element.tag == 'length':
        streetRow['streetLength'] = element.text

    # when no more information on the current street is expected, append it to the list of rows
    if event == 'end' and element.tag == 'length':

        # link the street to the message it belongs to, then append
        streetRow['messageId'] = messageRow['messageId']
        listOfRows.append(streetRow)

    ##########
    # get all information on the current link and store in row dictionary
    ##########

    if event == 'end' and element.tag == 'id':
        linkRow = {} # re-initialize the dictionary for the current link row
        linkRow['linkId'] = element.text

    if event == 'end' and element.tag == 'direction':
        linkRow['direction'] = element.text

    # when no more information on the current link is expected, append it to the list of rows
    if event == 'end' and element.tag == 'direction':

        # link the link to the message it belongs to, then append
        linkRow['messageId'] = messageRow['messageId']
        listOfRows.append(linkRow)
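Note that the XML in the question declares a default namespace (xmlns="someNamespace") on <Root>, while the string used in this answer does not. With a namespace present, ElementTree reports tags as '{someNamespace}Message', so comparisons such as element.tag == 'Message' would never match. One possible workaround, sketched below, is to compare only the local part of the tag. The helper name `local_name` and the shortened `namespaced` document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# shortened, hypothetical variant of the question's document, with the namespace kept
namespaced = (
    b"<Root xmlns='someNamespace'><Messages>"
    b"<Message id='12345'><status>Active</status></Message>"
    b"</Messages></Root>"
)

def local_name(tag):
    # hypothetical helper: drop a leading '{namespace}' prefix, if there is one
    return tag.rsplit('}', 1)[-1]

raw_tags, message_ids = [], []
for event, element in ET.iterparse(BytesIO(namespaced), events=('start',)):
    raw_tags.append(element.tag)
    if local_name(element.tag) == 'Message':
        message_ids.append(element.get('id'))

print(raw_tags[0])   # the tag as ElementTree reports it
print(message_ids)
```

In the parsing loop above, this would mean writing local_name(element.tag) wherever element.tag is compared with a plain tag name.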

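listOfRows is now a list of dictionaries, where each dictionary stores the information to be put into one DataFrame row. Creating a DataFrame with this list as the data source can be done with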

# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)

and gives the "raw" DataFrame:

   messageId    status priority       streetName streetLength    linkId  \
0      12345    Active      Low              NaN          NaN       NaN   
1      12345       NaN      NaN      King Street        Short       NaN   
2      12345       NaN      NaN              NaN          NaN  75838745   
3      12345       NaN      NaN              NaN          NaN    168745   
4      12345       NaN      NaN              NaN          NaN    975416   
5      12345       NaN      NaN     Queen Street         Long       NaN   
6      12345       NaN      NaN              NaN          NaN    366248   
7      12345       NaN      NaN              NaN          NaN    745812   
8      54321  Inactive     High              NaN          NaN       NaN   
9      54321       NaN      NaN  Princess Street          Mid       NaN   
10     54321       NaN      NaN              NaN          NaN    744154   
11     54321       NaN      NaN              NaN          NaN    632214   
12     54321       NaN      NaN              NaN          NaN    654785   
13     54321       NaN      NaN    Prince Street    Very Long       NaN   
14     54321       NaN      NaN              NaN          NaN   1022444   
15     54321       NaN      NaN              NaN          NaN   4474558   

   direction  
0        NaN  
1        NaN  
2      North  
3      South  
4      North  
5        NaN  
6       West  
7       East  
8        NaN  
9        NaN  
10      West  
11     South  
12      East  
13       NaN  
14     North  
15     South

We can now set the columns of interest (messageId, streetName, linkId) as a MultiIndex on that DataFrame:

# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)

This gives:

                                      status priority streetLength direction
messageId streetName      linkId                                            
12345     NaN             NaN         Active      Low          NaN       NaN
          King Street     NaN            NaN      NaN        Short       NaN
          NaN             75838745       NaN      NaN          NaN     North
                          168745         NaN      NaN          NaN     South
                          975416         NaN      NaN          NaN     North
          Queen Street    NaN            NaN      NaN         Long       NaN
          NaN             366248         NaN      NaN          NaN      West
                          745812         NaN      NaN          NaN      East
54321     NaN             NaN       Inactive     High          NaN       NaN
          Princess Street NaN            NaN      NaN          Mid       NaN
          NaN             744154         NaN      NaN          NaN      West
                          632214         NaN      NaN          NaN     South
                          654785         NaN      NaN          NaN      East
          Prince Street   NaN            NaN      NaN    Very Long       NaN
          NaN             1022444        NaN      NaN          NaN     North
                          4474558        NaN      NaN          NaN     South

Although NaN values in an index should generally be avoided, I do not see any problem with them in this use case.
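If the NaN values in the index are unwanted, one possible alternative (not what this answer does) is to forward-fill the parent values within each message and keep only the link rows, so that every row carries its message and street. The sketch below uses a shortened, hand-written `rows` list in the same shape as listOfRows, and it assumes that parent rows always come before their children, as the parsing loop produces them.

```python
import pandas as pd

columns = ['messageId', 'status', 'priority', 'streetName',
           'streetLength', 'linkId', 'direction']

# shortened stand-in for listOfRows (illustrative data only)
rows = [
    {'messageId': '12345', 'status': 'Active', 'priority': 'Low'},
    {'messageId': '12345', 'streetName': 'King Street', 'streetLength': 'Short'},
    {'messageId': '12345', 'linkId': '75838745', 'direction': 'North'},
    {'messageId': '12345', 'linkId': '168745', 'direction': 'South'},
    {'messageId': '12345', 'streetName': 'Queen Street', 'streetLength': 'Long'},
    {'messageId': '12345', 'linkId': '366248', 'direction': 'West'},
]
raw = pd.DataFrame(rows, columns=columns)

# copy message and street values down to the rows below them, per message
parent_cols = ['status', 'priority', 'streetName', 'streetLength']
filled = raw.copy()
filled[parent_cols] = filled.groupby('messageId')[parent_cols].ffill()

# keep one row per link; each now knows its message and its street
links = (filled.dropna(subset=['linkId'])
               .set_index(['messageId', 'streetName', 'linkId']))
print(links)
```

The trade-off is that a street without any links, or a message without any streets, disappears from `links`, so this only fits data where every parent has at least one link.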

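Finally, to achieve the desired effect of accessing a single message by its messageId (including all its "child" streets and links), the MultiIndexed DataFrame has to be grouped by the outermost index level: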

# group by the most outer index
groups = df.groupby(level='messageId')

Now you can either loop over all messages (and do whatever you like with them) using

# iterate over all groups
for key, group in groups:
    print('key: ' + key)
    print('group:')
    print(group)
    print('\n')

which returns

key: 12345
group:
                                 status priority streetLength direction
messageId streetName   linkId                                          
12345     NaN          NaN       Active      Low          NaN       NaN
          King Street  NaN          NaN      NaN        Short       NaN
          NaN          75838745     NaN      NaN          NaN     North
                       168745       NaN      NaN          NaN     South
                       975416       NaN      NaN          NaN     North
          Queen Street NaN          NaN      NaN         Long       NaN
          NaN          366248       NaN      NaN          NaN      West
                       745812       NaN      NaN          NaN      East


key: 54321
group:
                                     status priority streetLength direction
messageId streetName      linkId                                           
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South

Or you can access a specific message by its messageId, which returns the row containing the messageId together with all of its streets and links:

# get groups by key
print('specific group only:')
print(groups.get_group('54321'))

which gives

specific group only:
                                     status priority streetLength direction
messageId streetName      linkId                                           
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South
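The question also asked for selection by a Boolean condition such as status == Active. A minimal sketch of one way to do that on the flat frame, before the MultiIndex is set: first collect the messageIds whose own row satisfies the condition, then keep every row that carries one of those ids. The shortened `rows` list is illustrative data in the same shape as listOfRows.

```python
import pandas as pd

columns = ['messageId', 'status', 'priority', 'streetName',
           'streetLength', 'linkId', 'direction']

# shortened stand-in for listOfRows with two messages (illustrative data only)
rows = [
    {'messageId': '12345', 'status': 'Active', 'priority': 'Low'},
    {'messageId': '12345', 'streetName': 'King Street', 'streetLength': 'Short'},
    {'messageId': '12345', 'linkId': '75838745', 'direction': 'North'},
    {'messageId': '54321', 'status': 'Inactive', 'priority': 'High'},
    {'messageId': '54321', 'streetName': 'Princess Street', 'streetLength': 'Mid'},
    {'messageId': '54321', 'linkId': '744154', 'direction': 'West'},
]
raw = pd.DataFrame(rows, columns=columns)

# step 1: ids of the messages whose message row has status 'Active'
active_ids = raw.loc[raw['status'] == 'Active', 'messageId'].unique()

# step 2: all rows (message, streets, links) that belong to those messages
selected = (raw[raw['messageId'].isin(active_ids)]
            .set_index(['messageId', 'streetName', 'linkId']))
print(selected)
```

Because every row stores its messageId, the condition only has to hold on the message row itself for the street and link rows to be picked up as well.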

I hope this helps someone.

Answer 1 (score 4, accepted, 2015-01-05 17:33:45)
