Read hierarchical (tree-like) XML into a pandas dataframe, preserving hierarchy
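I have an XML document with a hierarchical, tree-like structure; see the example below.
The document contains several <Message> tags (for convenience I copied only one of them). Each <Message> has some data of its own (id, status, priority).
In addition, each <Message> can contain one or more <Street> children, which again carry some data (<name>, <length>).
Each <Street>, in turn, can have one or more <Link> children, which also have their own data (<id>, <direction>).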
Example XML document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'>
<status>Active</status>
<priority>Low</priority>
<Area>
<Streets>
<Street>
<name>King Street</name>
<length>Short</length>
<Link>
<id>75838745</id>
<direction>North</direction>
</Link>
<Link>
<id>168745</id>
<direction>South</direction>
</Link>
<Link>
<id>975416</id>
<direction>North</direction>
</Link>
</Street>
<Street>
<name>Queen Street</name>
<length>Long</length>
<Link>
<id>366248</id>
<direction>West</direction>
</Link>
<Link>
<id>745812</id>
<direction>East</direction>
</Link>
</Street>
</Streets>
</Area>
</Message>
</Messages>
</Root>
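Parsing the XML with Python and storing the relevant data in variables is not the problem: I can, for example, use the lxml library and either read the whole document and run some xpath expressions to get the relevant fields, or read it step by step with the iterparse method.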
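However, I would like to put the data into a pandas dataframe while preserving its hierarchy. The goal is to query for single messages (e.g. by a boolean expression such as: if status == Active then get the Message with all its streets and its streets' links) and get all the data that belongs to that message (its streets and its streets' links). How would this best be done?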
I tried different approaches but ran into problems with all of them.
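If I create one dataframe row for each XML row that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an index with lots of NaN in it (which is generally discouraged), because the row holding a MessageID does not know its child streets and links yet. Besides, I would not know how to select a sub-dataset by a boolean condition instead of getting back only single rows without their children.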
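When doing a GroupBy on [MessageID, StreetName, LinkID], I do not know how to get back a (probably MultiIndex) dataframe from the pandas GroupBy object, since there is nothing to aggregate here (no mean/std/sum/whatever; the values should stay the same).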
Any suggestions on how this could be handled efficiently?
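I finally managed to solve the problem described above, and this is how.
I extended the XML document given above to contain two messages instead of one. This is how it looks as a valid Python string (it could of course also be loaded from a file):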
xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \
<status>Active</status> \
<priority>Low</priority> \
<Area> \
<Streets> \
<Street> \
<name>King Street</name> \
<length>Short</length> \
<Link> \
<id>75838745</id> \
<direction>North</direction> \
</Link> \
<Link> \
<id>168745</id> \
<direction>South</direction> \
</Link> \
<Link> \
<id>975416</id> \
<direction>North</direction> \
</Link> \
</Street> \
<Street> \
<name>Queen Street</name> \
<length>Long</length> \
<Link> \
<id>366248</id> \
<direction>West</direction> \
</Link> \
<Link> \
<id>745812</id> \
<direction>East</direction> \
</Link> \
</Street> \
</Streets> \
</Area> \
</Message> \
<Message id='54321'> \
<status>Inactive</status> \
<priority>High</priority> \
<Area> \
<Streets> \
<Street> \
<name>Princess Street</name> \
<length>Mid</length> \
<Link> \
<id>744154</id> \
<direction>West</direction> \
</Link> \
<Link> \
<id>632214</id> \
<direction>South</direction> \
</Link> \
<Link> \
<id>654785</id> \
<direction>East</direction> \
</Link> \
</Street> \
<Street> \
<name>Prince Street</name> \
<length>Very Long</length> \
<Link> \
<id>1022444</id> \
<direction>North</direction> \
</Link> \
<Link> \
<id>4474558</id> \
<direction>South</direction> \
</Link> \
</Street> \
</Streets> \
</Area> \
</Message> \
</Messages> \
</Root>'''
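One detail worth noting: the string above drops the xmlns="someNamespace" declaration that the document in the question carries. If the namespace is kept, ElementTree reports each tag with the namespace in braces in front of it, so comparisons such as element.tag == 'Message' would no longer match. The following is a minimal sketch of one way to deal with that; the shortened document and the helper local_name are illustrative assumptions, not part of the original solution.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's document, with the namespace kept
namespaced = "<Root xmlns='someNamespace'><Messages><Message id='12345'/></Messages></Root>"


def local_name(tag):
    """Return the tag without a leading '{namespace}' part (hypothetical helper)."""
    return tag.rsplit('}', 1)[-1]


root = ET.fromstring(namespaced)

# tags as ElementTree reports them, in document order
tags = [element.tag for element in root.iter()]

# the same tags with the namespace part removed
local_tags = [local_name(tag) for tag in tags]

print(tags[0])
print(local_tags)
```

Comparing against local_name(element.tag) instead of element.tag would let the tag checks work with or without the namespace.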
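To parse the hierarchical XML structure into a flat pandas dataframe, I used Python's ElementTree iterparse method, which provides a SAX-like interface for walking through an XML document step by step and firing events when specific XML tags start or end.
For each parsed XML element, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that belongs together (message, street, link) and that will later be stored in its own dataframe row. When all the information for one such row has been collected, the dictionary is appended to a list that stores all rows in their appropriate order.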
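To make the event model concrete, here is a minimal sketch of what iterparse emits for a tiny document. The one-message document is a made-up stand-in, not the document from the question.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in document (hypothetical, much smaller than the real one)
tiny = BytesIO(b"<Message id='1'><status>Active</status></Message>")

events = []
for event, element in ET.iterparse(tiny, events=('start', 'end')):
    # record which event fired for which tag, in the order they arrive
    events.append((event, element.tag))

print(events)
# [('start', 'Message'), ('start', 'status'), ('end', 'status'), ('end', 'Message')]
```

Attributes are already available on a 'start' event, whereas the text of an element is only guaranteed to be complete on its 'end' event. That is why the id attribute can be read when Message starts, while status, priority and the other text fields are read when their tags end.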
This is what the XML parsing looks like (see the inline comments for further explanation):
# imports
import xml.etree.ElementTree as ET
import pandas as pd
# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))
# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}
# initialize list that stores the single dataframe rows
listOfRows = []
# read the xml file line by line and throw signal when specific tags start or end
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):
    ##########
    # get all information on the current message and store in the appropriate dictionary
    ##########
    # get current message's id attribute
    if event == 'start' and element.tag == 'Message':
        messageRow = {}  # re-initialize the dictionary for the current row
        messageRow['messageId'] = element.get('id')
    # get current message's status
    if event == 'end' and element.tag == 'status':
        messageRow['status'] = element.text
    # get current message's priority
    if event == 'end' and element.tag == 'priority':
        messageRow['priority'] = element.text
    # when no more information on the current message is expected, append it to the list of rows
    if event == 'end' and element.tag == 'priority':
        listOfRows.append(messageRow)
    ##########
    # get all information on the current street and store in row dictionary
    ##########
    if event == 'end' and element.tag == 'name':
        streetRow = {}  # re-initialize the dictionary for the current street row
        streetRow['streetName'] = element.text
    if event == 'end' and element.tag == 'length':
        streetRow['streetLength'] = element.text
    # when no more information on the current street is expected, append it to the list of rows
    if event == 'end' and element.tag == 'length':
        # link the street to the message it belongs to, then append
        streetRow['messageId'] = messageRow['messageId']
        listOfRows.append(streetRow)
    ##########
    # get all information on the current link and store in row dictionary
    ##########
    if event == 'end' and element.tag == 'id':
        linkRow = {}  # re-initialize the dictionary for the current link row
        linkRow['linkId'] = element.text
    if event == 'end' and element.tag == 'direction':
        linkRow['direction'] = element.text
    # when no more information on the current link is expected, append it to the list of rows
    if event == 'end' and element.tag == 'direction':
        # link the link to the message it belongs to, then append
        linkRow['messageId'] = messageRow['messageId']
        listOfRows.append(linkRow)
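listOfRows is now a list of dictionaries, each of which stores the information to be put into one dataframe row. Creating a dataframe with this list as its data source can be done with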
# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)
and gives the "raw" dataframe:
    messageId    status  priority       streetName  streetLength    linkId  \
 0      12345    Active       Low              NaN           NaN       NaN
 1      12345       NaN       NaN      King Street         Short       NaN
 2      12345       NaN       NaN              NaN           NaN  75838745
 3      12345       NaN       NaN              NaN           NaN    168745
 4      12345       NaN       NaN              NaN           NaN    975416
 5      12345       NaN       NaN     Queen Street          Long       NaN
 6      12345       NaN       NaN              NaN           NaN    366248
 7      12345       NaN       NaN              NaN           NaN    745812
 8      54321  Inactive      High              NaN           NaN       NaN
 9      54321       NaN       NaN  Princess Street           Mid       NaN
10      54321       NaN       NaN              NaN           NaN    744154
11      54321       NaN       NaN              NaN           NaN    632214
12      54321       NaN       NaN              NaN           NaN    654785
13      54321       NaN       NaN    Prince Street     Very Long       NaN
14      54321       NaN       NaN              NaN           NaN   1022444
15      54321       NaN       NaN              NaN           NaN   4474558

    direction
 0        NaN
 1        NaN
 2      North
 3      South
 4      North
 5        NaN
 6       West
 7       East
 8        NaN
 9        NaN
10       West
11      South
12       East
13        NaN
14      North
15      South
We can now set the columns of interest (messageId, streetName, linkId) as a MultiIndex on that dataframe:
# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)
which gives:
                                      status  priority  streetLength  direction
messageId streetName      linkId
12345     NaN             NaN         Active       Low           NaN        NaN
          King Street     NaN            NaN       NaN         Short        NaN
          NaN             75838745       NaN       NaN           NaN      North
                          168745         NaN       NaN           NaN      South
                          975416         NaN       NaN           NaN      North
          Queen Street    NaN            NaN       NaN          Long        NaN
          NaN             366248         NaN       NaN           NaN       West
                          745812         NaN       NaN           NaN       East
54321     NaN             NaN       Inactive      High           NaN        NaN
          Princess Street NaN            NaN       NaN           Mid        NaN
          NaN             744154         NaN       NaN           NaN       West
                          632214         NaN       NaN           NaN      South
                          654785         NaN       NaN           NaN       East
          Prince Street   NaN            NaN       NaN     Very Long        NaN
          NaN             1022444        NaN       NaN           NaN      North
                          4474558        NaN       NaN           NaN      South
Even though NaN values in an index should generally be avoided, I do not see any problem with them for this use case.
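If the NaN entries in the index do turn out to be a problem, one possible alternative is a fully denormalized table with one row per link, where the message and street fields are repeated on every row. The following is only a sketch under assumptions: it loads the whole document with ET.fromstring instead of iterparse, uses a shortened stand-in document, and silently drops messages without streets and streets without links.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened stand-in for the two-message document (no namespace)
sample = """<Root><Messages>
<Message id='12345'><status>Active</status><priority>Low</priority>
<Area><Streets>
<Street><name>King Street</name><length>Short</length>
<Link><id>75838745</id><direction>North</direction></Link>
<Link><id>168745</id><direction>South</direction></Link>
</Street>
</Streets></Area></Message>
<Message id='54321'><status>Inactive</status><priority>High</priority>
<Area><Streets>
<Street><name>Prince Street</name><length>Very Long</length>
<Link><id>1022444</id><direction>North</direction></Link>
</Street>
</Streets></Area></Message>
</Messages></Root>"""

rows = []
root = ET.fromstring(sample)
for message in root.iter('Message'):
    for street in message.iter('Street'):
        for link in street.iter('Link'):
            # one row per link, carrying the data of its street and its message
            rows.append({
                'messageId': message.get('id'),
                'status': message.findtext('status'),
                'priority': message.findtext('priority'),
                'streetName': street.findtext('name'),
                'streetLength': street.findtext('length'),
                'linkId': link.findtext('id'),
                'direction': link.findtext('direction'),
            })

# every index level is filled on every row, so the index contains no NaN
flat = pd.DataFrame(rows).set_index(['messageId', 'streetName', 'linkId'])
print(flat)
```

The trade-off is redundancy: status, priority and streetLength are stored once per link rather than once per message or street.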
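Finally, to get the desired effect of accessing single messages by their messageId (including all their "child" streets and links), the MultiIndexed dataframe has to be grouped by the outermost index level: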
# group by the most outer index
groups = df.groupby(level='messageId')
Now you can either loop over all messages (and do whatever you want with them)
# iterate over all groups
for key, group in groups:
    print('key: ' + key)
    print('group:')
    print(group)
    print('\n')
which returns
key: 12345
group:
                                      status  priority  streetLength  direction
messageId streetName      linkId
12345     NaN             NaN         Active       Low           NaN        NaN
          King Street     NaN            NaN       NaN         Short        NaN
          NaN             75838745       NaN       NaN           NaN      North
                          168745         NaN       NaN           NaN      South
                          975416         NaN       NaN           NaN      North
          Queen Street    NaN            NaN       NaN          Long        NaN
          NaN             366248         NaN       NaN           NaN       West
                          745812         NaN       NaN           NaN       East
key: 54321
group:
                                      status  priority  streetLength  direction
messageId streetName      linkId
54321     NaN             NaN       Inactive      High           NaN        NaN
          Princess Street NaN            NaN       NaN           Mid        NaN
          NaN             744154         NaN       NaN           NaN       West
                          632214         NaN       NaN           NaN      South
                          654785         NaN       NaN           NaN       East
          Prince Street   NaN            NaN       NaN     Very Long        NaN
          NaN             1022444        NaN       NaN           NaN      North
                          4474558        NaN       NaN           NaN      South
or access specific messages by their messageId, which returns the row containing the messageId together with all of its streets and links:
# get groups by key
print('specific group only:')
print(groups.get_group('54321'))
which gives
specific group only:
                                      status  priority  streetLength  direction
messageId streetName      linkId
54321     NaN             NaN       Inactive      High           NaN        NaN
          Princess Street NaN            NaN       NaN           Mid        NaN
          NaN             744154         NaN       NaN           NaN       West
                          632214         NaN       NaN           NaN      South
                          654785         NaN       NaN           NaN       East
          Prince Street   NaN            NaN       NaN     Very Long        NaN
          NaN             1022444        NaN       NaN           NaN      North
                          4474558        NaN       NaN           NaN      South
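The question also asked for selection by a boolean condition (for example status == Active) rather than by a known messageId. A possible sketch, assuming the "raw" layout from above in which only the message's own row carries the status: first collect the ids of the matching messages, then keep every row that belongs to one of those ids. The small hand-built dataframe below is a stand-in for the real one.

```python
import pandas as pd

# Small hand-built stand-in for the "raw" dataframe
# (one row per message, street, or link; missing fields become NaN)
raw = pd.DataFrame([
    {'messageId': '12345', 'status': 'Active', 'priority': 'Low'},
    {'messageId': '12345', 'streetName': 'King Street', 'streetLength': 'Short'},
    {'messageId': '12345', 'linkId': '75838745', 'direction': 'North'},
    {'messageId': '54321', 'status': 'Inactive', 'priority': 'High'},
    {'messageId': '54321', 'streetName': 'Prince Street', 'streetLength': 'Very Long'},
    {'messageId': '54321', 'linkId': '1022444', 'direction': 'North'},
])

# ids of all messages whose own row has status == 'Active'
# (rows with a NaN status compare as False and are skipped here)
active_ids = raw.loc[raw['status'] == 'Active', 'messageId'].unique()

# keep every row (message, street, link) that belongs to one of those messages
active = raw[raw['messageId'].isin(active_ids)]
print(active)
```

On the MultiIndexed dataframe the same idea should work by testing df.index.get_level_values('messageId') with isin instead of a column.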
Hope this will be helpful for someone.