Convert complex XML file to data frame in R

I am trying to convert an XML file to a data frame in R. Here is a sample of the XML file:

<Games timestamp="2016-12-02T09:06:51">
<Game id="853139" away_team_id="143" away_team_name="Lyon" competition_id="24" competition_name="French Ligue 1" game_date="2016-08-14T14:00:00" home_team_id="148" home_team_name="Nancy" matchday="1" period_1_start="2016-08-14T14:00:25" period_2_start="2016-08-14T15:02:29" season_id="2016" season_name="Season 2016/2017">
<Event id="1195160021" event_id="1" type_id="34" period_id="16" min="0" sec="0" team_id="143" outcome="1" x="0.0" y="0.0" timestamp="2016-08-14T13:08:34.349" last_modified="2016-08-14T13:59:59" version="1471179598746">
  <Q id="1117749718" qualifier_id="194" value="59963" />
  <Q id="1807420796" qualifier_id="30" value="59957, 54772, 37832, 59963, 44488, 52775, 169007, 168568, 59966, 166552, 149519, 220560, 173211, 55305, 107641, 37852, 59956, 71389" />
  <Q id="450557206" qualifier_id="197" value="645" />
  <Q id="1671039854" qualifier_id="131" value="1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0" />
  <Q id="108315093" qualifier_id="227" value="0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0" />
  <Q id="582175015" qualifier_id="44" value="1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5" />
  <Q id="1069121575" qualifier_id="130" value="4" />
  <Q id="459298302" qualifier_id="59" value="1, 20, 15, 21, 2, 3, 14, 8, 10, 18, 27, 22, 4, 7, 12, 28, 30, 31" />
</Event>
<Event id="2066606636" event_id="1" type_id="34" period_id="16" min="0" sec="0" team_id="148" outcome="1" x="0.0" y="0.0" timestamp="2016-08-14T13:08:35.580" last_modified="2016-08-14T15:03:52" version="1471183432594">
  <Q id="891471807" qualifier_id="194" value="171101" />
  <Q id="201984211" qualifier_id="30" value="38816, 80799, 43024, 9980, 170034, 171101, 210460, 214472, 51327, 38008, 97290, 63600, 152337, 209874, 44314, 214473, 93498, 54911" />
  <Q id="478809608" qualifier_id="131" value="1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0" />
  <Q id="974533808" qualifier_id="227" value="0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0" />
  <Q id="193300652" qualifier_id="44" value="1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5" />
  <Q id="1493018979" qualifier_id="130" value="4" />
  <Q id="454462015" qualifier_id="59" value="16, 14, 26, 25, 4, 2, 13, 6, 9, 7, 23, 1, 3, 8, 12, 17, 19, 28" />
</Event>
<Event id="931188097" event_id="2" type_id="32" period_id="1" min="0" sec="0" team_id="143" outcome="1" x="0.0" y="0.0" timestamp="2016-08-14T14:00:25.556" last_modified="2016-08-14T14:00:26" version="1471179625559">
  <Q id="674324086" qualifier_id="127" value="Right to Left" />
</Event>
<Event id="704339764" event_id="2" type_id="32" period_id="1" min="0" sec="0" team_id="148" outcome="1" x="0.0" y="0.0" timestamp="2016-08-14T14:00:25.556" last_modified="2016-08-14T14:00:27" version="1471179626429">
  <Q id="2090199938" qualifier_id="127" value="Left to Right" />
</Event>
</Game>
</Games>

I tried the "XML" and "xml2" packages, but got nothing conclusive, since I am not at all familiar with XML files.

library(xml2)
library(purrr)  # map(), flatten(), map_df() and the %>% pipe

x = read_xml("f24-24-2016-853139-eventdetails.xml")
x_list = as_list(x)
x_df <- x_list %>% map('Game') %>% flatten() %>% map_df(flatten)

Can someone explain how to handle this kind of file and how to convert it to a data frame in R? Thanks.

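Based on your sample data, here is an idea of how to extract data from the XML to build a data.frame.

The "trick" is to start from the "lowest" nodes you need and then use XPath axes to reach the related parent/ancestor/sibling nodes. In your XML the "lowest" nodes are the Q nodes, so get those first and work your way up from there.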
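
To see the two axes in isolation, here is a minimal sketch on a tiny inline document. The snippet and its shortened ids are made up for illustration and only mirror the Games > Game > Event > Q nesting of the sample; the variable names are assumptions as well.

```r
library(xml2)

# Illustrative inline XML (hypothetical short ids, same nesting as the sample)
snippet <- read_xml('<Games>
  <Game id="853139" home_team_name="Nancy">
    <Event id="1" type_id="34">
      <Q id="11" qualifier_id="194" value="59963" />
      <Q id="12" qualifier_id="130" value="4" />
    </Event>
    <Event id="2" type_id="32">
      <Q id="21" qualifier_id="127" value="Right to Left" />
    </Event>
  </Game>
</Games>')

# Start at the lowest level: every Q node in the document
q <- xml_find_all(snippet, ".//Q")

# parent:: steps exactly one level up; ancestor:: climbs any number of levels.
# xml_find_first() returns one match per input node, so the results line up with q.
parent_event_id  <- xml_attr(xml_find_first(q, "parent::Event"), "id")
ancestor_game_id <- xml_attr(xml_find_first(q, "ancestor::Game"), "id")

print(parent_event_id)   # "1" "1" "2"
print(ancestor_game_id)  # "853139" "853139" "853139"
```

Because each lookup yields one value per Q node, the resulting vectors can be used directly as columns of a data frame with one row per Q node.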

# load libraries
library(xml2)
library(magrittr)  # for the pipe operator %>%

# load xml
doc <- read_xml("f24-24-2016-853139-eventdetails.xml")

# get all Q nodes (the "lowest" nodes)
q.nodes <- xml_find_all(doc, ".//Q")

# for every Q node, look up its enclosing Game and its parent Event once
game.nodes  <- xml_find_first(q.nodes, "ancestor::Game")
event.nodes <- xml_find_first(q.nodes, "parent::Event")

# build data frame: one row per Q node
result <- data.frame(
  Game.id             = game.nodes %>% xml_attr("id") %>% as.numeric(),
  Game.away_team_name = game.nodes %>% xml_attr("away_team_name"),
  Game.home_team_name = game.nodes %>% xml_attr("home_team_name"),
  Event.id            = event.nodes %>% xml_attr("id") %>% as.numeric(),
  Q.id                = q.nodes %>% xml_attr("id"),
  Q.qualifier_id      = q.nodes %>% xml_attr("qualifier_id") %>% as.numeric(),
  stringsAsFactors    = FALSE
)

Result

head( result )
#   Game.id Game.away_team_name Game.home_team_name   Event.id       Q.id Q.qualifier_id
# 1  853139                Lyon               Nancy 1195160021 1117749718            194
# 2  853139                Lyon               Nancy 1195160021 1807420796             30
# 3  853139                Lyon               Nancy 1195160021  450557206            197
# 4  853139                Lyon               Nancy 1195160021 1671039854            131
# 5  853139                Lyon               Nancy 1195160021  108315093            227
# 6  853139                Lyon               Nancy 1195160021  582175015             44
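
The same pattern extends to other attributes, for example the Event attributes and the `value` of each Q node. The following is only a sketch under assumptions: it uses a trimmed inline copy of two events from the sample instead of the file, and the names `result2` and `Q.value_list` are made up here. Since some `value` attributes hold comma-separated lists, one possible approach is to keep the raw string and also split it into a list column.

```r
library(xml2)

# Trimmed inline copy of two events from the sample (most attributes and Q nodes omitted)
doc <- read_xml('<Games>
  <Game id="853139" away_team_name="Lyon" home_team_name="Nancy">
    <Event id="1195160021" type_id="34" team_id="143">
      <Q id="1117749718" qualifier_id="194" value="59963" />
      <Q id="582175015" qualifier_id="44" value="1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5" />
      <Q id="1069121575" qualifier_id="130" value="4" />
    </Event>
    <Event id="931188097" type_id="32" team_id="143">
      <Q id="674324086" qualifier_id="127" value="Right to Left" />
    </Event>
  </Game>
</Games>')

q.nodes     <- xml_find_all(doc, ".//Q")
event.nodes <- xml_find_first(q.nodes, "parent::Event")

# One row per Q node, with attributes of the parent Event repeated on each row
result2 <- data.frame(
  Event.id         = as.numeric(xml_attr(event.nodes, "id")),
  Event.type_id    = as.numeric(xml_attr(event.nodes, "type_id")),
  Event.team_id    = as.numeric(xml_attr(event.nodes, "team_id")),
  Q.qualifier_id   = as.numeric(xml_attr(q.nodes, "qualifier_id")),
  Q.value          = xml_attr(q.nodes, "value"),
  stringsAsFactors = FALSE
)

# Split comma-separated values into a list column; single values become length-1 vectors
result2$Q.value_list <- strsplit(result2$Q.value, ",\\s*")

print(result2[, c("Event.id", "Q.qualifier_id", "Q.value")])
```

Whether a long table like this or one row per Event with a column per qualifier is more useful depends on the analysis; the long form keeps every Q node without assuming which qualifiers an Event carries.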
