Parsing a single column of XML in a data.frame

Question

I have been given data as a data.frame, but one column contains cells that each hold several XML elements.

Something like this...

label_col = c("A", "B")

number_col = c(123, 456)

XML_col = c("<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD><CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>", 
             "<CD><TITLE>ABC</TITLE><ARTIST>XYZ</ARTIST></CD><CD><TITLE>EFG</TITLE><ARTIST>UVW</ARTIST></CD></CATALOG>")

Sample_df = data.frame(label_col, number_col, XML_col)

Now, I can see that the XML in each cell is not enclosed in a single pair of tags, so I added them:

library(dplyr)

Sample_df %>%
mutate(XML_col = paste0("<Data>",XML_col,"</Data>"))

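Now, because each XML cell contains multiple (2) items, I would like my data frame to go from 2 x 3 to one with 4 rows, one row per item, with the XML fields (such as TITLE and ARTIST) pulled out of the XML.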

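I am stuck. I have tried using unnest and unnest_longer, but I don't really understand how to do it.

Most examples of XML parsing seem to start from an XML file, rather than a mixed structure like the one above.

Can someone steer me in the right direction? (Don't say moo!)

Many thanks!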

Answer 1

I'm going to assume that the unmatched </CATALOG> tag is just a typo, and that your actual input is validated, well-formed XML.
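If you are not sure that every cell is well-formed, one way to check before parsing the whole column is to wrap read_xml so that a parse failure returns NULL instead of stopping. This is only a minimal sketch, assuming the <Data> wrapper from the question; `wrapped`, `safe_read` and `parses_ok` are names made up for illustration.

```r
library(purrr)
library(xml2)

# The second string keeps the stray </CATALOG> from the question on purpose
wrapped <- paste0("<Data>", c(
  "<CD><TITLE>ABC</TITLE><ARTIST>XYZ</ARTIST></CD>",
  "<CD><TITLE>EFG</TITLE><ARTIST>UVW</ARTIST></CD></CATALOG>"
), "</Data>")

# possibly() returns the fallback value (NULL) when read_xml() raises an error
safe_read <- possibly(read_xml, otherwise = NULL)

# TRUE where the cell parsed, FALSE where it did not
parses_ok <- !map_lgl(map(wrapped, safe_read), is.null)
parses_ok
```

Rows flagged FALSE can then be inspected or repaired before running the steps below.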

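The general steps are:

Parse the strings into the R class xml_document
Extract the target nodes as a list column
Unnest the list column

The code below demonstrates how to do this for the TITLE nodes, but it should be easy to replicate for other nodes.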

library(dplyr)
library(purrr)
library(xml2)
library(tidyr)

label_col <- c("A", "B")

number_col <- c(123, 456)

# dropped unmatched </CATALOG> tag
XML_col <- c("<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD><CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>", 
             "<CD><TITLE>ABC</TITLE><ARTIST>XYZ</ARTIST></CD><CD><TITLE>EFG</TITLE><ARTIST>UVW</ARTIST></CD>")

data.frame(label_col, number_col, XML_col) %>%
  mutate(
    XML_col = paste0("<Data>",XML_col,"</Data>"),
    XML_col = map(XML_col, read_xml),
    XML_titles = map(XML_col, ~xml_find_all(.x, ".//TITLE") %>% xml_text())
  ) %>% 
  unnest(XML_titles)
#> # A tibble: 4 x 4
#>   label_col number_col XML_col    XML_titles      
#>   <chr>          <dbl> <list>     <chr>           
#> 1 A                123 <xml_dcmn> Empire Burlesque
#> 2 A                123 <xml_dcmn> Hide your heart 
#> 3 B                456 <xml_dcmn> ABC             
#> 4 B                456 <xml_dcmn> EFG

Created on 2021-04-09 by the reprex package (v1.0.0)
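To get TITLE and ARTIST side by side, with one row per CD, the same three steps can be applied per <CD> node rather than per field. The following is a sketch and not part of the original answer; `cd_fields` is a hypothetical helper, and it assumes each <CD> holds at most one TITLE and one ARTIST.

```r
library(dplyr)
library(purrr)
library(xml2)
library(tidyr)

label_col <- c("A", "B")
number_col <- c(123, 456)
XML_col <- c("<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD><CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>",
             "<CD><TITLE>ABC</TITLE><ARTIST>XYZ</ARTIST></CD><CD><TITLE>EFG</TITLE><ARTIST>UVW</ARTIST></CD>")

# Hypothetical helper: returns one row per <CD>, one column per child field
cd_fields <- function(xml_string) {
  doc <- read_xml(paste0("<Data>", xml_string, "</Data>"))
  cds <- xml_find_all(doc, ".//CD")
  tibble(
    # xml_find_first() looks inside each <CD> in turn, so rows stay aligned
    TITLE  = xml_text(xml_find_first(cds, "./TITLE")),
    ARTIST = xml_text(xml_find_first(cds, "./ARTIST"))
  )
}

result <- data.frame(label_col, number_col, XML_col, stringsAsFactors = FALSE) %>%
  mutate(CD = map(XML_col, cd_fields)) %>%  # list column of small tibbles
  select(-XML_col) %>%
  unnest(CD)                                # one row per CD

result
```

Looking both fields up relative to the same <CD> node avoids pairing a title with the wrong artist, which could happen if TITLE and ARTIST were extracted as two separate list columns and a CD were missing one of them.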

