Extract text from XML, but file has duplicated node names
I'm trying to import some data from an XML file into an R data.frame. While I'm fairly experienced with R, I have never worked with XML before, so all of this is new to me and I feel a little lost.
A sample of the XML is provided below:
<ArchivedIncident ID="100">
  <attributes>
    <entry>
      <key>TEST1</key>
      <value>
        <type>S</type>
        <value/>
      </value>
    </entry>
    <entry>
      <key>TEST2</key>
      <value>
        <type>S</type>
        <value>12</value>
      </value>
    </entry>
    <entry>
      <key>TEST3</key>
      <value>
        <type>T</type>
        <value>A</value>
      </value>
    </entry>
    <entry>
      <key>TEST4</key>
      <value>
        <type>S</type>
        <value/>
      </value>
    </entry>
  </attributes>
</ArchivedIncident>
<ArchivedIncident ID="101">
  <attributes>
    <entry>
      <key>TEST1</key>
      <value>
        <type>S</type>
        <value>BLAH</value>
      </value>
    </entry>
    <entry>
      <key>TEST2</key>
      <value>
        <type>S</type>
        <value/>
      </value>
    </entry>
    <entry>
      <key>TEST3</key>
      <value>
        <type>T</type>
        <value/>
      </value>
    </entry>
    <entry>
      <key>TEST4</key>
      <value>
        <type>S</type>
        <value/>
      </value>
    </entry>
  </attributes>
</ArchivedIncident>
What I would like to end up with is an R data.frame that looks like this:
ID   TEST1  TEST2  TEST3  TEST4
100  NA     12     A      NA
101  BLAH   NA     NA     NA
What I have come up with so far:
Using the xml2 package, I can read the IDs with:
library(xml2)
doc <- read_xml("./data/file.xml")
# one row per <ArchivedIncident>, taken from its ID attribute
df <- data.frame(
  ID = xml_attr(xml_find_all(doc, ".//ArchivedIncident"), "ID")
)
So far so good, but now I'm lost on how to extract the rest. There are multiple nodes, all named "entry", "value" and "type". How can I extract the text of each "key" node (for use as a column name) and the value for that key (the inner "value" node that follows the "type" node in the same "entry")?
A complicating factor is that not every key has a value. I would like to insert NA for the empty values. In another situation I was able to use a custom function for this, but since I don't know how to extract the right text, I'm not sure whether it will work here.
L <- xml_find_all(doc, ".//ArchivedIncident")
# For each incident, collect the text of every descendant named `node`;
# incidents that contain no such node get NA.
FindAllValues <- function(node) {
  tmp <- lapply(L, xml_find_all, paste0(".//", node))
  tmp <- lapply(tmp, xml_text)
  tmp[lengths(tmp) == 0] <- NA
  tmp
}
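One detail that matters for the NA requirement above: an empty element such as `<value/>` is still a node, so a length check like the one in `FindAllValues` will not flag it. The sketch below, using a single hand-written entry as assumed input, shows the difference between an empty node (text is `""`) and a node that does not exist at all (`xml_find_first()` returns a missing node, whose text is `NA`). The name `nothing` is a made-up element used only to trigger the missing case.

```r
library(xml2)

# Assumed one-entry input, modelled on the sample above
entry <- read_xml(
  "<entry><key>TEST1</key><value><type>S</type><value/></value></entry>"
)

# The node exists but is empty: xml_text() gives an empty string
val <- xml_text(xml_find_first(entry, "./value/value"))

# The node does not exist: xml_text() on the missing node gives NA
absent <- xml_text(xml_find_first(entry, "./value/nothing"))

# Turn empty strings into NA explicitly
val[!is.na(val) & val == ""] <- NA
```

So empty strings have to be converted to NA as a separate step, either by hand as here or by a later type-conversion step that treats `""` as missing.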
library(xml2)
library(tidyverse)
doc <- read_xml("file.xml")
xml_find_all(doc, ".//ArchivedIncident") %>%                # iterate over each incident
  map_df(~{
    set_names(
      xml_find_all(.x, ".//value/value") %>% xml_text(),    # get entry values
      xml_find_all(.x, ".//key") %>% xml_text()             # get entry keys (column names)
    ) %>%
      as.list() %>%                                         # turn named vector to list
      flatten_df() %>%                                      # and list to df
      mutate(ID = xml_attr(.x, "ID"))                       # add id
  }) %>%
  type_convert() %>%                                        # let R convert the values for you
  select(ID, everything())                                  # get it in the order you likely want
## # A tibble: 2 x 5
##      ID TEST1 TEST2 TEST3 TEST4
##   <int> <chr> <int> <chr> <chr>
## 1   100 <NA>     12 A     <NA>
## 2   101 BLAH     NA <NA>  <NA>
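The code above pairs keys and values by position within each incident, which works as long as every entry has exactly one key and one inner value node. As a hedged alternative, here is a sketch that looks up the key and value per entry with `xml_find_first()` and reshapes with base R only. The `<root>` wrapper and the shortened two-key input are assumptions made so the snippet is self-contained; the names `long` and `wide` are illustrative.

```r
library(xml2)

# Assumed input: a cut-down version of the sample, wrapped in a hypothetical
# <root> element because an XML document needs a single top-level element.
xml_txt <- '<root>
<ArchivedIncident ID="100"><attributes>
<entry><key>TEST1</key><value><type>S</type><value/></value></entry>
<entry><key>TEST2</key><value><type>S</type><value>12</value></value></entry>
</attributes></ArchivedIncident>
<ArchivedIncident ID="101"><attributes>
<entry><key>TEST1</key><value><type>S</type><value>BLAH</value></value></entry>
<entry><key>TEST2</key><value><type>S</type><value/></value></entry>
</attributes></ArchivedIncident>
</root>'
doc <- read_xml(xml_txt)

# One row per <entry>: xml_find_first() returns exactly one result per entry,
# so ID, key and value stay aligned even if a node is absent.
entries <- xml_find_all(doc, ".//ArchivedIncident/attributes/entry")
long <- data.frame(
  ID    = xml_attr(xml_find_first(entries, "ancestor::ArchivedIncident"), "ID"),
  key   = xml_text(xml_find_first(entries, "./key")),
  value = xml_text(xml_find_first(entries, "./value/value")),
  stringsAsFactors = FALSE
)

# Empty elements come back as "", so convert them to NA
long$value[!is.na(long$value) & long$value == ""] <- NA

# Long to wide: one row per ID, one column per key
wide <- reshape(long, idvar = "ID", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
```

All columns in `wide` are character here; something like `utils::type.convert(wide, as.is = TRUE)` could be applied afterwards if numeric columns are wanted.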