Extracting text from XML when the file has repeated node names

Question

I'm trying to import some data from an XML file into an R data.frame. While I'm pretty experienced with R, I have never worked with XML before, so all of this is new to me and I feel a little lost.

A sample of the XML is provided below:

<ArchivedIncident ID="100">
    <attributes>
        <entry>
            <key>TEST1</key>
            <value>
                <type>S</type>
                <value/>
            </value>
        </entry>
        <entry>
            <key>TEST2</key>
            <value>
                <type>S</type>
                <value>12</value>
            </value>
        </entry>
        <entry>
            <key>TEST3</key>
            <value>
                <type>T</type>
                <value>A</value>
            </value>
        </entry>
        <entry>
            <key>TEST4</key>
            <value>
                <type>S</type>
                <value/>
            </value>
        </entry>
    </attributes>
</ArchivedIncident>
<ArchivedIncident ID="101">
    <attributes>
        <entry>
            <key>TEST1</key>
            <value>
                <type>S</type>
                <value>BLAH</value>
            </value>
        </entry>
        <entry>
            <key>TEST2</key>
            <value>
                <type>S</type>
                <value/>
            </value>
        </entry>
        <entry>
            <key>TEST3</key>
            <value>
                <type>T</type>
                <value/>
            </value>
        </entry>
        <entry>
            <key>TEST4</key>
            <value>
                <type>S</type>
                <value/>
            </value>
        </entry>
    </attributes>
</ArchivedIncident>

What I would like to end up with is an R data.frame that looks like this:

ID     TEST1    TEST2    TEST3    TEST4
100    NA       12       A        NA
101    BLAH     NA       NA       NA

What I have come up with so far:

Using the xml2 package, I can read the IDs using:

require(xml2)
doc <- read_xml("./data/file.xml")
df <- data.frame( 
  ID = xml_attr( xml_find_all( doc, ".//ArchivedIncident" ), "ID" )
  )

So far so good, but now I'm lost on how to extract the rest. There are multiple nodes, all named "entry", "value" and "type". How can I extract the text from the <key> (for use as a column name), and the value for this key (which is the <value> following the <type> of that <key>)?

A complicating factor is that not every <key> has a value. I would like to insert NA for the empty values. In another situation, I was able to use a custom function for this, but I'm not sure (since I don't know how to extract the right text) whether it will work here.

L <- xml_find_all(doc, ".//ArchivedIncident")
# For each incident in L, collect the text of every descendant named `node`
FindAllValues <- function(node){
    tmp <- lapply(L, xml_find_all, paste0(".//", node))
    tmp <- lapply(tmp, xml_text)
    # incidents with no matching node get NA
    tmp[lengths(tmp) == 0] <- NA
    return(tmp)
}

Answer 1

library(xml2)
library(tidyverse)

doc <- read_xml("file.xml")

xml_find_all(doc, ".//ArchivedIncident") %>% # iterate over each incident
  map_df(~{
    set_names(
      xml_find_all(.x, ".//value/value") %>% xml_text(), # get entry values
      xml_find_all(.x, ".//key") %>% xml_text()          # get entry keys (column names)
    ) %>% 
      as.list() %>%                                      # turn named vector to list
      flatten_df() %>%                                   # and list to df
      mutate(ID = xml_attr(.x, "ID"))                    # add id
  }) %>%
  type_convert() %>% # let R convert the values for you
  select(ID, everything()) # get it in the order you likely want
## # A tibble: 2 x 5
##      ID TEST1 TEST2 TEST3 TEST4
##   <int> <chr> <int> <chr> <chr>
## 1   100  <NA>    12     A  <NA>
## 2   101  BLAH    NA  <NA>  <NA>
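
For readers who prefer to avoid the tidyverse helpers, the same idea can be sketched with only xml2 and base R. This is not the answer author's code, just a minimal variant under a few assumptions: the sample is shortened and wrapped in a hypothetical `<root>` element (an XML document needs a single root), and the names `xml_txt`, `rows` and `all_cols` are made up for illustration. Instead of relying on keys and values lining up by position, it looks up the `<key>` and the inner `<value>` per `<entry>`.

```r
library(xml2)

# Hypothetical inline sample: the question's two incidents, compacted and
# wrapped in an assumed <root> element so the string is well-formed XML.
xml_txt <- '<root>
  <ArchivedIncident ID="100"><attributes>
    <entry><key>TEST1</key><value><type>S</type><value/></value></entry>
    <entry><key>TEST2</key><value><type>S</type><value>12</value></value></entry>
    <entry><key>TEST3</key><value><type>T</type><value>A</value></value></entry>
    <entry><key>TEST4</key><value><type>S</type><value/></value></entry>
  </attributes></ArchivedIncident>
  <ArchivedIncident ID="101"><attributes>
    <entry><key>TEST1</key><value><type>S</type><value>BLAH</value></value></entry>
    <entry><key>TEST2</key><value><type>S</type><value/></value></entry>
    <entry><key>TEST3</key><value><type>T</type><value/></value></entry>
    <entry><key>TEST4</key><value><type>S</type><value/></value></entry>
  </attributes></ArchivedIncident>
</root>'
doc <- read_xml(xml_txt)

incidents <- xml_find_all(doc, ".//ArchivedIncident")

# One named character vector per incident:
# names come from <key>, values from the inner <value> of the same <entry>.
rows <- lapply(incidents, function(inc) {
  entries <- xml_find_all(inc, ".//entry")
  keys <- xml_text(xml_find_first(entries, "./key"))
  vals <- xml_text(xml_find_first(entries, "./value/value"))
  vals[!nzchar(vals)] <- NA_character_   # empty <value/> becomes NA
  c(ID = xml_attr(inc, "ID"), setNames(vals, keys))
})

# Align every row on the union of column names, then bind into a data.frame.
all_cols <- unique(unlist(lapply(rows, names)))
df <- as.data.frame(
  do.call(rbind, lapply(rows, function(r) r[all_cols])),
  stringsAsFactors = FALSE
)
names(df) <- all_cols

# Let R guess column types (e.g. ID and TEST2 become integer).
df[] <- lapply(df, type.convert, as.is = TRUE)
print(df)
```

Pairing key and value within each `<entry>` keeps the columns correct even if an entry were missing its inner `<value>` altogether, since `xml_find_first()` returns a missing node (and `xml_text()` returns NA) in that case.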

Answer 1: accepted, score 2, 2018-01-27 16:36:41