Extracting XML data when not all parent nodes contain the child node

Question

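I have an XML data file in which users have opened an account, and in some cases the account has since been terminated. When an account has not been terminated, the data contains no termination node at all, which makes the information hard to extract.

Here is a reproducible example (only users 1 and 3 have terminated accounts):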

library(XML)
my_xml <- xmlParse('<accounts>
                    <user>
                      <id>1</id>
                      <start>2015-01-01</start>
                      <termination>2015-01-21</termination>
                    </user>
                    <user>
                      <id>2</id>
                      <start>2015-01-01</start>
                    </user>
                    <user>
                      <id>3</id>
                      <start>2015-02-01</start>
                      <termination>2015-04-21</termination>
                    </user>
                    <user>
                      <id>4</id>
                      <start>2015-03-01</start>
                    </user>
                    <user>
                      <id>5</id>
                      <start>2015-04-01</start>
                    </user>
                    </accounts>')

To build a data.frame I tried sapply, but because it does not return NA for users without a termination value, the code fails with the error: arguments imply differing number of rows: 5, 2

accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
                       start=sapply(my_xml["//user//start"], xmlValue),
                       termination=sapply(my_xml["//user//termination"], xmlValue)
                       )
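The mismatch can be seen by counting the matches for each XPath expression. The following is a minimal sketch that assumes a shortened three-user version of the data (the names doc, n_id and n_term are only for illustration):

```r
library(XML)

# Shortened sample: users 1 and 3 are terminated, user 2 is not.
doc <- xmlParse("<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
  <user><id>3</id><start>2015-02-01</start><termination>2015-04-21</termination></user>
</accounts>")

# A document-wide XPath only returns the nodes that exist,
# so the two columns end up with different lengths.
n_id   <- length(getNodeSet(doc, "//user//id"))
n_term <- length(getNodeSet(doc, "//user//termination"))
```

Because the termination query skips users without that node, there is no way to tell afterwards which user each value belongs to.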

Any suggestions on how to solve this?

Answer 1

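I prefer the xml2 package to the XML package; I find its syntax easier to use. The problem is straightforward: find all the user nodes, then parse out the id and termination nodes from each one. With xml2, xml_find_first returns a missing node when nothing matches, and xml_text returns NA for it.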

library(xml2)
my_xml <- read_xml('<accounts>
                   <user>
                   <id>1</id>
                   <start>2015-01-01</start>
                   <termination>2015-01-21</termination>
                   </user>
                   <user>
                   <id>2</id>
                   <start>2015-01-01</start>
                   </user>
                   <user>
                   <id>3</id>
                   <start>2015-02-01</start>
                   <termination>2015-04-21</termination>
                   </user>
                   <user>
                   <id>4</id>
                   <start>2015-03-01</start>
                   </user>
                   <user>
                   <id>5</id>
                   <start>2015-04-01</start>
                   </user>
                   </accounts>')

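# One node per user; every lookup below is relative to a single user node
usernodes <- xml_find_all(my_xml, ".//user")
ids <- sapply(usernodes, function(n) xml_text(xml_find_first(n, ".//id")))
# Users without a <termination> node yield NA here
terms <- sapply(usernodes, function(n) xml_text(xml_find_first(n, ".//termination")))

answer <- data.frame(ids, terms)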
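As a variation on the same idea, xml_find_first can be applied to the whole node set at once, since it returns one result per input node. The sketch below assumes a shortened three-user version of the data and also adds the start column; the names doc, users and accounts are only for illustration:

```r
library(xml2)

# Shortened sample: users 1 and 3 are terminated, user 2 is not.
doc <- read_xml("<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
  <user><id>3</id><start>2015-02-01</start><termination>2015-04-21</termination></user>
</accounts>")

users <- xml_find_all(doc, ".//user")

# xml_find_first() keeps one entry per user, so all columns have the
# same length; a user without <termination> gives NA from xml_text().
accounts <- data.frame(
  id          = xml_text(xml_find_first(users, "./id")),
  start       = xml_text(xml_find_first(users, "./start")),
  termination = xml_text(xml_find_first(users, "./termination")),
  stringsAsFactors = FALSE
)
```

All three columns are character vectors here; converting start and termination with as.Date would be a natural next step.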

Answer 2

I managed to find a solution in the question "XPath in R: return NA if node is missing":

accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
                       start=sapply(my_xml["//user//start"], xmlValue),
                       termination=sapply(xpathApply(my_xml, "//user",
                                                     function(x){
                                                     if("termination" %in% names(x))
                                                     xmlValue(x[["termination"]])
                                                     else NA}), function(x) x))
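The same per-user check could be factored into a small helper so that every column is extracted the same way. This is only a sketch: child_value is a hypothetical helper, not part of the XML package, and the data is a shortened three-user version of the example:

```r
library(XML)

# Shortened sample: users 1 and 3 are terminated, user 2 is not.
doc <- xmlParse("<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
  <user><id>3</id><start>2015-02-01</start><termination>2015-04-21</termination></user>
</accounts>")

# Hypothetical helper: text of the child named `tag`, or NA if the
# user node has no such child.
child_value <- function(node, tag) {
  if (tag %in% names(node)) xmlValue(node[[tag]]) else NA_character_
}

users <- getNodeSet(doc, "//user")

# Each column is built from the same list of user nodes,
# so the lengths always agree.
accounts <- data.frame(
  id          = sapply(users, child_value, tag = "id"),
  start       = sapply(users, child_value, tag = "start"),
  termination = sapply(users, child_value, tag = "termination"),
  stringsAsFactors = FALSE
)
```

Iterating over the user nodes, rather than over a document-wide XPath for each field, is what keeps each value aligned with its user.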

Answer 1
Score 1, accepted, 2019-03-08 13:57:32

Answer 2
Score 0, 2019-03-08 13:56:53
