The problem is that some XML files do not include certain nodes in every record, such as the "year" node in the example code below. xpathApply simply skips the missing nodes, but I want to get the xmlValue results with NA for the missing entries, in the original order. This does not seem to be the same question as this post.
xml_string = c(
'<?xml version="1.0" encoding="UTF-8"?>',
'<movies>',
'<movie mins="126" lang="eng">',
'<title>Good Will Hunting</title>',
'<director>',
'<first_name>Gus</first_name>',
'<last_name>Van Sant</last_name>',
'</director>',
'<year>1998</year>',
'<genre>drama</genre>',
'</movie>',
'<movie mins="106" lang="spa">',
'<title>Y tu mama tambien</title>',
'<director>',
'<first_name>Alfonso</first_name>',
'<last_name>Cuaron</last_name>',
'</director>',
'<genre>drama</genre>',
'</movie>',
'<movie mins="106" lang="spa">',
'<title>ABC</title>',
'<director>',
'<first_name>Alfonso</first_name>',
'<last_name>Cuaron</last_name>',
'</director>',
'<year>2001</year>',
'<genre>drama</genre>',
'</movie>',
'</movies>')
library(XML)
movies_xml = xmlParse(xml_string, asText = TRUE)
unlist(xpathApply(movies_xml, "//year", xmlValue))
The result is:
[1] "1998" "2001"
How can I quickly get this instead:
"1998" NA "2001"
You could write a function to replace missing nodes with NA and collapse multiple nodes.
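xmlGetValue <- function(x, node) {
  a <- xpathSApply(x, node, xmlValue)
  # No match: return NA so the parent still contributes one element
  if (length(a) == 0) return(NA_character_)
  # One or more matches: collapse them into a single string
  paste(a, collapse = ", ")
}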
xpathSApply(movies_xml, "//movie", xmlGetValue, "./year")
[1] "1998" NA "2001"
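This keeps the original order because the XPath is evaluated once per movie, so every movie contributes exactly one element. A minimal base R sketch of just that padding step, with no XML involved, might look like the following. The list `per_movie` and the helper `pad_missing` are hypothetical names used only for illustration, and the two-value entry is made-up data to show the collapsing case:

```r
# Hypothetical per-movie results: character(0) stands for a movie with no <year>,
# and the last entry stands for a movie with two matching nodes.
per_movie <- list(c("1998"), character(0), c("2001", "2002"))

# Illustrative helper: one value per parent, NA when nothing matched,
# multiple matches collapsed into a single string.
pad_missing <- function(a) {
  if (length(a) == 0) NA_character_ else paste(a, collapse = ", ")
}

# vapply guarantees exactly one character value per input element.
years <- vapply(per_movie, pad_missing, character(1))
print(years)
```

Using vapply with character(1) rather than sapply means the call fails loudly if the helper ever returns something other than a single string.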
You can use an XPath boolean test on each parent node:
xpathSApply(movies_xml, "//movies/movie", function(x) {
if (xpathSApply(x, "boolean(./year)")) {
xpathSApply(x, "./year", xmlValue)
} else {
NA
}
})
## [1] "1998" NA "2001"
For those using xml2, here's how to do it there:
library(xml2)
doc <- read_xml(paste0(xml_string, collapse="\n"))
movies <- xml_find_all(doc, "//movies/movie")
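# xml_find_one() is deprecated in newer xml2 releases. xml_find_first()
# returns a missing node when nothing matches, and xml_text() reports that
# as NA, so no tryCatch is needed.
xml_text(xml_find_first(movies, "./year"))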
Consider converting the XML into a data frame with one row per movie node, then building a list from the year column:
movies_xml = xmlParse(xml_string, asText = TRUE)
xmldf <- xmlToDataFrame(nodes = getNodeSet(movies_xml, "//movie"))
yearlist <- c(xmldf["year"])
Output
$year
[1] "1998" NA "2001"