
How to get NA from xpathApply when a node does not exist in an XML file in R?

The problem is that some XML files omit certain nodes in some records, such as the "year" node of the second movie in the example below. xpathApply simply skips the missing node, but I want the xmlValue results with NA in place of each missing entry, in the original order. This does not seem to be the same problem as in this post.

xml_string = c(
'<?xml version="1.0" encoding="UTF-8"?>',
'<movies>',
'<movie mins="126" lang="eng">',
'<title>Good Will Hunting</title>',
'<director>',
'<first_name>Gus</first_name>',
'<last_name>Van Sant</last_name>',
'</director>',
'<year>1998</year>',
'<genre>drama</genre>',
'</movie>',
'<movie mins="106" lang="spa">',
'<title>Y tu mama tambien</title>',
'<director>',
'<first_name>Alfonso</first_name>',
'<last_name>Cuaron</last_name>',
'</director>',
'<genre>drama</genre>',
'</movie>',
'<movie mins="106" lang="spa">',
'<title>ABC</title>',
'<director>',
'<first_name>Alfonso</first_name>',
'<last_name>Cuaron</last_name>',
'</director>',
'<year>2001</year>',
'<genre>drama</genre>',
'</movie>',
'</movies>')

library(XML)
movies_xml = xmlParse(xml_string, asText = TRUE)
unlist(xpathApply(movies_xml, "//year", xmlValue))

The result is:

[1] "1998" "2001"

How can I quickly get this instead:

"1998" NA "2001"

You could write a helper function that returns NA when the node is missing and collapses multiple matching nodes into one string, then apply it to each movie node.

xmlGetValue <- function(x, node){
  # values of all matching nodes relative to x (empty if none match)
  a <- xpathSApply(x, node, xmlValue)
  if (length(a) == 0) return(NA)   # node is missing
  paste(a, collapse=", ")          # one value, or several joined
}

xpathSApply(movies_xml, "//movie", xmlGetValue, "./year")
[1] "1998" NA     "2001"
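
The example data never has more than one year per movie, so the collapsing branch of the helper is not exercised. Here is a minimal sketch of that case, assuming a made-up document in which one movie has no genre node and another has two. The sample XML and the name get_value are illustrative and not part of the original post.

```r
library(XML)

# Hypothetical sample: movie B has no <genre>, movie C has two
xml_string <- c(
  "<movies>",
  "<movie><title>A</title><genre>drama</genre></movie>",
  "<movie><title>B</title></movie>",
  "<movie><title>C</title><genre>drama</genre><genre>comedy</genre></movie>",
  "</movies>")
doc <- xmlParse(paste(xml_string, collapse = "\n"), asText = TRUE)

# Same idea as the helper above: NA if nothing matches,
# otherwise all matching values joined into one string
get_value <- function(x, node) {
  a <- xpathSApply(x, node, xmlValue)
  if (length(a) == 0) return(NA_character_)
  paste(a, collapse = ", ")
}

# One result per <movie>, in document order
genres <- xpathSApply(doc, "//movie", get_value, "./genre")
print(genres)
```

If the helper behaves as intended, the result has one element per movie, with NA for the second movie and both genres joined by a comma for the third.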

You can use an XPath boolean test per-parent node:

xpathSApply(movies_xml, "//movies/movie", function(x) {
  if (xpathSApply(x, "boolean(./year)")) {
    xpathSApply(x, "./year", xmlValue)
  } else {
    NA
  }
})

## [1] "1998" NA     "2001"

For those using xml2, here's how to do it there:

library(xml2)

doc <- read_xml(paste0(xml_string, collapse="\n"))
movies <- xml_find_all(doc, "//movies/movie")
sapply(movies, function(x) {
  # xml_find_first() replaces the deprecated xml_find_one()
  tryCatch(xml_text(xml_find_first(x, "./year")),
           error=function(err) NA)
})
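
In current versions of xml2 the per-node loop may not be needed at all. When xml_find_first() is applied to a node set, it returns one entry per input node, using a "missing" node where nothing matches, and xml_text() turns those into NA. Here is a minimal sketch, assuming a shortened version of the sample document with director and genre left out for brevity:

```r
library(xml2)

# Shortened sample: the second movie has no <year>
xml_string <- c(
  "<movies>",
  "<movie><title>Good Will Hunting</title><year>1998</year></movie>",
  "<movie><title>Y tu mama tambien</title></movie>",
  "<movie><title>ABC</title><year>2001</year></movie>",
  "</movies>")

doc <- read_xml(paste(xml_string, collapse = "\n"))
movies <- xml_find_all(doc, "//movie")

# One result per movie, in document order; NA where <year> is absent
years <- xml_text(xml_find_first(movies, "./year"))
print(years)
## [1] "1998" NA     "2001"
```

This keeps the result aligned with the movie nodes, so it can be combined with other per-movie fields extracted the same way.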

Consider converting the XML into a data frame with one row per movie node, then creating a list from the year column:

movies_xml = xmlParse(xml_string, asText = TRUE)
xmldf <- xmlToDataFrame(nodes = getNodeSet(movies_xml, "//movie"))
# select the column by name rather than by position
yearlist <- c(xmldf["year"])

Output

$year
[1] "1998" NA     "2001"
