
Read missing XML tags as 0 or NA?

I have several XML documents with the following structure:

read_xml(filename) %>% xml_find_all("//_atraso") %>% xml_structure()

[[1]]
<_atraso>
  <_omsmaximodia [_omsmaximodia]>
  <_omsmaximo [_omsmaximo]>
  <_omsmedia [_omsmedia]>
  ...
...

[[32]]
<_atraso>
  <_omsmaximo [_omsmaximo]>
  <_omsmedia [_omsmedia]>
  ...

As one can see, the _atraso parent tag has the _omsmaximodia child tag on some of the items, but not in others (in this case at index 1 the child tag is present, while at index 32 it is not).

I want to read the value of _omsmaximodia when it is present, and 0 or NA otherwise. Currently I'm reading it like this:

omsmaximodia <- read_xml(filename) %>%
  xml_find_all("//_omsmaximodia") %>%
  xml_attr("_omsmaximodia") %>%
  gsub("\\.", "", .) %>%   # drop thousands separators
  gsub(",", ".", .) %>%    # decimal comma -> decimal point
  as.numeric()

However, this reads nothing for items where the _omsmaximodia tag is not present. Running the code above returns a vector of length 29, because only 29 of the 32 items have the _omsmaximodia tag. I need the length to be 32, with 0 or NA where the tag is missing.

I could easily add NAs or 0s to the vector, but the order in which the items are read matters. That is, if item 30 did not have the _omsmaximodia tag, then the value at position 30 must be NA or 0. Simply appending 0 or NA to the end is unacceptable.

I tried using the xml_missing and the xml_has_attr functions to find out which indexes do not contain the _omsmaximodia tag, but those functions do not seem to indicate missing tags and I was unable to determine the index at which they are missing.

Any ideas?

To keep the structure of your XML document, you could apply your function to each element separately. The following example uses made-up data, since you only sketched your data structure.

# load packages and read data
library(xml2)
library(purrr)

input <- "<xml>
  <_atraso>
    <_omsmaximodia></_omsmaximodia>
  </_atraso>
  <_atraso>
  </_atraso>
</xml>"

x <- read_xml(input)
x
#> {xml_document}
#> <xml>
#> [1] <_atraso>\n  <_omsmaximodia/>\n</_atraso>
#> [2] <_atraso>\n  </_atraso>

We can find the tag of interest, but with the conventional approach we get no missing value for the second <_atraso> element:

x %>% 
  xml_find_all(".//_omsmaximodia")
#> {xml_nodeset (1)}
#> [1] <_omsmaximodia/>

To solve the problem, we step one level deeper with xml_children, and then map over each child element separately. The result for the second element is an empty nodeset. We can use map_if in combination with is_empty to turn it into a missing value.

x %>% 
  xml_children() %>% 
  map(xml_find_all, ".//_omsmaximodia") %>% 
  map_if(is_empty, ~ NA)   # replace empty nodesets with NA
#> [[1]]
#> {xml_nodeset (1)}
#> [1] <_omsmaximodia/>
#> 
#> [[2]]
#> [1] NA

Depending on what you need to do, you can employ different functions to flatten or modify the list structure.
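
As one possible way to flatten the list, the sketch below turns it into a numeric vector with one entry per <_atraso> element. It assumes, as in the question's xml_attr() call, that the value is stored in an attribute named _omsmaximodia; the input values are invented for illustration.

```r
library(xml2)
library(purrr)

# Made-up input: the value sits in an attribute named like the tag,
# as in the question's xml_attr() call (the numbers are invented).
input <- "<xml>
  <_atraso><_omsmaximodia _omsmaximodia='1.234,5'/></_atraso>
  <_atraso></_atraso>
</xml>"

raw <- read_xml(input) %>%
  xml_children() %>%
  map(xml_find_all, ".//_omsmaximodia") %>%
  # one string per parent: NA for an empty nodeset, else the first match's attribute
  map_chr(~ if (is_empty(.x)) NA_character_ else xml_attr(.x[[1]], "_omsmaximodia"))

# Same clean-up as in the question (drop ".", turn "," into "."); NA stays NA.
omsmaximodia <- as.numeric(gsub(",", ".", gsub(".", "", raw, fixed = TRUE), fixed = TRUE))

# Optional: use 0 instead of NA where the tag was missing.
omsmaximodia_zero <- ifelse(is.na(omsmaximodia), 0, omsmaximodia)
```

Here omsmaximodia has length 2, with NA in the second position, so the positions still line up with the <_atraso> elements.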

Note that the second (map-based) version is roughly 4 times slower than the first: approx. 0.75 ms per query, compared to 0.2 ms. If you run it only a few times this doesn't matter, but if you run it often (i.e. when parsing many documents), it might add up.
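
If the per-element map is a concern, an alternative that is not part of the answer above is xml_find_first(). When applied to a nodeset it returns a result of the same length as its input, with a missing node wherever nothing matched, and xml_attr() yields NA for those. This is only a sketch: it has not been timed here, and the input and the attribute-based layout are assumptions carried over from the question.

```r
library(xml2)

# Made-up input; the middle <_atraso> has no <_omsmaximodia> child.
input <- "<xml>
  <_atraso><_omsmaximodia _omsmaximodia='1.234,5'/></_atraso>
  <_atraso></_atraso>
  <_atraso><_omsmaximodia _omsmaximodia='7,25'/></_atraso>
</xml>"

parents <- xml_find_all(read_xml(input), ".//_atraso")

# One result per parent, in document order; non-matches become missing nodes.
first_match <- xml_find_first(parents, "./_omsmaximodia")
raw <- xml_attr(first_match, "_omsmaximodia")   # NA for missing nodes

# Decimal-comma clean-up as in the question; NA stays NA.
vals <- as.numeric(gsub(",", ".", gsub(".", "", raw, fixed = TRUE), fixed = TRUE))
# one value per <_atraso>: 1234.5, NA, 7.25
```

Because the lookup starts from the <_atraso> nodeset rather than from the whole document, the position of each NA matches the position of the parent element that lacks the tag.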
