
Import XML file to data frame in R

I am having difficulties importing XML files with a specific structure to a dataframe in R.

An example of the XML file can be found here: XML to import

The end result should be a data frame that is flattened all the way down to the deepest nesting level (Adms) of the XML file: example

This would mean that data is repeated in other columns but that is not a problem.

I've tried multiple solutions found on StackOverflow for XML import, but because of the structure of my XML file I cannot get any of them to work.

For the moment I therefore have to go through Excel and convert the XML to CSV with the "GetData" option, but as I have hundreds of these XML files to process I would like to automate this task.

Thank you in advance for your help!

This will give you one patient per row:

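library(xml2)
library(XML)
library(tidyverse)

# Read the file with xml2, then hand the document to XML::xmlParse()
xml <-
  read_xml("~/Downloads/00000123456_0000071234567123_20150922101212_TH.XML") %>%
  xmlParse()

# Turn every <Patient> node into one row of a data frame
xml %>%
  xmlToDataFrame(nodes = getNodeSet(xml, "//Patient")) %>%
  as_tibble()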
#> # A tibble: 12 x 15
#>    Id      Name      Firstname HomeID Location1   Location2  Location3 Location4
#>    <chr>   <chr>     <chr>     <chr>  <chr>       <chr>      <chr>     <chr>    
#>  1 360923… Van den … Freya     002    VILLA TRAN… 1e verdie… Sectie 1  104      
#>  2 211209… Verhofst… Guy       004    VILLA TRAN… 1e verdie… Sectie 1  102      
#>  3 410630… Hanckeli… Laurette  005    VILLA TRAN… 1e verdie… Sectie 1  102      
#>  4 251019… Smaak     Antoinet… 007    VILLA TRAN… 1e verdie… Sectie 1  101      
#>  5 281213… Areno     Marie     008    VILLA TRAN… 1e verdie… Sectie 1  103      
#>  6 190219… De Waen   Patrick   010    VILLA TRAN… 2e verdie… Sectie 2  203      
#>  7 271023… Vande Ma… Johan     012    VILLA TRAN… 2e verdie… Sectie 2  201      
#>  8 311215… Dirupa    Elise     014    VILLA TRAN… 2e verdie… Sectie 2  202      
#>  9 320704… Zomers    Bertha    015    VILLA TRAN… 2e verdie… Sectie 1  202      
#> 10 100112… Daerdenne Micheline 019    VILLA TRAN… 2e verdie… Sectie 2  204      
#> 11 461217… Schoeppe  Antoine   021    VILLA TRAN… 1e verdie… Sectie 1  101      
#> 12 201114… Vanrompu… Germain   022    VILLA TRAN… 2e verdie… Sectie 2  206      
#> # … with 7 more variables: Location5 <chr>, Birthdate <chr>, DoctorName <chr>,
#> #   DoctorMedRegNr <chr>, PatientUnidose <chr>, Shortstay <chr>, Products <chr>

Created on 2022-02-16 by the reprex package (v2.0.0)
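
Since the question mentions hundreds of files, the same one-row-per-patient step could be wrapped in a function and applied to every file in a folder. The following is only a sketch under assumptions: read_patients() is a hypothetical helper name, the folder and the two tiny XML files are made-up stand-ins created in a temporary directory, and the real files may need extra handling.

```r
library(XML)
library(dplyr)

# Hypothetical helper: parse one file and return one row per <Patient>.
read_patients <- function(path) {
  doc <- xmlParse(path)
  xmlToDataFrame(nodes = getNodeSet(doc, "//Patient"), stringsAsFactors = FALSE)
}

# Stand-in input: two tiny files in a temporary folder (made-up structure and values).
xml_dir <- tempfile("xml_demo")
dir.create(xml_dir)
writeLines("<Export><Patient><Id>1</Id><Name>A</Name></Patient></Export>",
           file.path(xml_dir, "a.xml"))
writeLines(paste0("<Export><Patient><Id>2</Id><Name>B</Name></Patient>",
                  "<Patient><Id>3</Id><Name>C</Name></Patient></Export>"),
           file.path(xml_dir, "b.xml"))

# Collect all XML files, read each one, and stack the results.
# The file name is kept in a "source_file" column so rows stay traceable.
files <- list.files(xml_dir, pattern = "\\.xml$", ignore.case = TRUE,
                    full.names = TRUE)

all_patients <- files %>%
  setNames(basename(files)) %>%
  lapply(read_patients) %>%
  bind_rows(.id = "source_file")

all_patients
```

For real data, xml_dir would point at the folder that holds the exported files, and the combined table could then be written out with write.csv() if a CSV is still needed.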

Since each patient's Products element is itself nested and cannot be spread across the columns of the patient table, one can keep it in a single list-column:

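read_xml("~/Downloads/00000123456_0000071234567123_20150922101212_TH.XML") %>%
  xml_find_all("//Patient") %>%
  as_list() %>%
  map(~ {
    .x %>%
      # one row per child element of the patient (name, value)
      enframe() %>%
      # set the nested Products aside; only the simple fields become columns
      filter(name != "Products") %>%
      unnest_auto(value) %>%
      pivot_wider() %>%
      # re-attach Products as a list-column
      mutate(Products = list(.x$Products))
  }) %>%
  bind_rows()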
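
To get closer to what the question asks for (one row per deepest-level record, with the patient fields repeated), one possible approach is to start from the deepest nodes and collect the simple fields of each ancestor on the way up. This is a minimal sketch, not a tested solution for the real files: the element names Product, Adm, Code and Time in the sample document are assumptions (only Patient, Products and Adms appear in the post), and leaf_fields() and flatten_to() are hypothetical helper names.

```r
library(xml2)
library(dplyr)
library(tibble)

# Hypothetical helper: the text-only (leaf) children of one node as a named
# list, with names prefixed by the node's own element name.
leaf_fields <- function(node) {
  kids <- xml_children(node)
  leaves <- kids[xml_length(kids) == 0]
  if (length(leaves) == 0) return(list())
  setNames(as.list(xml_text(leaves)),
           paste(xml_name(node), xml_name(leaves), sep = "_"))
}

# Hypothetical helper: one row per node matched by `xpath`, carrying the leaf
# fields of all its ancestors (outermost first) followed by its own.
flatten_to <- function(doc, xpath) {
  rows <- lapply(xml_find_all(doc, xpath), function(node) {
    ancestors <- rev(lapply(xml_parents(node), leaf_fields))
    parts <- c(ancestors, list(leaf_fields(node)))
    as_tibble(unlist(parts, recursive = FALSE), .name_repair = "unique")
  })
  bind_rows(rows)
}

# Made-up sample document; the real file's element names may differ.
doc <- read_xml(paste0(
  "<Export><Patient><Id>1</Id><Name>A</Name><Products><Product>",
  "<Code>P1</Code><Adms><Adm><Time>08:00</Time></Adm>",
  "<Adm><Time>20:00</Time></Adm></Adms></Product></Products>",
  "</Patient></Export>"
))

flat <- flatten_to(doc, "//Adm")
flat
```

Prefixing each column with its parent element name keeps fields from different levels apart when they share a name. The XPath passed to flatten_to() would have to be adapted to whatever the deepest repeating element is actually called in the real files.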
