
XML and XSLT-V2 processing or unnest deeply nested lists in R - data from the Pharmaceutical Benefits Scheme

Question:

My colleagues and I have been trying for a long time to convert the Pharmaceutical Benefits Scheme V3 XML ( https://www.pbs.gov.au/browse/downloads ) into a usable format, with limited success. I've been able to convert the PBS XML V3 into an R list of lists; however, unpacking that list programmatically into a usable format has been very difficult.

library(xml2)

PBS_V3_XML_Doc <- read_xml(
  x = Path_to_PBS_V3_XML_File,
  # Important options that enable the import; they lift hardcoded
  # limitations that the massive file size would otherwise hit
  options = c("RECOVER", "NOBLANKS", "HUGE"),
  verbose = TRUE  # Enables message print feedback
)


PBS_V3_XML_NameSpaces <- xml_ns(PBS_V3_XML_Doc)        # XML namespaces
PBS_V3_XML_List <- xml2::as_list(x = PBS_V3_XML_Doc)   # converts the XML document into an R list object
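
The namespace table returned by xml_ns() can also be used to query the document directly with XPath, rather than going through the nested list. The following is only a minimal sketch on a made-up toy document: the element names item and code are hypothetical and not taken from the PBS schema. It relies on xml2 exposing a default namespace under the prefix d1.

```r
library(xml2)

# Toy document with a default namespace; element names are hypothetical
toy <- read_xml('<root xmlns="http://schema.pbs.gov.au/">
  <item><code>A1</code></item>
  <item><code>B2</code></item>
</root>')

toy_ns <- xml_ns(toy)   # the default namespace is listed under the prefix "d1"

# Namespace-aware XPath: select every <code> element, then extract its text
codes <- xml_text(xml_find_all(toy, "//d1:code", toy_ns))
```

Without the d1: prefix the XPath would match nothing, because the elements live in the default namespace.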

I've tried combinations of unnest_wider, unnest_longer, unnest, unlist, unnest_auto, and many more (some below), but we haven't had any luck.

Test_docall_unnest <- do.call(c, unlist(PBS_V3_XML_List, recursive=FALSE))

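flattenlist <- function(x) {
  # Flag the elements that are themselves lists
  morelists <- sapply(x, function(xprime) class(xprime)[1] == "list")
  # Keep the non-list elements and lift each nested list up one level
  out <- c(x[!morelists], unlist(x[morelists], recursive = FALSE))
  if (sum(morelists)) {
    Recall(out)  # nested lists remain, so flatten again
  } else {
    return(out)
  }
}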

library(dplyr)   # mutate_if
library(purrr)   # map_if
library(tidyr)   # unnest

recursive_unnest <- function(.data) {
  # Exit condition: no more 'children' list-column
  if (!"children" %in% names(.data) || !is.list(.data[["children"]])) return(.data)
  x <- unnest(.data)
  # Minor clean-up to make unnest work: replace NULLs with empty data frames
  x <- mutate_if(x, is.list,
                 ~ map_if(.x, ~ is.null(.x) || identical(.x, list()), ~ data.frame(date = NA)))
  Recall(x)
}
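
One likely reason the flattening attempts above are hard to turn into a table: once a list produced by as_list() is fully flattened, repeated records share the same names and the boundaries between records are lost. A small sketch on a hypothetical toy document (not PBS data, element names invented for illustration):

```r
library(xml2)

# Two repeated <item> records; the structure is made up for illustration
toy <- read_xml("<root><item><code>A1</code><name>alpha</name></item><item><code>B2</code><name>beta</name></item></root>")
toy_list <- as_list(toy)

# Flatten everything into one named character vector
flat <- unlist(toy_list)

# Both records contribute identical names, so nothing marks where one ends
dup_names <- anyDuplicated(names(flat)) > 0
```

The values survive in document order, but rebuilding rows from flat would require knowing in advance how many fields each record has, which is not guaranteed for optional elements.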

Using stylesheets might have worked, but the xslt package doesn't accept XSLT stylesheets that are version 2.0 or higher, so sadly that package doesn't help in this situation. I've tried a few of the XSL stylesheets that ship with the PBS V3 schema, but I'm honestly unsure which stylesheet I should be using (I have tried the file within the /xsl folder).

[Screenshots: the error displayed when trying to apply the XSL to the XML; R terminated]

Answer:

Why not use an XSLT 1.0 script, which R's xslt package can run? An XSLT 1.0 script can flatten the needed nodes and all of their descendants, limited to the nodes that carry text. The repeated node appears to be the <previous> tag. Once flattened, the transformed XML migrates easily into a data frame. Each descendant's element name is prefixed with the names of its grandparent and parent, so that the resulting names are complete enough to serve as data frame columns.

XSLT (save as .xsl, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:pbs="http://schema.pbs.gov.au/"
                              xmlns:p="http://pbs.gov.au/"
                              xmlns:dbk="http://docbook.org/ns/docbook"
                              xmlns:db="http://docbook.org/ns/docbook#"
                              xmlns:dc="http://purl.org/dc/elements/1.1/"
                              xmlns:dct="http://purl.org/dc/terms/"
                              xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                              xmlns:owl="http://www.w3.org/2002/07/owl#"
                              xmlns:skos="http://www.w3.org/2004/02/skos/core#"
                              xmlns:svg="http://www.w3.org/2000/svg"
                              xmlns:ext="http://extension.schema.pbs.gov.au/"
                              xmlns:xlink="http://www.w3.org/1999/xlink">
    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="pbs:root">
     <xsl:copy>
       <xsl:copy-of select="@*"/>
       <xsl:apply-templates select="pbs:previous-list"/>
     </xsl:copy>
    </xsl:template>
    
    <xsl:template match="pbs:previous-list">
       <xsl:apply-templates select="pbs:previous"/>
    </xsl:template>
    
    <xsl:template match="pbs:previous">
     <data>
       <xsl:for-each select="descendant::*">
           <xsl:if test="text() != ''">
           <xsl:element name="{translate(concat(local-name(../parent::*), '_', local-name(parent::*), '_', local-name()), '-', '_')}">
               <xsl:value-of select="text()"/>
           </xsl:element>
           </xsl:if>
       </xsl:for-each>
     </data>
    </xsl:template>
    
</xsl:stylesheet>

XSLT Demo

R

library(xml2)
library(xslt)
library(dplyr)

# INPUT SOURCE
doc <- read_xml("/path/to/sch-2020-09-01-r1.xml")
style <- read_xml("/path/to/style.xsl")

# TRANSFORM 
new_xml <- xml_xslt(doc, style)

# RETRIEVE data NODES
recs <- xml_find_all(new_xml, "//data")

# BIND EACH CHILD TEXT AND NAME TO ONE-ROW DFs
df_list <- lapply(recs, function(r) 
  data.frame(rbind(setNames(xml_text(xml_children(r)), 
                            xml_name(xml_children(r)))),
             stringsAsFactors = FALSE)
)

# BIND ALL DFs TO SINGLE MASTER DF
final_df <- dplyr::bind_rows(df_list)
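
To see what the transformation produces without loading the full PBS file, the same steps can be tried on a tiny input. This is only a sketch under stated assumptions: the toy document and its drug, code and name elements are hypothetical and not taken from the PBS schema, the stylesheet is a trimmed copy of the one above with only the pbs namespace declared, and the raw-string syntax r"---(...)---" needs R 4.0.0 or later.

```r
library(xml2)
library(xslt)
library(dplyr)

# Hypothetical toy input: two <previous> records in the PBS namespace
toy_doc <- read_xml(r"---(<root xmlns="http://schema.pbs.gov.au/"><previous-list>
  <previous><drug><code>A1</code><name>alpha</name></drug></previous>
  <previous><drug><code>B2</code><name>beta</name></drug></previous>
</previous-list></root>)---")

# Trimmed copy of the stylesheet above (same templates, fewer namespaces)
toy_style <- read_xml(r"---(<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:pbs="http://schema.pbs.gov.au/">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="pbs:root">
    <xsl:copy><xsl:apply-templates select="pbs:previous-list"/></xsl:copy>
  </xsl:template>
  <xsl:template match="pbs:previous-list">
    <xsl:apply-templates select="pbs:previous"/>
  </xsl:template>
  <xsl:template match="pbs:previous">
    <data>
      <xsl:for-each select="descendant::*">
        <xsl:if test="text() != ''">
          <xsl:element name="{translate(concat(local-name(../parent::*), '_', local-name(parent::*), '_', local-name()), '-', '_')}">
            <xsl:value-of select="text()"/>
          </xsl:element>
        </xsl:if>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>)---")

# Transform, then turn each <data> node into a one-row data frame
toy_xml <- xml_xslt(toy_doc, toy_style)
recs <- xml_find_all(toy_xml, "//data")
toy_df <- bind_rows(lapply(recs, function(r) {
  kids <- xml_children(r)
  data.frame(rbind(setNames(xml_text(kids), xml_name(kids))),
             stringsAsFactors = FALSE)
}))
```

On this toy input, toy_df should have one row per <previous> record and the columns previous_drug_code and previous_drug_name, which illustrates how the grandparent and parent names end up in each column name. The <drug> element itself produces no column because it holds no text of its own.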
