简体   繁体   中英

convert cellosaurus.xml file into a data.frame in R

I have an XML file that I haven't been able to get into a good data.frame format. I'm close but it's not quite there yet.

cellosaurus.xml slightly modified this file by removing everything before and after <cell-line-list> and </cell-line-list> tags

This is the messy code I've written so far:

require(XML)
require(xml2)
require(rvest)
require(dplyr)
require(xmltools)
require(stringi)
require(gtools)
setwd("~/Documents/Cancer_Cell_Lines/Cellosaurus")

file <- "cellosaurus.xml"
cellosaurus <- file %>% xml2::read_xml()
nodeset <- cellosaurus %>% xml_children()

terminal_xpaths <- nodeset[1] %>% xml_get_paths() %>% unlist() %>% unique()
terminal_nodesets <- lapply(terminal_xpaths[1], xml2::xml_find_all, x = cellosaurus)
df_list <- terminal_nodesets %>% purrr::map(xml_dig_df)
df <- lapply(df_list[[1]], function(x) as.data.frame(x))
table <- do.call("smartbind", df) 

Problem 1: There are duplicate column names that are mixed up. For example in the file there are many paths that end up at a node called cv.term like

"/cell-line-list/cell-line/disease-list/cv-term" 
"/cell-line-list/cell-line/species-list/cv-term" 
"/cell-line-list/cell-line/derived-from/cv-term" 

but in the table I get columns called cv.term , cv.term.1 , cv.term.2 but the contents are mixed up because of missing data. Is there a way to fix this.

Problem 2: The file is big and it takes a long time to run (I've only been able to test on a small subset of the full file), I haven't been able to figure out how to split the xml correctly except by splitting into as many files are there are nodes ~109,000. And then I had a hard time incorporating that many files into my code for R to read.

Any help appreciated.

To use the relational database terminology, consider data normalization. Specifically, keep your data long as most nodes in XML are practically all one-to-many lists which you can extract each one as individual long data frames and merge together by a unique id such as cell_line node number.

Fortunately, there is a great extraction tool available known as XSLT , the special purpose, declarative language (same type as SQL) designed to transform XML into various end use needs such as extracting the individual pieces that you can parse more simply into data frames and then merge all items together. The beauty too is XSLT has nothing to do with R and is portable to other application layers (Java, PHP, Python) or dedicated XSLT processors .

See process below for roadmap to final solution. All XSLT scripts below parses from a specific part of every cell-line node and flattens XML to one child level:

R

library(xml2)
library(xslt)    # INSTALL PACKAGE BEFORE HAND
library(dplyr)   # ONLY FOR bind_rows

# PARSE XML AND XSLT
doc <- read_xml('Cellosaurus.xml')
scripts <- list.files(path='/path/to/xslt/scripts', pattern='.xsl')

xpaths <- c('//accession', '//cell-line', '//hla_gene', '//marker', 
            '//name', '//species_list', '//url')

proc_xml_parse <- function(x, s) {
  style <- read_xml(s, package = "xslt")

  # TRANSFORM INPUT INTO OUTPUT
  new_xml <- xslt::xml_xslt(doc, style)

  # INNER DF LIST BUILD
  df_list <- lapply(xml_find_all(new_xml, x), function(x) { 
    vals <- xml_children(x)
    setNames(data.frame(t(xml_text(vals)), stringsAsFactors = FALSE), xml_name(vals))
  })

  bind_rows(df_list)
}

# OUTER DF LIST BUILD    
df_list <- Map(proc_xml_parse, xpaths, scripts)

# CHAIN MERGE
final_df <- Reduce(function(x,y) merge(x, y, by="cell_num", all=TRUE), df_list)

XSLT Scripts

Save each as separate .xsl or .xslt files (special .xml files) to be loaded in R above. Add more XSLT scripts by replicating patterns for other list nodes in XML as below does not capture all.

Cell Line List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Accession List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="accession-list"/>
    </xsl:template>

    <xsl:template match="accession-list">
        <xsl:apply-templates select="accession"/>
    </xsl:template>

    <xsl:template match="accession">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line[1]/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <accession_value><xsl:value-of select="."/></accession_value>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Name List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="name-list"/>
    </xsl:template>

    <xsl:template match="name-list">
        <xsl:apply-templates select="name"/>
    </xsl:template>

    <xsl:template match="name">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <name_value><xsl:value-of select="."/></name_value>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Web Page List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="web-page-list"/>
    </xsl:template>

    <xsl:template match="web-page-list">
        <xsl:apply-templates select="url"/>
    </xsl:template>

    <xsl:template match="url">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <url_value><xsl:value-of select="."/></url_value>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

HLA List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="hla-lists/hla-list"/>
    </xsl:template>

    <xsl:template match="hla-list">
        <xsl:apply-templates select="hla-gene"/>
    </xsl:template>

    <xsl:template match="hla-gene">
        <hla_gene>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <hla_value><xsl:value-of select="."/></hla_value>
        </hla_gene>
    </xsl:template>

</xsl:stylesheet>

Special List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="species-list/cv-term"/>
    </xsl:template>

    <xsl:template match="cv-term">
        <species_list>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <species_value><xsl:value-of select="."/></species_value>
        </species_list>
    </xsl:template>

</xsl:stylesheet>

Marker List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="str-list"/>
    </xsl:template>

    <xsl:template match="str-list">
        <xsl:apply-templates select="marker-list"/>
    </xsl:template>

    <xsl:template match="marker-list">
        <xsl:apply-templates select="marker"/>
    </xsl:template>

    <xsl:template match="marker">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <xsl:copy-of select="marker-data-list/marker-data/alleles"/>
        </xsl:copy>
    </xsl:template>        
</xsl:stylesheet>

Output

After chain merge where values repeat for every unique row similar to SQL joins for long data frames (many-to-many). Do note: there is a named list of data frames should you not want below merged output:

数据输出

Just one comment: when you say "~109,000 cell lines with variations in missing data between each cell-line", you need to understand that the only madatory fields in a Cellosaurus entry are the primary accession, the cell line name (identifier), the cell line category and the taxonomy, all the rest are not required. All this is described in the cellosaurus.xsd files either using "minoccurs="0" or use "optional" depending on the type of field.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM