
Convert cellosaurus.xml file into a data.frame in R

I have an XML file that I haven't been able to get into a good data.frame format. I'm close, but it's not quite there yet.

cellosaurus.xml: I slightly modified this file by removing everything before the <cell-line-list> tag and everything after the </cell-line-list> tag.

This is the messy code I've written so far:

require(XML)
require(xml2)
require(rvest)
require(dplyr)
require(xmltools)
require(stringi)
require(gtools)
setwd("~/Documents/Cancer_Cell_Lines/Cellosaurus")

file <- "cellosaurus.xml"
cellosaurus <- file %>% xml2::read_xml()
nodeset <- cellosaurus %>% xml_children()

terminal_xpaths <- nodeset[1] %>% xml_get_paths() %>% unlist() %>% unique()
terminal_nodesets <- lapply(terminal_xpaths[1], xml2::xml_find_all, x = cellosaurus)
df_list <- terminal_nodesets %>% purrr::map(xml_dig_df)
df <- lapply(df_list[[1]], function(x) as.data.frame(x))
table <- do.call("smartbind", df) 

Problem 1: There are duplicate column names that get mixed up. For example, many paths in the file end at a node called cv-term, such as:

"/cell-line-list/cell-line/disease-list/cv-term" 
"/cell-line-list/cell-line/species-list/cv-term" 
"/cell-line-list/cell-line/derived-from/cv-term" 

but in the table I get columns called cv.term, cv.term.1, and cv.term.2, and their contents are mixed up because of missing data. Is there a way to fix this?

Problem 2: The file is big and takes a long time to run (I've only been able to test on a small subset of the full file). I haven't been able to figure out how to split the XML correctly, except by splitting it into as many files as there are nodes (~109,000). And then I had a hard time incorporating that many files into my code for R to read.

Any help appreciated.

To borrow relational database terminology, consider data normalization. Specifically, keep your data long: most nodes in this XML are effectively one-to-many lists, so you can extract each one as its own long data frame and merge them together by a unique id, such as the cell-line node number.
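As a rough illustration of the long-format idea (and of why it avoids the mixed-up cv.term columns), here is a minimal sketch using only xml2. The two-record sample, its attribute values, and the helper name extract_long are made up for illustration; only the element paths come from the question.

```r
library(xml2)

# Hypothetical two-record sample; element paths follow those quoted in the question
doc <- read_xml('<cell-line-list>
  <cell-line>
    <species-list><cv-term accession="S1">Species A</cv-term></species-list>
    <disease-list>
      <cv-term accession="D1">Disease A</cv-term>
      <cv-term accession="D2">Disease B</cv-term>
    </disease-list>
  </cell-line>
  <cell-line>
    <species-list><cv-term accession="S2">Species B</cv-term></species-list>
  </cell-line>
</cell-line-list>')

cell_lines <- xml_find_all(doc, "//cell-line")

# Illustrative helper: one long data frame per list, keyed by cell-line position.
# rel_path is evaluated relative to each cell-line node, so cv-term nodes from
# different parent lists never end up in the same column.
extract_long <- function(rel_path) {
  rows <- lapply(seq_along(cell_lines), function(i) {
    terms <- xml_find_all(cell_lines[[i]], rel_path)
    if (length(terms) == 0) return(NULL)   # optional list absent for this cell line
    data.frame(cell_num  = i,
               accession = xml_attr(terms, "accession"),
               value     = xml_text(terms),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

species <- extract_long("species-list/cv-term")
disease <- extract_long("disease-list/cv-term")
```

Each table keeps only the rows that actually exist, so a cell line without a disease-list simply contributes no rows to disease rather than shifting values into the wrong column.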

Fortunately, there is a great extraction tool available: XSLT, a special-purpose, declarative language (the same type as SQL) designed to transform XML for various end uses, such as extracting the individual pieces that you can parse more simply into data frames and then merge together. A further benefit is that XSLT is independent of R and is portable to other application layers (Java, PHP, Python) or dedicated XSLT processors.

See the process below for a roadmap to a final solution. Each XSLT script below parses a specific part of every cell-line node and flattens the XML to one child level:

R

library(xml2)
library(xslt)    # INSTALL PACKAGE BEFOREHAND
library(dplyr)   # ONLY FOR bind_rows

# PARSE XML AND LIST XSLT SCRIPTS
doc <- read_xml('Cellosaurus.xml')
# full.names=TRUE RETURNS FULL PATHS SO read_xml CAN OPEN THEM FROM ANY WORKING DIRECTORY
scripts <- list.files(path='/path/to/xslt/scripts', pattern='\\.xslt?$', full.names=TRUE)

# xpaths MUST BE IN THE SAME ORDER AS scripts (list.files SORTS ALPHABETICALLY)
xpaths <- c('//accession', '//cell-line', '//hla_gene', '//marker',
            '//name', '//species_list', '//url')

proc_xml_parse <- function(x, s) {
  style <- read_xml(s)

  # TRANSFORM INPUT INTO OUTPUT
  new_xml <- xslt::xml_xslt(doc, style)

  # INNER DF LIST BUILD
  df_list <- lapply(xml_find_all(new_xml, x), function(x) { 
    vals <- xml_children(x)
    setNames(data.frame(t(xml_text(vals)), stringsAsFactors = FALSE), xml_name(vals))
  })

  bind_rows(df_list)
}

# OUTER DF LIST BUILD    
df_list <- Map(proc_xml_parse, xpaths, scripts)

# CHAIN MERGE
final_df <- Reduce(function(x,y) merge(x, y, by="cell_num", all=TRUE), df_list)

XSLT Scripts

Save each script as a separate .xsl or .xslt file (these are special .xml files) to be loaded in R above. The scripts below do not capture every list node, so add more XSLT scripts by replicating these patterns for the other list nodes in the XML.
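For instance, replicating the pattern for the disease-list/cv-term path mentioned in the question might look like the sketch below. It assumes the xslt package is installed and that the root element is Cellosaurus, as in the scripts that follow; the output element names (disease, disease_value) and the tiny sample document are made up for illustration, and the stylesheet is kept inline only to make the example self-contained.

```r
library(xml2)
library(xslt)

# Illustrative stylesheet following the same pattern as the scripts in this answer
style <- read_xml('<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="Cellosaurus">
    <xsl:copy>
      <xsl:apply-templates select="cell-line-list/cell-line/disease-list/cv-term"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="cv-term">
    <disease>
      <cell_num>
        <xsl:value-of select="count(ancestor::cell-line[1]/preceding-sibling::*)+1"/>
      </cell_num>
      <xsl:for-each select="@*">
        <xsl:element name="{name(.)}"><xsl:value-of select="."/></xsl:element>
      </xsl:for-each>
      <disease_value><xsl:value-of select="."/></disease_value>
    </disease>
  </xsl:template>
</xsl:stylesheet>')

# Hypothetical sample input with two cell lines
doc <- read_xml('<Cellosaurus><cell-line-list>
  <cell-line><disease-list><cv-term accession="X1">Disease A</cv-term></disease-list></cell-line>
  <cell-line><disease-list>
    <cv-term accession="X2">Disease B</cv-term>
    <cv-term accession="X3">Disease C</cv-term>
  </disease-list></cell-line>
</cell-line-list></Cellosaurus>')

# Transform, then turn each flattened <disease> node into one data frame row
out <- xml_xslt(doc, style)
disease_df <- do.call(rbind, lapply(xml_find_all(out, "//disease"), function(n) {
  kids <- xml_children(n)
  setNames(data.frame(t(xml_text(kids)), stringsAsFactors = FALSE), xml_name(kids))
}))
```

To plug a script like this into the R code above, it would be saved as its own file and a matching '//disease' entry added to xpaths in the corresponding position.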

Cell Line List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Accession List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="accession-list"/>
    </xsl:template>

    <xsl:template match="accession-list">
        <xsl:apply-templates select="accession"/>
    </xsl:template>

    <xsl:template match="accession">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line[1]/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <accession_value><xsl:value-of select="."/></accession_value>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Name List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="name-list"/>
    </xsl:template>

    <xsl:template match="name-list">
        <xsl:apply-templates select="name"/>
    </xsl:template>

    <xsl:template match="name">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <name_value><xsl:value-of select="."/></name_value>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Web Page List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="web-page-list"/>
    </xsl:template>

    <xsl:template match="web-page-list">
        <xsl:apply-templates select="url"/>
    </xsl:template>

    <xsl:template match="url">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <url_value><xsl:value-of select="."/></url_value>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

HLA List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="hla-lists/hla-list"/>
    </xsl:template>

    <xsl:template match="hla-list">
        <xsl:apply-templates select="hla-gene"/>
    </xsl:template>

    <xsl:template match="hla-gene">
        <hla_gene>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <hla_value><xsl:value-of select="."/></hla_value>
        </hla_gene>
    </xsl:template>

</xsl:stylesheet>

Species List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="species-list/cv-term"/>
    </xsl:template>

    <xsl:template match="cv-term">
        <species_list>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <species_value><xsl:value-of select="."/></species_value>
        </species_list>
    </xsl:template>

</xsl:stylesheet>

Marker List

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Cellosaurus">
        <xsl:copy>
            <xsl:apply-templates select="cell-line-list/cell-line"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="cell-line">
        <xsl:apply-templates select="str-list"/>
    </xsl:template>

    <xsl:template match="str-list">
        <xsl:apply-templates select="marker-list"/>
    </xsl:template>

    <xsl:template match="marker-list">
        <xsl:apply-templates select="marker"/>
    </xsl:template>

    <xsl:template match="marker">
        <xsl:copy>
            <cell_num>
                <xsl:value-of select="count(ancestor::cell-line/preceding-sibling::*)+1"/>
            </cell_num>
            <xsl:for-each select="@*">
                <xsl:element name="{name(.)}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <xsl:copy-of select="marker-data-list/marker-data/alleles"/>
        </xsl:copy>
    </xsl:template>        
</xsl:stylesheet>

Output

After the chain merge, values repeat for every unique row, similar to SQL joins on long data frames (many-to-many). Note that df_list is a named list of data frames, should you not want the merged output below:

[Image: data output]
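To see how the repetition arises, here is a small base R sketch of the same Reduce/merge chain on three made-up long tables. The column names mirror the XSLT outputs above, but the values are purely illustrative.

```r
# Hypothetical long tables keyed by cell_num
name_df    <- data.frame(cell_num = c(1, 1, 2),
                         name_value = c("Line A", "Line A alias", "Line B"),
                         stringsAsFactors = FALSE)
species_df <- data.frame(cell_num = c(1, 2),
                         species_value = c("Species A", "Species B"),
                         stringsAsFactors = FALSE)
url_df     <- data.frame(cell_num = c(1, 1),
                         url_value = c("http://example.org/a", "http://example.org/b"),
                         stringsAsFactors = FALSE)

df_list <- list(name = name_df, species = species_df, url = url_df)

# Same chain merge as above: a full outer join on cell_num at each step
final_df <- Reduce(function(x, y) merge(x, y, by = "cell_num", all = TRUE), df_list)

# Cell line 1: 2 names x 1 species x 2 urls = 4 rows
# Cell line 2: 1 name  x 1 species x no url = 1 row (url_value is NA)
nrow(final_df)  # 5
```

If that multiplication is unwanted, the individual tables stay available by name, for example df_list$species, and can be joined selectively.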

Just one comment: when you say "~109,000 cell lines with variations in missing data between each cell-line", you need to understand that the only mandatory fields in a Cellosaurus entry are the primary accession, the cell line name (identifier), the cell line category, and the taxonomy; all the rest are optional. All of this is described in the cellosaurus.xsd file, using either minOccurs="0" or use="optional", depending on the type of field.
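In practice this means any extraction should expect optional nodes to be absent. A minimal xml2 sketch of one way to handle that is shown below; the sample records, the attribute values, and the choice of derived-from as the optional field are assumptions for illustration. xml_find_first returns a missing node where nothing matches, so the result keeps one entry per cell line, with NA for the gaps.

```r
library(xml2)

# Hypothetical sample: the second cell line has no derived-from element
doc <- read_xml('<cell-line-list>
  <cell-line category="Category A">
    <name-list><name type="identifier">Line A</name></name-list>
    <derived-from><cv-term accession="P1">Parent line</cv-term></derived-from>
  </cell-line>
  <cell-line category="Category B">
    <name-list><name type="identifier">Line B</name></name-list>
  </cell-line>
</cell-line-list>')

cell_lines <- xml_find_all(doc, "//cell-line")

# One row per cell line; optional fields become NA instead of shifting columns
core <- data.frame(
  cell_num     = seq_along(cell_lines),
  category     = xml_attr(cell_lines, "category"),
  identifier   = xml_text(xml_find_first(cell_lines, "name-list/name[@type='identifier']")),
  derived_from = xml_text(xml_find_first(cell_lines, "derived-from/cv-term")),
  stringsAsFactors = FALSE
)
```

This only suits fields that occur at most once per cell line; repeating lists are better kept as separate long tables.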
