
Convert XML data to a data frame in R

I am trying to convert an XML file to a data frame, but the result shows only a little information in each column.

library(XML)

# LOADING TRANSFORMED XML INTO R DATA FRAME
doc <- xmlParse("SRR12545290.xml") # https://www.ncbi.nlm.nih.gov/sra/?term=SRR12545290
xmldf <- xmlToDataFrame(doc)
head(xmldf)

This only shows:

 │EXPERIMENT                                                                                                               
1│SRX903458416S amplicon of  Atlantic salmon: distal intestinal digestaSRP279301Illumina 16S metagenomic targeted sequenci…
 │SUBMISSION
1│SRA1118818
 │Organization                                                                                                             
1│Norwegian university of life scienceDepartment of Paraclinical SciencesNorwegian university of life scienceNO-0033OsloNo…
 │STUDY                                                                                                                    
1│SRP279301PRJNA660116ArcticFloraDiet with or without functional feed ingredients were fed to Atlantic salmon through fres…
 │SAMPLE                                                                                                                   
1│SRS7285186SAMN15936598FW-Ref749906gut metagenome['Distal intestinal digesta of Atlantic salmon', 'Distal intestinal dige…
 │Pool                  │RUN_SET                          
1│SRS7285186SAMN15936598│SRR12545290SRS7285186SAMN15936598

Instead, I want to get all the information present in the XML file, such as the geographic location, host name, etc.

Here is an approach that parses the entire XML (using the xml2 package) to obtain the values of all the leaf nodes along with their path names.
I'm not sure this is exactly what you are looking for, but it is a start.

library(xml2)
library(dplyr)    
doc<-read_xml("SRR12545290.xml")


#find all the nodes
allnodes <- doc %>% xml_find_all( '//*')

#find the leaves: nodes with no child elements
#(xml_length() is applied to allnodes itself so the indices line up with it)
leafs <- which(xml_length(allnodes) == 0)

#get the value in the leafs
value <- (allnodes %>% xml_text())[leafs]

#get the path to the leaves to identify the source
name <- (allnodes %>% xml_path())[leafs]
   
#clean up naming
name <- gsub("/EXPERIMENT_PACKAGE_SET/EXPERIMENT_PACKAGE/", "", name)

#final result
data.frame(name, value)
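
If the goal is specifically fields such as host name or geographic location, a more targeted query may be easier to work with than the full list of leaves. The sketch below is only an illustration: it assumes the sample attributes are stored as SAMPLE_ATTRIBUTE elements, each with a TAG and a VALUE child, and it uses a small made-up XML string (hypothetical element contents, not taken from SRR12545290.xml) so that it runs on its own. Check the element names against the real file before relying on it.

```r
library(xml2)

# Hypothetical miniature of an SRA-style record; the real file is much
# larger and its element names and values may differ.
xml_txt <- '<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <SAMPLE>
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE><TAG>host</TAG><VALUE>Salmo salar</VALUE></SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE><TAG>geo_loc_name</TAG><VALUE>Norway</VALUE></SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>'

doc <- read_xml(xml_txt)

# one node per attribute entry
attrs <- xml_find_all(doc, "//SAMPLE_ATTRIBUTE")

# xml_find_first() returns one result per input node (NA text if the child
# is missing), so the tag and value columns stay aligned
sample_attrs <- data.frame(
  tag   = xml_text(xml_find_first(attrs, "./TAG")),
  value = xml_text(xml_find_first(attrs, "./VALUE")),
  stringsAsFactors = FALSE
)

sample_attrs
```

For the real file, replacing read_xml(xml_txt) with read_xml("SRR12545290.xml") should be the only change needed, provided the assumed element names match.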

