How to access values of sub-nodes (child) with different names in an XML file?
I am trying to parse the xmlValue of certain child nodes from an NCBI XML file. However, for some PMIDs the root node <PubmedArticleSet> contains different kinds of PubMed records: PubmedBookArticle and PubmedArticle. I would like to apply a condition: if xmlName(fetch.pubmed) == "PubmedBookArticle", extract certain values; else if xmlName(fetch.pubmed) == "PubmedArticle", extract other values. Finally, I want to build a data frame holding both sets of values matched to their PMIDs. It seems simple, but xmlName(fetch.pubmed) throws this error:
no applicable method for 'xmlName' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')"
Any help is appreciated, thank you.
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2015//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_150101.dtd">
<PubmedArticleSet>
<PubmedBookArticle>
<BookDocument>
<PMID Version="1">25506969</PMID>
<ArticleIdList>
<ArticleId IdType="bookaccession">NBK259188</ArticleId>
</ArticleIdList> ....
...... </BookDocument>
</PubmedBookArticle>
<PubmedArticle>
<MedlineCitation Status="Publisher" Owner="NLM">
<PMID Version="1">25013473</PMID>
<DateCreated>
<Year>2014</Year>
<Month>7</Month>
<Day>11</Day>
</DateCreated>....
....</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
My code is below:
library(XML)
library(rentrez)
# PMIDs to look up (stray leading spaces removed from two of the IDs)
PM.ID <- c("25506969", "25032371", "24983039", "24983034", "24983032", "24983031",
           "26386083", "26273372", "26066373", "25837167",
           "25466451", "25013473")
# rentrez function to retrieve the XML file for the above PMIDs
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
# If a record has no match for the child path, return NA
FindNull <- function(x, x1child) {
  res <- xpathSApply(x, x1child, xmlValue)
  if (length(res) == 0) {
    out <- NA
  } else {
    out <- res
  }
  out
}
# Extract contents from the XML file, one query per record type
xpathSApply(fetch.pubmed, "//PubmedArticle", FindNull, x1child = './/ArticleTitle')
xpathSApply(fetch.pubmed, "//PubmedBookArticle", FindNull, x1child = './/BookTitle')
How do I put the above code in a loop, so that I can retrieve values within PubmedArticle and PubmedBookArticle as and when the condition is met in each search?
There are a few ways you could do this, but I would probably get separate node sets for books and articles.
table( xpathSApply(fetch.pubmed, "/PubmedArticleSet/*", xmlName) )
PubmedArticle PubmedBookArticle
6 6
books <- getNodeSet(fetch.pubmed, "/PubmedArticleSet/PubmedBookArticle")
data.frame( pmid = sapply(books, function(x) xpathSApply(x, ".//PMID", xmlValue)),
title = sapply(books, function(x) xpathSApply(x, ".//BookTitle", xmlValue))
)
pmid title
1 25506969 Probe Reports from the NIH Molecular Libraries Program
2 25032371 Understanding Climate’s Influence on Human Evolution
3 24983039 Assessing the Effects of the Gulf of Mexico Oil Spill on Human Health: A Summary of the June 2010 Workshop
4 24983034 In the Light of Evolution: Volume IV: The Human Condition
5 24983032 The Role of Human Factors in Home Health Care: Workshop Summary
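The error in the question comes up because xmlName() works on nodes, while fetch.pubmed is a whole parsed document. Calling xmlName() on each child of the root avoids it and allows the if/else the question asks for. The following is only a minimal sketch under some assumptions: a small inline document stands in for the entrez_fetch() result, the article title is a placeholder rather than real PubMed data, and first_or_na and extract_record are hypothetical helper names.

```r
library(XML)

# Small stand-in for the document returned by entrez_fetch().
# The article title is a placeholder, not real PubMed data.
xml_text <- '<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">25506969</PMID>
      <Book><BookTitle>Probe Reports from the NIH Molecular Libraries Program</BookTitle></Book>
    </BookDocument>
  </PubmedBookArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">25013473</PMID>
      <Article><ArticleTitle>Placeholder article title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: first match of a relative XPath, or NA if none.
first_or_na <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# Hypothetical helper: choose the title path from the record's node name.
extract_record <- function(node) {
  type <- xmlName(node)  # works here because `node` is a node, not a document
  title_path <- if (type == "PubmedBookArticle") ".//BookTitle" else ".//ArticleTitle"
  data.frame(pmid  = first_or_na(node, ".//PMID"),
             type  = type,
             title = first_or_na(node, title_path),
             stringsAsFactors = FALSE)
}

# One node per record, whatever its type, in document order.
records <- getNodeSet(doc, "/PubmedArticleSet/*")
result  <- do.call(rbind, lapply(records, extract_record))
print(result)
```

Taking only the first PMID match per record is a deliberate choice in this sketch, since a record may contain more than one PMID element; each row then stays tied to a single record.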
The NCBI XML paths below help to extract abstracts from PubmedArticle and PubmedBookArticle records, and return NA for articles without abstracts.
abstracts <- xpathSApply(fetch.pubmed,
                         c('//PubmedArticle//Article', '//PubmedBookArticle//Abstract'),
                         function(x) { xmlValue(xmlChildren(x)$Abstract) })
abstracts <- data.frame(abstracts, stringsAsFactors = F)
dim(abstracts)
rownames(abstracts) <- PM.ID
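A possible variation is to loop over the records and read the PMID and the abstract from the same node, so each abstract is paired with its own PMID and nothing depends on the order of PM.ID. This is a hedged sketch, not the method used above: the inline document, its PMIDs (1, 2, 3), and the abstract texts are placeholders, and get_abstract is a hypothetical helper name.

```r
library(XML)

# Placeholder document: one article with an abstract, one without, one book.
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><Abstract><AbstractText>Placeholder article abstract.</AbstractText></Abstract></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article><ArticleTitle>No abstract here</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedBookArticle>
    <BookDocument>
      <PMID>3</PMID>
      <Abstract><AbstractText>Placeholder book abstract.</AbstractText></Abstract>
    </BookDocument>
  </PubmedBookArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# One node per record, regardless of record type.
records <- getNodeSet(doc, "/PubmedArticleSet/*")

# Hypothetical helper: join all AbstractText parts, or NA if there are none.
get_abstract <- function(node) {
  parts <- xpathSApply(node, ".//Abstract/AbstractText", xmlValue)
  if (length(parts) == 0) NA_character_ else paste(parts, collapse = " ")
}

abstracts <- data.frame(
  pmid     = sapply(records, function(x) xpathSApply(x, ".//PMID", xmlValue)[[1]]),
  abstract = sapply(records, get_abstract),
  stringsAsFactors = FALSE
)
print(abstracts)
```

Because every record contributes exactly one row, records without an abstract show up as NA instead of being dropped.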