
How to access values of sub-nodes (child) with different names in XML file?

I am trying to parse the xmlValue of certain child nodes from an NCBI XML file. For some PMIDs, however, the root node <PubmedArticleSet> contains different kinds of PubMed records: PubmedBookArticle and PubmedArticle. I would like to apply a condition: if (xmlName(fetch.pubmed) == "PubmedBookArticle"), extract certain values; else if (xmlName(fetch.pubmed) == "PubmedArticle"), extract other values. Finally, I want to build a data frame with both sets of values matched to their PMIDs. It seems simple, but xmlName(fetch.pubmed) throws the error no applicable method for 'xmlName' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')". Any help is appreciated, thank you.

<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2015//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_150101.dtd">
<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">25506969</PMID>
      <ArticleIdList>
        <ArticleId IdType="bookaccession">NBK259188</ArticleId>
      </ArticleIdList> ....

   ...... </BookDocument>
  </PubmedBookArticle>

  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">25013473</PMID>
      <DateCreated>
        <Year>2014</Year>
        <Month>7</Month>
        <Day>11</Day>
      </DateCreated>....

    ....</MedlineCitation>
    </PubmedArticle>
</PubmedArticleSet>

My code is below:

library(XML)
library(rentrez)

PM.ID <- c("25506969", "25032371", "24983039", "24983034", "24983032", "24983031",
           "26386083", "26273372", "26066373", "25837167",
           "25466451", "25013473")
# rentrez function to retrieve the XML file for the PMIDs above
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
# If empty records, return NA
FindNull <- function(x,x1child){
  res <- xpathSApply(x,x1child,xmlValue)
  if (length(res) == 0){
    out <- NA
  }else {
    out <- res
  }
  out
}

# extract contents from xml file
xpathSApply(fetch.pubmed, "//PubmedArticle", FindNull, x1child = './/ArticleTitle')

xpathSApply(fetch.pubmed, "//PubmedBookArticle", FindNull, x1child = './/BookTitle')

How do I put the above code in a loop, so that I can retrieve values within PubmedArticle and PubmedBookArticle as and when the condition is met in each search?

There are a few ways you could do this, but I would probably get separate node sets for books and articles.

table( xpathSApply(fetch.pubmed, "/PubmedArticleSet/*", xmlName) )
    PubmedArticle PubmedBookArticle 
                6                 6 

books <- getNodeSet(fetch.pubmed, "/PubmedArticleSet/PubmedBookArticle")

data.frame( pmid = sapply(books, function(x) xpathSApply(x, ".//PMID", xmlValue)),
           title = sapply(books, function(x) xpathSApply(x, ".//BookTitle", xmlValue))
)

      pmid                                                                                                      title
1 25506969                                                     Probe Reports from the NIH Molecular Libraries Program
2 25032371                                                       Understanding Climate’s Influence on Human Evolution
3 24983039 Assessing the Effects of the Gulf of Mexico Oil Spill on Human Health: A Summary of the June 2010 Workshop
4 24983034                                                  In the Light of Evolution: Volume IV: The Human Condition
5 24983032                                            The Role of Human Factors in Home Health Care: Workshop Summary
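
The error in the question arises because xmlName() is defined for nodes, not for the parsed document as a whole. A minimal sketch of the conditional loop the question asks for is shown below. It assumes a small inline XML string as a stand-in for the result of entrez_fetch(); the article title, the helper first_value, and the variable names doc, records and result are hypothetical, not part of the original code.

```r
library(XML)

# Hypothetical stand-in for the parsed entrez_fetch() result
xml_text <- '<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">25506969</PMID>
      <Book><BookTitle>Probe Reports from the NIH Molecular Libraries Program</BookTitle></Book>
    </BookDocument>
  </PubmedBookArticle>
  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">25013473</PMID>
      <Article><ArticleTitle>Example article title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# Return the first match of an XPath relative to a node, or NA if none
# (hypothetical helper, similar in spirit to FindNull in the question)
first_value <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# One node per record: every element directly under the root
records <- getNodeSet(doc, "/PubmedArticleSet/*")

rows <- lapply(records, function(node) {
  type <- xmlName(node)  # works here because 'node' is a node, not the document
  if (type == "PubmedBookArticle") {
    pmid  <- first_value(node, "./BookDocument/PMID")
    title <- first_value(node, ".//BookTitle")
  } else if (type == "PubmedArticle") {
    pmid  <- first_value(node, "./MedlineCitation/PMID")
    title <- first_value(node, ".//ArticleTitle")
  } else {
    pmid  <- NA_character_
    title <- NA_character_
  }
  data.frame(pmid = pmid, type = type, title = title, stringsAsFactors = FALSE)
})

# Combine the per-record rows into a single data frame
result <- do.call(rbind, rows)
print(result)
```

Because each record is processed on its own, the PMID and title in a row always come from the same record, whichever of the two types it is.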
  • The NCBI XML paths below extract abstracts from both PubmedArticle and PubmedBookArticle records, and give NA for articles without abstracts.

    abstracts <- xpathSApply(fetch.pubmed,
                             c('//PubmedArticle//Article', '//PubmedBookArticle//Abstract'),
                             function(x) { xmlValue(xmlChildren(x)$Abstract) })
    abstracts <- data.frame(abstracts, stringsAsFactors = F)
    dim(abstracts)
    rownames(abstracts) <- PM.ID
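
An alternative way to keep abstracts aligned with their records is to evaluate the XPath once per record and fall back to NA when nothing matches. The sketch below is only an illustration under stated assumptions: the inline XML, the placeholder PMIDs 1 to 3, and the names doc, records, first_value and result are all hypothetical.

```r
library(XML)

# Hypothetical stand-in data: one article with an abstract, one without,
# and one book record with an abstract (PMIDs are placeholders)
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article><Abstract><AbstractText>Article abstract.</AbstractText></Abstract></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article><ArticleTitle>No abstract here</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedBookArticle>
    <BookDocument>
      <PMID>3</PMID>
      <Abstract><AbstractText>Book abstract.</AbstractText></Abstract>
    </BookDocument>
  </PubmedBookArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# First match of a relative XPath, or NA when the record has no such node
first_value <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

records <- getNodeSet(doc, "/PubmedArticleSet/*")

result <- data.frame(
  # "./*/PMID" matches MedlineCitation/PMID and BookDocument/PMID alike
  pmid     = sapply(records, first_value, path = "./*/PMID"),
  abstract = sapply(records, first_value, path = ".//Abstract"),
  stringsAsFactors = FALSE
)
print(result)
```

Taking the PMID from the same node as the abstract avoids relying on the order of the PM.ID vector matching the order of records in the response.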
