lapply on list of xml objects

This is my first post here, so please forgive any mistakes with respect to the posting guidelines.

I'm trying to read in XML data from PubMed to extract data on author affiliations.

Each entry contains a set of nodes like so:

<AuthorList>
          <Author>
            <LastName>Serra-Blasco</LastName>
            <ForeName>Maria</ForeName>
            <Initials>M</Initials>
            <AffiliationInfo>
              <Affiliation>Department of Psychiatry, Hospital de la Santa Creu i Sant Pau, Biomedical Research Institute Sant Pau (IIB Sant Pau), Universitat Autònoma de Barcelona (UAB), Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Barcelona, Catalonia, Spain.</Affiliation>
            </AffiliationInfo>
          </Author>
          ...

I would like to end up with a data frame that contains each author's name and affiliation in a row.

I tried to do this using xpathSApply to select the nodes matching "//Author", and ended up with a list of XML nodes.

Further parsing is proving to be a problem. I've written code that works on an individual element of this list.

For example, if the list is authorlist:

I can extract an appropriate array for authorlist[[1]] using this function (which uses xpathSApply within the element).

But when I try to wrap lapply around this function, I get an error saying that it cannot perform xpathApply on a list. The exact error is:

Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "list"

I surmise that lapply subsets the list with the equivalent of [i], whereas what I need is [[i]]. Is there a way around this? Or will I have to rewrite with some other rules in mind?

I'm open to rewriting (this is just some goofing around on my part), but the problem seemed interesting. Hope you can help!

I prefer using the rvest package when working with HTML/XML files. Based on your simple example:

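library(rvest)
library(xml2)  # read_xml() and xml_text() come from xml2; rvest >= 1.0 no longer attaches it
myxml <- read_xml("author.xml")

# Pull each field as a character vector, one entry per matching node
lastname <- xml_text(xml_nodes(myxml, "LastName"))
firstname <- xml_text(xml_nodes(myxml, "ForeName"))
affiliation <- xml_text(xml_nodes(myxml, "Affiliation"))
# Assumes every author has exactly one of each field, so the vectors line up
df <- data.frame(firstname, lastname, affiliation)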

If the structure of the XML file changes (for example, if some authors have no affiliation, so the vectors differ in length), the call to data.frame will raise an error, and some additional work is required to parse the file properly.
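As a rough sketch of that additional work, one option is to select the Author nodes first and then look up each field relative to its author with xml2's xml_find_first(), which returns one result per input node and gives NA where the child is missing. The inline sample XML and the name author_df below are made up for illustration; they are not from the question's file.

```r
library(xml2)

# Hypothetical inline sample mirroring the <AuthorList> structure in the question;
# the second author deliberately has no affiliation.
xml_txt <- '<AuthorList>
  <Author>
    <LastName>Serra-Blasco</LastName>
    <ForeName>Maria</ForeName>
    <AffiliationInfo>
      <Affiliation>Department of Psychiatry</Affiliation>
    </AffiliationInfo>
  </Author>
  <Author>
    <LastName>Portella</LastName>
    <ForeName>Maria J</ForeName>
  </Author>
</AuthorList>'

doc <- read_xml(xml_txt)
authors <- xml_find_all(doc, "//Author")

# xml_find_first() keeps the output the same length as `authors`,
# so a missing child becomes NA instead of shortening the vector.
author_df <- data.frame(
  firstname   = xml_text(xml_find_first(authors, "./ForeName")),
  lastname    = xml_text(xml_find_first(authors, "./LastName")),
  affiliation = xml_text(xml_find_first(authors, ".//Affiliation")),
  stringsAsFactors = FALSE
)

print(author_df)
```

Because every column is derived from the same node set, the columns always have one entry per author, and data.frame no longer depends on each author having every field.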

It would help to show the code that produced the error, but you could try xmlToDataFrame:

library(XML)  # provides xmlParse(), xmlToDataFrame(), getNodeSet(), xpathSApply()
url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=23620451&rettype=XML"
doc <- xmlParse(url)

xmlToDataFrame(doc["//Author"])
           LastName ForeName Initials                   AffiliationInfo
1      Serra-Blasco    Maria        M Department of Psychiatry...Spain.
2          Portella  Maria J       MJ                              <NA>
3       Gómez-Ansón  Beatriz        B                              <NA>
...

If some nodes have zero or multiple matching tags, I usually write a helper function that returns NA for missing tags and joins multiple tags with a delimiter.

authors <- getNodeSet(doc, "//Author")

# Return NA when the path matches nothing; join multiple matches with ", "
xpath2 <- function(x, path) {
  y <- xpathSApply(x, path, xmlValue)
  if (length(y) == 0) NA_character_ else paste(y, collapse = ", ")
}

last <- sapply(authors, xpath2, ".//LastName")
aff <- sapply(authors, xpath2, ".//Affiliation")
data.frame(last, aff)
               last                               aff
1      Serra-Blasco Department of Psychiatry...Spain.
2          Portella                              <NA>
3       Gómez-Ansón                              <NA>
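On the lapply part of the question: lapply does hand each element to the function as X[[i]], not X[i], so the "class list" error more likely means the object being iterated over is itself a list of lists, or that the function refers to the whole list rather than its argument. That is a guess, since the failing code was not posted. Here is a minimal sketch of lapply over a node set with the XML package; the sample XML and the helper name author_row are hypothetical.

```r
library(XML)

# Hypothetical inline sample; the second author has no affiliation.
xml_txt <- '<AuthorList>
  <Author>
    <LastName>Serra-Blasco</LastName>
    <ForeName>Maria</ForeName>
    <AffiliationInfo>
      <Affiliation>Department of Psychiatry</Affiliation>
    </AffiliationInfo>
  </Author>
  <Author>
    <LastName>Portella</LastName>
    <ForeName>Maria J</ForeName>
  </Author>
</AuthorList>'

doc <- xmlParse(xml_txt, asText = TRUE)
authors <- getNodeSet(doc, "//Author")

# Hypothetical helper: builds a one-row data frame from a single <Author> node.
# Paths start with "./" so the search stays inside that node.
author_row <- function(node) {
  get1 <- function(path) {
    y <- xpathSApply(node, path, xmlValue)
    if (length(y) == 0) NA_character_ else paste(y, collapse = ", ")
  }
  data.frame(
    last = get1("./LastName"),
    fore = get1("./ForeName"),
    aff  = get1(".//Affiliation"),
    stringsAsFactors = FALSE
  )
}

# lapply calls author_row(authors[[i]]) for each i, i.e. on a single node.
rows <- lapply(authors, author_row)
author_df <- do.call(rbind, rows)

print(author_df)
```

If authorlist was built per article (one list of Author nodes for each document), then each element is a list, and the function would need a nested lapply or the list flattened first with unlist(authorlist, recursive = FALSE).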
