
How can I extract XML nodes and key values to a data.frame in RStudio, including NA values?

The data sample contains words (orth) and categories (prop key="sense:ukb:unitsstr"). I'd like to extract pairs of orth and prop key="sense:ukb:unitsstr" values, one pair per row of a data frame. However, some words may not have any prop data, like the last two records. For those words I'd like the category to be NA.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>ktoś</orth>
    <lex disamb="1"><base>ktoś</base><ctag>subst:sg:nom:m1</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">11511</prop>
    <prop key="sense:ukb:syns_rank">11511/128.6156573170 243094/95.1234745165</prop>
    <prop key="sense:ukb:unitsstr">ktoś.2(15:os)</prop>
   </tok>
   <tok>
    <orth>go</orth>
    <lex disamb="1"><base>go</base><ctag>subst:sg:nom:n</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">47620</prop>
    <prop key="sense:ukb:syns_rank">47620/108.9010709884 234524/90.4766173102</prop>
    <prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
   </tok>
   <tok>
    <orth>krokodyl</orth>
    <lex disamb="1"><base>krokodyl</base><ctag>subst:sg:nom:m2</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">12879</prop>
    <prop key="sense:ukb:syns_rank">12879/40.5162836207 254796/35.9915058408 7063215/33.3657479890 7063214/26.6770712118 7063217/25.5775738130 7063213/23.6851347572 7063212/23.6300037076</prop>
    <prop key="sense:ukb:unitsstr">krokodyl.1(21:zw) krokodyl_właściwy.1(21:zw)</prop>
   </tok>
   <tok>
    <orth>się</orth>
    <lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
   </tok>
   <tok>
    <orth>ja</orth>
    <lex disamb="1"><base>ja</base><ctag>ppron12:sg:nom:m1:pri</ctag></lex>
   </tok>

I assumed that I could get it with a few XPath queries, but I got stuck:

library(XML)
doc = xmlTreeParse("statsUCZESTxfreqkeyword xml.txt", useInternalNodes = TRUE)
top = xmlRoot(doc)
xmlName(top)
names(top) 
names( top[[ 1 ]] )
sent <- top[[ 1 ]] [[ "sentence" ]]
names(sent)
names(sent[[1]])
xmlSApply(sent[[1]], xmlValue)
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))
nodes = getNodeSet(top, "//prop[@key='sense:ukb:unitsstr']")
lapply(nodes, function(x) xmlSApply(x, xmlValue)) # 152 words have prop
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))

Here is a solution using the xml2 library. I find the syntax of xml2 easier than that of the XML package, though both have their advantages and disadvantages.
The logic is similar to the answer I provided here: rvest: Return NAs for empty nodes given multiple listings. The code's comments explain each step. In the code below, xmltext is either the XML text or the file name of the XML you would like to process.
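As a small illustration of the two input forms mentioned above, the sketch below reads the same document once from a string and once from a file. The one-token document and the temporary file are hypothetical stand-ins for the real data, not part of the original question.

```r
library(xml2)

# Hypothetical minimal document, used only for illustration
xml_string <- "<chunkList><tok><orth>go</orth></tok></chunkList>"

# Form 1: pass the XML text itself
from_text <- read_xml(xml_string)

# Form 2: pass a file path (a temporary file stands in for the real file)
path <- tempfile(fileext = ".xml")
writeLines(xml_string, path)
from_file <- read_xml(path)

# Both documents yield the same orth text
orth_text <- xml_text(xml_find_first(from_text, ".//orth"))
orth_file <- xml_text(xml_find_first(from_file, ".//orth"))
```

Either object can then be passed to xml_find_all() in the same way.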

library(xml2)

# read the XML (xmltext is the XML text or a file name)
page <- read_xml(xmltext)
# find every tok node; each one becomes a row of the result
listings <- xml_find_all(page, ".//tok")

# find the text associated with the orth node of each tok
orthtext <- xml_text(xml_find_first(listings, ".//orth"))

# find the text associated with prop key="sense:ukb:unitsstr" in each tok
ukb <- vapply(listings, function(x) {
  nodes <- xml_find_all(x, ".//prop")
  # keep only the node with the wanted key
  wantednode <- nodes[xml_attr(nodes, "key") == "sense:ukb:unitsstr"]
  # extract text
  wantedtext <- xml_text(wantednode)
  # return NA if the tok has no such node
  if (length(wantedtext) > 0) wantedtext[1] else NA_character_
}, character(1))

# create data frame
finalanswer <- data.frame(orthtext, ukb, stringsAsFactors = FALSE)
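
A more compact variant is possible because xml_find_first(), when applied to a node set, returns one result per input node and xml_text() gives NA where nothing matched (the same behaviour the orth lookup above relies on). The sketch below moves the key test into the XPath expression; the two-token inline sample and the column names orth and ukb are assumptions made for illustration.

```r
library(xml2)

# Hypothetical inline sample: the second tok has no unitsstr prop
xmltext <- '<chunkList><chunk id="ch1" type="p"><sentence id="s1">
  <tok><orth>go</orth><prop key="polarity">0</prop><prop key="sense:ukb:unitsstr">go.1(2:czy)</prop></tok>
  <tok><orth>ja</orth></tok>
</sentence></chunk></chunkList>'

page <- read_xml(xmltext)
toks <- xml_find_all(page, ".//tok")

# One lookup per tok; a tok without a matching child yields NA
result <- data.frame(
  orth = xml_text(xml_find_first(toks, "./orth")),
  ukb  = xml_text(xml_find_first(toks, "./prop[@key='sense:ukb:unitsstr']")),
  stringsAsFactors = FALSE
)
print(result)
```

This avoids the per-node anonymous function, at the cost of keeping only the first matching prop if a tok ever carried more than one.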
