Extract data from XML and tab-delimited text
I would like to extract data from this text blob. The text contains both tab-delimited text and XML-tagged text. I would like to extract the XML blob and parse it separately for my analysis.
Text1 Text2 text3 text4 text4 <Assessment>
<Questions>
<Question>
<Id>1</Id>
<Key>Instructions</Key>
<QuestionText>Your Age</QuestionText>
<QuestionType>Label</QuestionType>
<Answer>16-30</Answer>
</Question>
</Questions>
</Assessment> text5
Text1 Text2 text3 text4 text4 <Assessment>
<Questions>
<Question>
<Id>1</Id>
<Key>Instructions</Key>
<QuestionText>Your Age</QuestionText>
<QuestionType>Label</QuestionType>
<Answer>31-49</Answer>
</Question>
</Questions>
</Assessment> text5
I have read the text using readLines and did the following.
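# Strip leading whitespace from every line
tst <- gsub("^\\s+", "", tst)
# Reduce each line containing the opening tag to the bare tag (drops the tab-delimited prefix)
idx <- grep("<Assessment>", tst, fixed = TRUE)
tst[idx] <- "<Assessment>"
# Do the same for the closing tag (drops the trailing field)
idx <- grep("</Assessment>", tst, fixed = TRUE)
tst[idx] <- "</Assessment>"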
I still haven't figured out how to parse it using XML.
You may want to try getNodeSet from the XML package: http://www.inside-r.org/packages/cran/xml/docs/matchNamespaces
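A minimal sketch of how that suggestion might be applied to the data in the question. It assumes the XML package is installed, that every record contains exactly one <Assessment>...</Assessment> block, and that the fields around it are separated by tabs. The sample vector `tst`, the regular expression, and the XPath `//Question/Answer` are illustrative choices based on the data shown, not something the original answer spelled out.

```r
library(XML)  # assumption: installed via install.packages("XML")

# Hypothetical stand-in for the result of readLines() on the file
tst <- c(
  "Text1\tText2\ttext3\ttext4\ttext4\t<Assessment>",
  "<Questions>", "<Question>", "<Id>1</Id>", "<Key>Instructions</Key>",
  "<QuestionText>Your Age</QuestionText>", "<QuestionType>Label</QuestionType>",
  "<Answer>16-30</Answer>", "</Question>", "</Questions>",
  "</Assessment>\ttext5",
  "Text1\tText2\ttext3\ttext4\ttext4\t<Assessment>",
  "<Questions>", "<Question>", "<Id>1</Id>", "<Key>Instructions</Key>",
  "<QuestionText>Your Age</QuestionText>", "<QuestionType>Label</QuestionType>",
  "<Answer>31-49</Answer>", "</Question>", "</Questions>",
  "</Assessment>\ttext5"
)

# Join the lines into one string so each XML blob can be matched as a whole
txt <- paste(tst, collapse = "\n")

# (?s) lets "." match newlines; .*? keeps each match to a single record
m <- gregexpr("(?s)<Assessment>.*?</Assessment>", txt, perl = TRUE)
blobs <- regmatches(txt, m)[[1]]

# Parse each blob separately and pull the answer text with getNodeSet
answers <- lapply(blobs, function(b) {
  doc <- xmlParse(b, asText = TRUE)
  nodes <- getNodeSet(doc, "//Question/Answer")
  sapply(nodes, xmlValue)
})

print(unlist(answers))
```

Extracting the blobs with a regular expression leaves the original lines untouched, so the tab-delimited fields before and after each blob are still available if they are needed later. Other elements such as Id or QuestionText can be read the same way by changing the XPath.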