Extracting data from XML and tab-delimited text

Question

I would like to extract data from this text blob. The text contains both tab-delimited fields and XML-tagged text. I would like to pull out each XML blob and parse it separately for my analysis.

Text1   Text2   text3   text4   text4   <Assessment>
  <Questions>
    <Question>
      <Id>1</Id>
      <Key>Instructions</Key>
      <QuestionText>Your Age</QuestionText>
      <QuestionType>Label</QuestionType>
      <Answer>16-30</Answer>
    </Question>
  </Questions>
</Assessment>   text5
Text1   Text2   text3   text4   text4   <Assessment>
  <Questions>
    <Question>
      <Id>1</Id>
      <Key>Instructions</Key>
      <QuestionText>Your Age</QuestionText>
      <QuestionType>Label</QuestionType>
      <Answer>31-49</Answer>
    </Question>
  </Questions>
</Assessment>   text5

I read the text using readLines and did the following.

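# Strip leading whitespace from every line
tst <- gsub("^\\s+", "", tst)
# Lines with the opening tag: drop the tab-delimited fields before it
idx <- grep("<Assessment>", tst, fixed = TRUE)
tst[idx] <- "<Assessment>"
# Lines with the closing tag: drop the trailing field after it
idx <- grep("</Assessment>", tst, fixed = TRUE)
tst[idx] <- "</Assessment>"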

I still haven't figured out how to parse the result using the XML package.

Answer 1

You may want to try getNodeSet from the XML package: http://www.inside-r.org/packages/cran/xml/docs/matchNamespaces
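To make that suggestion concrete, here is a minimal sketch of one way it could go, not necessarily what the answerer had in mind. It assumes the XML package is installed and that each record holds exactly one <Assessment>...</Assessment> block. The sample lines in `tst` are a shortened, hypothetical stand-in for the output of readLines, and the regex extraction step is my own addition rather than part of the answer.

```r
library(XML)

# Hypothetical stand-in for readLines() output: two records, each with
# tab-delimited fields around an <Assessment> block (shortened from the question).
tst <- c(
  "Text1\tText2\ttext3\ttext4\ttext4\t<Assessment>",
  "  <Questions>",
  "    <Question>",
  "      <Id>1</Id>",
  "      <Answer>16-30</Answer>",
  "    </Question>",
  "  </Questions>",
  "</Assessment>\ttext5",
  "Text1\tText2\ttext3\ttext4\ttext4\t<Assessment>",
  "  <Questions>",
  "    <Question>",
  "      <Id>1</Id>",
  "      <Answer>31-49</Answer>",
  "    </Question>",
  "  </Questions>",
  "</Assessment>\ttext5"
)

# Collapse the lines into one string so each XML block can be matched as a whole.
txt <- paste(tst, collapse = "\n")

# (?s) lets "." match newlines; the lazy ".*?" stops at the first closing tag.
blobs <- regmatches(
  txt,
  gregexpr("(?s)<Assessment>.*?</Assessment>", txt, perl = TRUE)
)[[1]]

# Parse one blob and return the text of every Answer node found via XPath.
parse_answers <- function(blob) {
  doc <- xmlParse(blob, asText = TRUE)
  nodes <- getNodeSet(doc, "//Question/Answer")
  sapply(nodes, xmlValue)
}

# One Answer per record in this sample, so the result is a character vector.
answers <- as.character(unlist(lapply(blobs, parse_answers)))
print(answers)
```

Matching the whole block with a regex avoids overwriting lines in place, so the tab-delimited fields stay available in `txt` if they are needed later. Other elements (Id, Key, QuestionText) can be read the same way by changing the XPath expression.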

Solution 1
1 2015-12-15 09:48:17
