Extracting text from XML using readtext

Question

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following one: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml (there are others in the same format). I'm interested in the text within the tag "regtext" in this and other similar XML files.

I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:

regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
  Start tag expected, '<' not found [4]

I've tried searching for the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv and .docx, but for some reason it doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated. I'm very much a novice, and none of the readtext documentation gives examples of how to work with XML.

Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:

> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) : 
  The xml format does not fit for the extraction without xPath
  Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I tried to specify an XPath expression on a single file. This did not return any errors, but it didn't actually extract any text (even though there should be plenty of text within the "regtext" node):

> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")

I end up with a data frame with the correct doc_id, but no text.

Answer 1

From the error messages, the readtext function appears to be converting the XML file into a plain text document, and the XML package is not accepting it as a valid document.

It is also likely that the XML parser is differentiating between "regtext" and "REGTEXT". XML element names are case-sensitive, and the file uses the uppercase form.
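To make the case-sensitivity point concrete, here is a minimal sketch using xml2 on a tiny made-up document. Only the REGTEXT tag name comes from the question; the RULE, PREAMB and P elements and their contents are assumptions for illustration, not the real Federal Register layout. It also suggests why an absolute path such as "/regtext/*" can come back empty: it only matches when the root element itself is named regtext.

```r
library(xml2)

# Hypothetical miniature document; only the REGTEXT tag name is from the question.
doc <- read_xml(
  "<RULE><PREAMB>Summary</PREAMB><REGTEXT><P>Section one.</P><P>Section two.</P></REGTEXT></RULE>"
)

# Absolute path with a lowercase name: the root here is RULE, so nothing matches.
n_absolute <- length(xml_find_all(doc, "/regtext/*"))

# Descendant search, still lowercase: no match, because names are case-sensitive.
n_lowercase <- length(xml_find_all(doc, "//regtext"))

# Descendant search with the tag name written exactly as it appears in the file.
regtext_text <- xml_text(xml_find_all(doc, "//REGTEXT"))

n_absolute    # 0
n_lowercase   # 0
regtext_text  # "Section one.Section two."
```

Note that xml_text() concatenates the text of all child elements without a separator, so the paragraphs run together.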

Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use.)

library(xml2)

url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)

# Find every REGTEXT node anywhere in the document
regtext <- xml_find_all(page, ".//REGTEXT")

# Convert the REGTEXT nodes into a character vector, one string per node
xml_text(regtext)
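Since the files in the question are already saved locally, the same approach can be applied to a whole folder. The following is a rough sketch rather than a tested recipe for the real data: the temporary directory, the two sample files and the helper name extract_regtext are all hypothetical stand-ins, and the doc_id and text columns simply mirror the data frame layout described in the question.

```r
library(xml2)

# Stand-in for the folder of saved files: two tiny hypothetical documents
# are written to a temporary directory so the sketch runs on its own.
xml_dir <- file.path(tempdir(), "regdata_example")
dir.create(xml_dir, showWarnings = FALSE)
writeLines("<RULE><REGTEXT><P>First rule text.</P></REGTEXT></RULE>",
           file.path(xml_dir, "a.xml"))
writeLines("<RULE><REGTEXT><P>Part one.</P></REGTEXT><REGTEXT><P>Part two.</P></REGTEXT></RULE>",
           file.path(xml_dir, "b.xml"))

# Hypothetical helper: return all REGTEXT text in one file as a single string,
# with one line per REGTEXT node.
extract_regtext <- function(path) {
  nodes <- xml_find_all(read_xml(path), ".//REGTEXT")
  paste(xml_text(nodes), collapse = "\n")
}

# Build one row per file.
files <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)
regtext_df <- data.frame(
  doc_id = basename(files),
  text = vapply(files, extract_regtext, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
regtext_df
```

To try it on the real files, point xml_dir at the folder that holds them and drop the two writeLines() calls. A file with no REGTEXT node yields an empty string in the text column.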
