
Using readtext to extract text from XML

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml . I'm interested in the text within the tag "regtext" in this and other similar XML files.

I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:

regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
  Start tag expected, '<' not found [4]

I've searched for the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated. I'm a novice, and the readtext documentation does not give examples of how to work with XML.

Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:

> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) : 
  The xml format does not fit for the extraction without xPath
  Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I tried to specify an XPath expression on a single file. This did not return any errors, but it didn't actually extract any text (even though there should be plenty of text within the "regtext" node):

> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")

I end up with a dataframe with the correct doc_id, but no text.

From the error messages, the readtext function appears to be passing the XML file along as plain text, and the underlying XML parser is not accepting it as a valid document.
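One way to check that guess is to try parsing each file directly with xml2 and see which ones fail. This is only a sketch: parses_as_xml is a helper name I made up, and the two temporary files stand in for the real directory contents.

```r
library(xml2)

# Hypothetical helper: TRUE if xml2 can parse the file, FALSE otherwise
parses_as_xml <- function(path) {
  tryCatch({
    read_xml(path)
    TRUE
  }, error = function(e) FALSE)
}

# Stand-in files (made up for illustration): one XML document, one plain text
demo_dir <- tempfile("xmlcheck")
dir.create(demo_dir)
writeLines("<RULE><REGTEXT>Some text</REGTEXT></RULE>", file.path(demo_dir, "good.xml"))
writeLines("this file is not XML", file.path(demo_dir, "bad.xml"))

files <- list.files(demo_dir, full.names = TRUE)
ok <- vapply(files, parses_as_xml, logical(1), USE.NAMES = FALSE)

# Names of the files that could not be parsed as XML
basename(files)[!ok]
```

Pointing list.files() at the real download directory instead of demo_dir would show whether anything other than well-formed XML is sitting in it.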

XML element names are case-sensitive, so it is also likely that the parser is treating "regtext" and "REGTEXT" as two different names. The file uses the uppercase form.
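A small illustration of the case issue, using a made-up two-node document rather than the real Federal Register file (the RULE root element here is just an assumption for the example). It also shows why an absolute path such as "/regtext/*" comes back empty: it only matches when the root element itself is named regtext.

```r
library(xml2)

# Made-up stand-in document, not the real file
doc <- read_xml("<RULE><REGTEXT>Part 1 text</REGTEXT><REGTEXT>Part 2 text</REGTEXT></RULE>")

# Lowercase name: no match, because XPath name tests are case-sensitive
lower <- xml_find_all(doc, ".//regtext")

# Absolute path from the question: no match, the root here is RULE
absolute <- xml_find_all(doc, "/regtext/*")

# Uppercase name searched anywhere below the root: both nodes are found
upper <- xml_find_all(doc, ".//REGTEXT")

length(lower)
length(absolute)
length(upper)
```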

Here is a solution using the xml2 package (I find that it provides a simpler interface and is easier to use).

library(xml2)

url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)

# Find all nodes named "REGTEXT" anywhere in the document
regtext <- xml_find_all(page, ".//REGTEXT")

# Convert the REGTEXT nodes into a character vector
xml_text(regtext)
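Since the question is about a folder of saved files, the same idea can be looped over a directory to build a data frame with a doc_id and a text column, similar in shape to what readtext returns. This is a rough sketch under some assumptions: extract_regtext is a name I made up, the temporary files stand in for the saved downloads, and all REGTEXT sections of a file are pasted into one string.

```r
library(xml2)

# Hypothetical helper: all REGTEXT text from one file, joined into one string
extract_regtext <- function(path) {
  doc <- read_xml(path)
  nodes <- xml_find_all(doc, ".//REGTEXT")
  paste(xml_text(nodes), collapse = "\n")
}

# Stand-in directory with two made-up files; replace with the real folder
xml_dir <- tempfile("regdata")
dir.create(xml_dir)
writeLines("<RULE><REGTEXT>First rule text</REGTEXT></RULE>", file.path(xml_dir, "a.xml"))
writeLines("<RULE><PREAMB>No regulatory text</PREAMB></RULE>", file.path(xml_dir, "b.xml"))

files <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)

regtext_df <- data.frame(
  doc_id = basename(files),
  text = vapply(files, extract_regtext, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

regtext_df
```

A file with no REGTEXT node ends up with an empty string in the text column rather than raising an error, which makes it easy to spot documents where nothing was extracted.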
