
Read xml file without .xml extension into R

I have a problem reading an XML file into R. The problem is that the file does not have a .xml extension.

I would usually follow the approach described below:

library(XML)

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

Use the xmlTreeParse and readLines functions to parse the XML file:

xmlfile <- xmlTreeParse(readLines(xml.url))

However, I have no idea how to parse the content from the web page below. It has no .xml extension.

my_file <- 
  paste0("http://ec.europa.eu/public_opinion/cf/",
         "exp_feed.cfm?keyID=1&nationID=",
         "11,1,27,28,17,2,16,18,13,32,6,3,4,",
         "22,33,7,8,20,21,9,23,31,34,24,12,19,",
         "35,29,26,25,5,14,10,30,15,",
         "&startdate=1973.09&enddate=",
         "2014.06")

my_xml_file <- xmlTreeParse(readLines(my_file))

I get this error:

Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0x6F 0x6E 0x20
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0x6F 0x6E 0x20

So the web page has no extension, and parsing throws an encoding-related error. I tried the encoding argument of the functions above, with no luck.

Try getting it into R first using httr and then letting the content function spit it out into a more usable format:

library('httr')
library('XML')  # xmlToList() comes from the XML package
my_file <- 
  paste0("http://ec.europa.eu/public_opinion/cf/",
         "exp_feed.cfm?keyID=1&nationID=",
         "11,1,27,28,17,2,16,18,13,32,6,3,4,",
         "22,33,7,8,20,21,9,23,31,34,24,12,19,",
         "35,29,26,25,5,14,10,30,15,",
         "&startdate=1973.09&enddate=",
         "2014.06")
x <- GET(my_file)
z <- xmlToList(content(x))

Result:

> str(z, 3)
List of 1
 $ Table:List of 2
  ..$ Grid       :List of 35
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2

This isn't related to the lack of a .xml extension; that doesn't really matter.
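
As a small illustration that the parser looks at the content rather than the file name, the sketch below writes a tiny XML document to a temporary file with a .cfm extension and parses it. The sample document and the names tmp, doc and root_name are made up for this example; it assumes the XML package is installed.

```r
library(XML)

# Hypothetical sample: a tiny XML document saved WITHOUT a .xml extension
tmp <- tempfile(fileext = ".cfm")
writeLines("<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT></CATALOG>", tmp)

# The parser inspects the content, not the file name
doc <- xmlParse(tmp)
root_name <- xmlName(xmlRoot(doc))
print(root_name)

# Clean up the temporary file; the parsed document stays in memory
unlink(tmp)
```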

The problem seems to be with the encoding of the file. Things seem to get funny in this region:

xx <- readLines(my_file)
xx[114633:114646]

The XML parser does not accept those lines as valid UTF-8.

You can convert the data in R with

yy <- iconv(xx, to="UTF-8")
my_xml_file <- xmlTreeParse(yy)

Note: iconv returns NA for each line it cannot convert, so this takes out the lines with bad bytes. This means you will be missing data. The rows that are lost are

which(is.na(yy))
# [1] 114637 114643 114685 114755 114776 114832 
# [7] 114881 114895 114902 115422 115429 115436

so it's the same as

my_xml_file <- xmlTreeParse(xx[-which(is.na(yy))])

Luckily your file still parses without the missing lines.
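
If losing those lines is not acceptable, one possible alternative is to tell iconv what the source encoding is. The byte 0xE7 from the error message is "ç" in Latin-1 (ISO-8859-1), so the sketch below assumes the feed is Latin-1; the source does not confirm this. It uses only the four bytes from the error message, and the names bad and fixed are made up for the example.

```r
# The bytes reported in the parser error: 0xE7 0x6F 0x6E 0x20
bad <- rawToChar(as.raw(c(0xE7, 0x6F, 0x6E, 0x20)))

# Assumption: the source is Latin-1. Naming the source encoding lets
# iconv translate the byte instead of giving up on the whole line.
fixed <- iconv(bad, from = "latin1", to = "UTF-8")

print(is.na(fixed))      # was the string converted rather than dropped?
print(validUTF8(fixed))  # is the result valid UTF-8?
print(utf8ToInt(fixed))  # code points of the converted string
```

Applied to the feed, the same call would be iconv(xx, from = "latin1", to = "UTF-8"), which should leave no NA lines if the Latin-1 assumption holds.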
