I have a problem reading an XML file into R: the file does not have a .xml extension.
I would usually follow the approach below:
library(XML)
xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
Use the xmlTreeParse and readLines functions to parse the XML file:
xmlfile <- xmlTreeParse(readLines(xml.url))
However, I have no idea how to parse the content from the web page below. It has no .xml extension.
my_file <-
paste0("http://ec.europa.eu/public_opinion/cf/",
"exp_feed.cfm?keyID=1&nationID=",
"11,1,27,28,17,2,16,18,13,32,6,3,4,",
"22,33,7,8,20,21,9,23,31,34,24,12,19,",
"35,29,26,25,5,14,10,30,15,",
"&startdate=1973.09&enddate=",
"2014.06")
my_xml_file <- xmlTreeParse(readLines(my_file))
I get this error:
Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0x6F 0x6E 0x20
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0x6F 0x6E 0x20
So the web page has no extension, and parsing throws an encoding-related error. I tried the encoding argument of the functions above, with no luck.
Try getting it into R first using httr, and then let the content
function convert it into a more usable format:
library('httr')
library('XML')  # provides xmlToList
my_file <-
paste0("http://ec.europa.eu/public_opinion/cf/",
"exp_feed.cfm?keyID=1&nationID=",
"11,1,27,28,17,2,16,18,13,32,6,3,4,",
"22,33,7,8,20,21,9,23,31,34,24,12,19,",
"35,29,26,25,5,14,10,30,15,",
"&startdate=1973.09&enddate=",
"2014.06")
x <- GET(my_file)
z <- xmlToList(content(x))
Result:
> str(z, 3)
List of 1
$ Table:List of 2
..$ Grid :List of 35
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
.. ..$ AxisZ:List of 2
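Depending on the httr version, content() may return a different kind of object than xmlToList expects, so a more explicit variant can be sketched as follows. It assumes the feed is Latin-1 encoded (a guess based on the 0xE7 byte in the error message) and that the document carries no conflicting encoding declaration. read_feed_as_list is a hypothetical helper name, not part of httr or XML.

```r
# Hypothetical helper: fetch a URL and return its XML content as a nested list.
# Assumption: the server sends Latin-1 (ISO-8859-1) text without declaring it.
read_feed_as_list <- function(url, encoding = "ISO-8859-1") {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)                 # fail early on HTTP errors
  # Decode the body with an explicit encoding instead of relying on detection
  txt <- httr::content(resp, as = "text", encoding = encoding)
  doc <- XML::xmlParse(txt, asText = TRUE)    # parse the decoded string
  XML::xmlToList(doc)
}

# Usage (needs network access, so left commented out):
# z <- read_feed_as_list(my_file)
# str(z, 3)
```

Decoding the body as text first keeps the encoding decision in one place; if the guess is wrong, only the encoding argument needs to change.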
This isn't related to the lack of an .xml extension; that doesn't really matter.
The problem seems to be with the encoding of the file. Things seem to get funny in this region:
xx <- readLines(my_file)
xx[114633:114646]
The XML parser does not accept that as valid UTF-8.
You can convert the data in R with
yy <- iconv(xx, to="UTF-8")
my_xml_file <- xmlTreeParse(yy)
Note: iconv returns NA for any line containing bytes it cannot convert, so those lines are effectively taken out and you will be missing data. The rows that are lost are
which(is.na(yy))
# [1] 114637 114643 114685 114755 114776 114832
# [7] 114881 114895 114902 115422 115429 115436
so it's the same as
my_xml_file <- xmlTreeParse(xx[-which(is.na(yy))])
Luckily your file still parses without the missing lines.
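If the source encoding is known, the affected lines can be converted rather than lost. Here is a minimal sketch, assuming the feed is Latin-1 (where the byte 0xE7 from the error message is "ç"); the feed's actual encoding is not confirmed here, so treat the from value as a guess.

```r
# Bytes taken from the parser error message: 0xE7 0x6F 0x6E 0x20
bad <- rawToChar(as.raw(c(0xE7, 0x6F, 0x6E, 0x20)))

# Read as UTF-8, this byte sequence is invalid
is_valid_before <- validUTF8(bad)

# Assumption: the bytes are Latin-1, so state that explicitly when converting
fixed <- iconv(bad, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(fixed)

# Code points of the converted string: 231 is "ç", followed by "o", "n", space
utf8ToInt(fixed)
```

Applied to the lines read earlier, which(!validUTF8(xx)) would locate the offending rows, and iconv(xx, from = "latin1", to = "UTF-8") would convert all lines without producing NA, provided the Latin-1 assumption holds.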