
Using R2HTML with rvest/xml2

I was reading this blog post on the new xml2 package. Previously, rvest depended on XML, which made a lot of my work easier because I could combine functions from the two packages: e.g., I would use htmlParse from the XML package when I couldn't read an HTML page with html (now renamed read_html).

See this for an example; I could then use rvest functions like html_nodes and html_attr on the parsed page. Now that rvest depends on xml2, this is no longer possible (at least on the surface).

I was just wondering what the basic difference between XML and xml2 is. Other than crediting the author of the XML package, the post I mentioned earlier doesn't explain the differences between the two.

Another example:

library(R2HTML)  # to save the page as an HTML file and read it back later
library(XML)
k1 <- htmlParse("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml")
head(getHTMLLinks(k1), 5)  # This works

[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# But I want to save the HTML file in my working directory now and work with it later

HTML(k1, "k1")  # Later I can work with this
rm(k1)
# Read the stored html file k1 back in
head(getHTMLLinks("k1"), 5)  # This works too

[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# With read_html in the rvest package, this is not possible (as far as I know)
library(rvest)
library(R2HTML)
k2 <- read_html("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml")

# This works
df1 <- k2 %>%
  html_nodes("a") %>%
  html_attr("href")

head(df1,5)
[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# But I want to save the HTML file in my working directory now and work with it later
HTML(k2, "k2")  # Later I can work with this
rm(k2, df1)
# Now extract the links again by reading the stored k2 html file back in
# This doesn't work
k2 <- read_html("k2")

df1 <- k2 %>%
  html_nodes("a") %>%
  html_attr("href")

df1
character(0)

Updates:

# I have the following versions of the packages loaded:
lapply(c("rvest", "R2HTML", "xml2", "XML"), packageVersion)
[[1]]
[1] ‘0.2.0.9000’

[[2]]
[1] ‘2.3.1’

[[3]]
[1] ‘0.1.1’

[[4]]
[1] ‘3.98.1.2’

I am using Windows 8, R 3.2.1, and RStudio 0.99.441.

The R2HTML package just seems to run capture.output on the XML object and then write that back to disk. This doesn't seem like a robust way to save HTML/XML data. The reason the two behave differently is that XML documents and xml2 documents print differently: printing an XML document emits the full markup, while printing an xml2 document only shows a short summary of the object. You could define a method that calls as.character() rather than relying on capture.output:

HTML.xml_document <- function(x, ...) HTML(as.character(x), ...)
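
To see why the saved output differs in the first place, here is a small illustration (the inline HTML string and URL are made up for the example):

library(xml2)

# a tiny document parsed from a literal HTML string
doc <- read_html("<html><body><a href='https://example.com'>a link</a></body></html>")

capture.output(doc)  # only the printed summary of the object, no markup
as.character(doc)    # the full serialised markup, suitable for writing to a file

Writing the as.character() result to disk keeps the markup intact, which is what read_html() needs when the file is read back later.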

Or you could probably skip R2HTML altogether and write out the xml2 data directly with write_xml.
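
A minimal sketch of that route, assuming write_xml() is available in your xml2 version (the file name k2.html is just an example):

library(xml2)
library(rvest)

k2 <- read_html("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml")
write_xml(k2, "k2.html")  # serialise the parsed document to disk
rm(k2)

# later: read the saved file back in and extract the links as before
k2 <- read_html("k2.html")
head(k2 %>% html_nodes("a") %>% html_attr("href"), 5)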

And maybe the best approach would be to download the file first and then import it.

download.file("http://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml", "local.html")
k2 <- read_html("local.html")
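
From there the usual extraction should work on the local copy:

k2 %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  head(5)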
