I tried to extract data from the following site using R (I am a learner and a beginner in this field):
https://www.zomato.com/ncr/restaurants/north-indian
This is what I tried:
> library(XML)
> doc<-htmlParse("the url mentioned above")
> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian'
That was one attempt. I also tried readLines(), which produced the following output:
> readLines("the URL as mentioned above") [I can't post more than two links, so I typed this instead]
> Error in file(con, "r") : cannot open the connection
> In addition: Warning message:
> In file(con, "r") : unsupported URL scheme
I understand from the error that the page is not XML, but what other way is there to capture the data from this site? I did try HTML Tidy to convert the page to XML or XHTML and then work with that, but I got nowhere; maybe I don't yet know how to use HTML Tidy properly. Please suggest how to solve this, and point out any corrections.
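The rvest package is also pretty handy (and built on top of the XML package, amongst other packages; current versions use xml2 instead):
library(rvest)
# read_html() replaces html(), which was removed in later rvest versions
pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")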
# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()
## [1] "Bukhara - ITC Maurya " "Karim's "
## [3] "Gulati " "Dhaba By Claridges "
## ...
## [27] "Dum-Pukht - ITC Maurya " "Maal Gaadi "
## [29] "Sahib Sindh Sultan " "My Bar & Restaurant "
# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)
## [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
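The two vectors above can be combined into a single table. This is a minimal sketch, assuming the selectors `a.result-title` and `div.rating-div` each match once per restaurant and in the same order. The inline HTML is a made-up stand-in for the real page so the example runs offline, and `sample_html`, `rest_names` and `restaurants` are illustrative names, not part of rvest:

```r
library(rvest)

# Hypothetical stand-in for the downloaded page (not real Zomato markup)
sample_html <- '<html><body>
  <article><a class="result-title">Bukhara - ITC Maurya </a>
           <div class="rating-div"> 4.3 </div></article>
  <article><a class="result-title">Karim\'s </a>
           <div class="rating-div"> 4.1 </div></article>
</body></html>'

pg <- read_html(sample_html)

# Extract names and ratings, trimming surrounding whitespace
rest_names <- pg %>% html_nodes("a.result-title") %>% html_text() %>% trimws()
ratings    <- pg %>% html_nodes("div.rating-div") %>% html_text() %>% trimws()

# Combine; this assumes both vectors have the same length and order
restaurants <- data.frame(name = rest_names,
                          rating = as.numeric(ratings),
                          stringsAsFactors = FALSE)
```

If the two selectors ever return different numbers of nodes, `data.frame` will fail or recycle values, so it is worth comparing `length(rest_names)` and `length(ratings)` first.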
I would recommend getURL from the RCurl package to get the document content. Then we can parse that with htmlParse. Sometimes htmlParse has trouble fetching certain content itself (here, a page served over https). In that case it's recommended to download the text with getURL first and pass it to htmlParse.
url <- "https://www.zomato.com/ncr/restaurants/north-indian"
library(RCurl)
library(XML)
content <- getURL(url)
doc <- htmlParse(content)
summary(doc)
# $nameCounts
#
#      div        a       li     span    input  article       h3     meta
#     1337      362      232      212       33       30       30       27
#      img   script       ul     link  section        p       br     form
#       26       21       20       17        7        6        3        3
#     body   footer       h1     head   header     html noscript       ol
#        1        1        1        1        1        1        1        1
#   strong textarea    title
#        1        1        1
#
# $numNodes
# [1] 2377
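Once the document is parsed, individual values can be pulled out with XPath. This is a rough sketch rather than a tested recipe for the live site: the inline markup is a hypothetical stand-in for the text returned by getURL, and the class name `result-title` is borrowed from the rvest answer and assumed to still be present on the page:

```r
library(XML)

# Hypothetical stand-in for the text returned by getURL(url)
content <- '<html><body>
  <a class="result-title">Gulati </a>
  <a class="result-title">Maal Gaadi </a>
</body></html>'

# asText = TRUE tells htmlParse the argument is markup, not a file name or URL
doc <- htmlParse(content, asText = TRUE)

# Pull the text of every matching <a> node with an XPath query
titles <- xpathSApply(doc, "//a[@class='result-title']", xmlValue)
titles <- trimws(titles)
```

Note that `@class='result-title'` matches only when the class attribute is exactly that string; an element carrying several classes would need `contains(@class, 'result-title')` instead.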
Also, note that readLines does not support https, so that error message is a little less shocking.
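If line-by-line access is still wanted, one option is to fetch the text some other way (for example with getURL) and hand it to readLines through a text connection. A small sketch, with a made-up string standing in for the downloaded page and `page_lines` as an illustrative name:

```r
# Hypothetical stand-in for text fetched with getURL(url)
content <- "<html>\n<head><title>Example</title></head>\n<body></body>\n</html>"

# readLines can read from a text connection instead of opening the URL itself
con <- textConnection(content)
page_lines <- readLines(con)
close(con)

length(page_lines)
```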