Extracting data from html pages using R

I am trying to extract data from the following site:

https://www.zomato.com/ncr/restaurants/north-indian

using R programming. I am a learner and a beginner in this field!

I tried this:

> library(XML)

> doc<-htmlParse("the url mentioned above")

> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian' 

That was one attempt. I also tried readLines(), with the following output:

> readLines("the URL as mentioned above") [i can't specify more than two links so typing this]

> Error in file(con, "r") : cannot open the connection

> In addition: Warning message:

> In file(con, "r") : unsupported URL scheme

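I understand from the error that the page is not XML, but is there another way I can capture data from this site? I did try using tidy html to convert it to XML or XHTML and then process it, but I got nowhere; perhaps I don't know the actual process for using tidy html. Not sure! Please suggest a way to solve this, and correct me if anything above is wrong.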

The rvest package is also pretty handy (and is built on top of the XML package, among other packages):

library(rvest)

# read_html() replaces html(), which later rvest releases deprecated
pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")

# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()

##  [1] "Bukhara - ITC Maurya "                "Karim's "                            
##  [3] "Gulati "                              "Dhaba By Claridges "                 
## ...
## [27] "Dum-Pukht - ITC Maurya "              "Maal Gaadi "                         
## [29] "Sahib Sindh Sultan "                  "My Bar & Restaurant "                

# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)

##  [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
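
To keep each name next to its rating, the two vectors can be combined into a data frame. The following is only a minimal sketch: it assumes the selectors above still match, and it parses a small made-up HTML snippet (the `snippet` string and its markup are hypothetical) rather than the live page, so it runs offline.

```r
library(rvest)

# Hypothetical stand-in for the live page; the markup is an assumption
# that only mirrors the a.result-title and div.rating-div selectors.
snippet <- '<html><body>
  <article>
    <a class="result-title" href="#">Bukhara - ITC Maurya </a>
    <div class="rating-div"> 4.3 </div>
  </article>
  <article>
    <a class="result-title" href="#">Gulati </a>
    <div class="rating-div"> 4.2 </div>
  </article>
</body></html>'

pg <- read_html(snippet)

# Same selectors as above; trimws() drops the trailing spaces in the names
name <- pg %>% html_nodes("a.result-title") %>% html_text() %>% trimws()

# Strip whitespace, then convert the rating text to numbers
rating <- pg %>% html_nodes("div.rating-div") %>% html_text() %>%
  gsub("[[:space:]]", "", .) %>% as.numeric()

# Pairing by position is only safe if every restaurant has both nodes
restaurants <- data.frame(name = name, rating = rating,
                          stringsAsFactors = FALSE)
print(restaurants)
```

On the real page it would be worth checking that both vectors have the same length before building the data frame, since a restaurant without a rating would shift the pairing.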

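I recommend getURL from the RCurl package to obtain the document content. Then we can parse it using htmlParse. Sometimes htmlParse has trouble fetching certain content itself; in that case, it is recommended to use getURL first.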

url <- "https://www.zomato.com/ncr/restaurants/north-indian"

library(RCurl)
library(XML)

content <- getURL(url)
doc <- htmlParse(content)

summary(doc)
# $nameCounts
# 
#      div        a       li     span    input  article       h3     meta 
#     1337      362      232      212       33       30       30       27 
#      img   script       ul     link  section        p       br     form 
#       26       21       20       17        7        6        3        3 
#     body   footer       h1     head   header     html noscript       ol 
#        1        1        1        1        1        1        1        1 
#   strong textarea    title 
#        1        1        1 
# 
# $numNodes
# [1] 2377
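
Once the document is parsed, individual fields can be pulled out with XPath. This is a rough sketch rather than a tested scraper for the site: the class names `result-title` and `rating-div` are assumptions borrowed from the rvest answer, and `content` here is a hypothetical inline string standing in for what getURL(url) would return.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL(url)
content <- '<html><body>
  <a class="result-title" href="#">Bukhara - ITC Maurya </a>
  <div class="rating-div"> 4.3 </div>
  <a class="result-title" href="#">Gulati </a>
  <div class="rating-div"> 4.2 </div>
</body></html>'

# asText = TRUE tells htmlParse the argument is HTML text, not a file or URL
doc <- htmlParse(content, asText = TRUE)

# xmlValue returns the text content of each node matched by the XPath
rest_names <- xpathSApply(doc, "//a[contains(@class, 'result-title')]", xmlValue)
rest_ratings <- xpathSApply(doc, "//div[contains(@class, 'rating-div')]", xmlValue)

# Remove surrounding whitespace and convert the ratings to numbers
rest_names <- trimws(rest_names)
rest_ratings <- as.numeric(trimws(rest_ratings))

print(rest_names)
print(rest_ratings)
```

The contains(@class, ...) test is used instead of an exact match so that elements carrying several classes are still found.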

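Also, note that readLines does not support https (hence the "unsupported URL scheme" warning), so the error message is a bit less surprising.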
