Extracting data from an HTML page using R

Question

I am trying to extract data from the following site:

https://www.zomato.com/ncr/restaurants/north-indian

using R. I am a learner and a beginner in this field!

This is what I tried:

> library(XML)

> doc<-htmlParse("the url mentioned above")

> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian'

That was one attempt. I also tried readLines(), with the following output:

> readLines("the URL as mentioned above") [i can't specify more than two links so typing this]

> Error in file(con, "r") : cannot open the connection

> In addition: Warning message:

> In file(con, "r") : unsupported URL scheme

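I understand from the error that the page is not XML, but is there another way to capture data from this site? I did try using HTML Tidy to convert it to XML or XHTML and then process that, but I got nowhere. Maybe I don't know the proper way to use HTML Tidy! :( Not sure! Please suggest a way to solve this, and correct me if anything here is wrong.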

Answer 1

The rvest package is also very handy (it was built on top of the XML package, among others; newer releases use xml2 instead):

library(rvest)

# read_html() replaces html(), which is deprecated in newer rvest releases
pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")

# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()

##  [1] "Bukhara - ITC Maurya "                "Karim's "                            
##  [3] "Gulati "                              "Dhaba By Claridges "                 
## ...
## [27] "Dum-Pukht - ITC Maurya "              "Maal Gaadi "                         
## [29] "Sahib Sindh Sultan "                  "My Bar & Restaurant "                

# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)

##  [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
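
The two extractions above can be combined into a single table. The following is only a minimal sketch: the inline `sample_html` string is an invented stand-in for the live page (whose markup may have changed since this answer was written), and only the two class names are taken from the selectors above. It also assumes every restaurant has exactly one title and one rating, so that the two vectors line up by position.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page; the class names mirror
# the selectors used above, but the markup itself is invented.
sample_html <- "<html><body>
  <article>
    <a class='result-title' href='#'>Bukhara - ITC Maurya </a>
    <div class='rating-div'> 4.3 </div>
  </article>
  <article>
    <a class='result-title' href='#'>Gulati </a>
    <div class='rating-div'> 4.2 </div>
  </article>
</body></html>"

pg <- read_html(sample_html)

# Trim surrounding whitespace from each extracted string
name_text   <- trimws(html_text(html_nodes(pg, "a.result-title")))
rating_text <- trimws(html_text(html_nodes(pg, "div.rating-div")))

# One row per restaurant; ratings converted from text to numbers
restaurants <- data.frame(
  name   = name_text,
  rating = as.numeric(rating_text),
  stringsAsFactors = FALSE
)

print(restaurants)
```

If some entries on the real page lack a rating, pairing by position would misalign the rows; in that case it is safer to select each restaurant's container node first and extract the title and rating inside it.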

Answer 2

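I suggest using getURL from the RCurl package to fetch the document content, and then parsing it with htmlParse. htmlParse sometimes has trouble handling certain content directly; in those cases, getURL is recommended.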

url <- "https://www.zomato.com/ncr/restaurants/north-indian"

library(RCurl)
library(XML)

content <- getURL(url)
doc <- htmlParse(content)

summary(doc)
# $nameCounts
# 
#      div        a       li     span    input  article       h3     meta 
#     1337      362      232      212       33       30       30       27 
#      img   script       ul     link  section        p       br     form 
#       26       21       20       17        7        6        3        3 
#     body   footer       h1     head   header     html noscript       ol 
#        1        1        1        1        1        1        1        1 
#   strong textarea    title 
#        1        1        1 
# 
# $numNodes
# [1] 2377

Also note that readLines does not support https here, which explains the "unsupported URL scheme" error.
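
Once the document is parsed, individual values can be pulled out with XPath. This is a rough sketch rather than the answerer's own code: the `content` string below is an invented stand-in for what getURL(url) returns, and the `result-title` class is borrowed from the other answer as an assumption about the page's markup.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL(url)
content <- "<html><body>
  <h3><a class='result-title' href='/r/1'>Bukhara - ITC Maurya </a></h3>
  <h3><a class='result-title' href='/r/2'>Gulati </a></h3>
</body></html>"

# asText = TRUE states explicitly that `content` is markup, not a file name or URL
doc <- htmlParse(content, asText = TRUE)

# XPath query for the anchor nodes; xmlValue returns each node's text
titles <- xpathSApply(doc, "//a[@class='result-title']", xmlValue)
titles <- trimws(titles)

# Attributes are read the same way, using xmlGetAttr
links <- xpathSApply(doc, "//a[@class='result-title']", xmlGetAttr, "href")

print(titles)
print(links)
```

Passing asText = TRUE is optional when the string clearly looks like HTML, but it removes any ambiguity about whether the argument is content or a location.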

Solution 1
4 (accepted) 2014-12-04 17:32:44

Solution 2
2 2014-12-04 17:27:44
