Extracting data from HTML pages using R
I am trying to extract data from the following site:
https://www.zomato.com/ncr/restaurants/north-indian
using R. I am a learner and a beginner in this field!
I tried the following:
> library(XML)
> doc<-htmlParse("the url mentioned above")
> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian'
That was one attempt... I also tried readLines(), and the output is as follows:
> readLines("the URL as mentioned above") [I can't post more than two links, so I am typing this instead]
> Error in file(con, "r") : cannot open the connection
> In addition: Warning message:
> In file(con, "r") : unsupported URL scheme
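I understand that the page is not XML, as the error shows, but is there another way I can capture data from this site? I did try using HTML Tidy to convert it to XML or XHTML and then process it, but I got nowhere; perhaps I don't know the actual process of using HTML Tidy! :( Not sure! Please suggest a way to solve this, and correct me where I am wrong.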
The rvest package is also very handy (and is built on top of the XML package, among others):
library(rvest)
# read_html() supersedes the older html(), which was deprecated in later rvest releases
pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")
# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()
## [1] "Bukhara - ITC Maurya " "Karim's "
## [3] "Gulati " "Dhaba By Claridges "
## ...
## [27] "Dum-Pukht - ITC Maurya " "Maal Gaadi "
## [29] "Sahib Sindh Sultan " "My Bar & Restaurant "
# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)
## [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
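If you want the names and ratings side by side, one option is to pair the two vectors in a data frame. This is only a sketch: it runs on a small inline snippet (made-up markup that reuses the same class names as the selectors above) rather than the live page, and it assumes every restaurant has exactly one rating, so that pairing by position is safe.

```r
library(rvest)

# Hypothetical stand-in markup; the real page structure may differ
snippet <- "<html><body>
  <div><a class='result-title'>Bukhara - ITC Maurya </a><div class='rating-div'> 4.3 </div></div>
  <div><a class='result-title'>Karim's </a><div class='rating-div'> 4.1 </div></div>
</body></html>"

pg <- read_html(snippet)

# Same selectors as above, with surrounding whitespace trimmed
restaurant_names <- pg %>% html_nodes("a.result-title") %>% html_text() %>% trimws()
ratings <- pg %>% html_nodes("div.rating-div") %>% html_text() %>% trimws() %>% as.numeric()

# Pair by position; this only works if both vectors have the same length
restaurants <- data.frame(name = restaurant_names, rating = ratings,
                          stringsAsFactors = FALSE)
```

If the two vectors come back with different lengths on the real page, it is safer to loop over each restaurant's container node and pull the name and rating from inside it.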
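I would suggest getURL from the RCurl package to obtain the document contents. We can then parse them with htmlParse. Sometimes htmlParse has trouble handling certain content when given a URL directly; in those cases, fetching the content with getURL first is recommended.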
url <- "https://www.zomato.com/ncr/restaurants/north-indian"
library(RCurl)
library(XML)
content <- getURL(url)
doc <- htmlParse(content)
summary(doc)
# $nameCounts
#
# div a li span input article h3 meta
# 1337 362 232 212 33 30 30 27
# img script ul link section p br form
# 26 21 20 17 7 6 3 3
# body footer h1 head header html noscript ol
# 1 1 1 1 1 1 1 1
# strong textarea title
# 1 1 1
#
# $numNodes
# [1] 2377
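Once the document is parsed, XPath queries can pull out individual nodes. The following is a rough sketch, not taken from the original answer: it uses a small made-up string in place of the getURL result, and the class name result-title is borrowed from the rvest answer, so treat both as assumptions about the page.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL(url)
content <- "<html><body>
  <a class='result-title' href='/r/1'>Bukhara - ITC Maurya </a>
  <a class='result-title' href='/r/2'>Gulati </a>
  <a class='other' href='/about'>About</a>
</body></html>"

# asText = TRUE tells htmlParse the argument is markup, not a file name or URL
doc <- htmlParse(content, asText = TRUE)

# Select every <a> whose class attribute contains "result-title"
xp <- "//a[contains(@class, 'result-title')]"
titles <- trimws(xpathSApply(doc, xp, xmlValue))       # link text
links  <- xpathSApply(doc, xp, xmlGetAttr, "href")     # href attribute
```

Using contains() rather than an exact match on @class keeps the query working if the element carries additional classes.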
Also, note that readLines does not support https, so the error message there is a little less surprising.
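If a line-by-line character vector is still what you are after, one possible workaround is to fetch the page as a single string (for example with getURL, as above) and split it yourself. The sketch below uses a short made-up string in place of the downloaded content, so it runs without a network connection.

```r
# Hypothetical stand-in for the string returned by RCurl::getURL(url)
content <- "<html>\n<head><title>North Indian</title></head>\n<body></body>\n</html>"

# Split the single string into one element per line, similar to what readLines() returns
page_lines <- strsplit(content, "\n", fixed = TRUE)[[1]]

# Line-oriented tools such as grep() then work as usual
title_line <- grep("<title>", page_lines, value = TRUE)
```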