
Web scraping with R XML package - xpathSApply

I'm trying to extract all the shopping mall names (eg CityPlaza, Fashion Walk) from this website: https://www.discoverhongkong.com/eng/explore/shopping/major-shopping-malls-throughout-city.html

Looking at the HTML, the shopping mall names all appear to be stored in "h5" tags. I therefore used the following code to try to extract them, but it doesn't give me the text I want.

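library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse(), xpathSApply() and xmlValue()

url <- "https://www.discoverhongkong.com/eng/explore/shopping/major-shopping-malls-throughout-city.html"
txt <- getURL(url)
PARSED <- htmlParse(txt)
mall_text <- xpathSApply(PARSED, "//h5", xmlValue)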

I suspect the problem is the path I pass to xpathSApply, as I know very little about HTML. Could anyone help, please?

The shopping mall recommendations are loaded dynamically after the initial page load, so they are not in the HTML that getURL returns. Unfortunately, that means it is not possible to obtain them that way.
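To illustrate why the XPath query is not the problem, here is a minimal offline sketch. Both HTML strings are hypothetical stand-ins (they are not copied from the real site): one imitates what the server sends, the other imitates the page after the browser has run its scripts. The same "//h5" query only finds names in the second one.

```r
library(XML)

# Hypothetical stand-in for the raw server response: the container is empty
# because the mall names are filled in later by JavaScript.
static_html <- '<html><body><div id="tiles"></div></body></html>'

# Hypothetical stand-in for the page as seen in the browser after scripts ran.
rendered_html <- paste0(
  '<html><body><div id="tiles">',
  '<h5>Cityplaza</h5><h5>Fashion Walk</h5>',
  '</div></body></html>'
)

# asText = TRUE tells htmlParse() the argument is HTML content, not a file name.
static_names <- xpathSApply(htmlParse(static_html, asText = TRUE), "//h5", xmlValue)
rendered_names <- xpathSApply(htmlParse(rendered_html, asText = TRUE), "//h5", xmlValue)

length(static_names)  # zero matches: nothing for "//h5" to find
rendered_names        # the two names from the rendered markup
```

getURL only ever sees something like the first string, which is why no amount of tweaking the XPath would help here.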
If you right-click the web-page in your browser, go to 'Inspect element', click on the 'Network' tab and refresh the page, you can see a bunch of JSON/XHR requests being made:

[Screenshot: XHR requests]

One of those URLs (the one used in the code below) returns the information you want in JSON format.

This can easily be loaded into R using, for example, the jsonlite package.

library(jsonlite)

url <- "https://www.discoverhongkong.com/eng/explore/shopping/major-shopping-malls-throughout-city/_jcr_content/root/responsivegrid/dhkContainer/container/recommendationtiles_.recommendation-tiles.recommendationtiles_.json?path=/content/dhk/intl/en/explore/shopping/major-shopping-malls-throughout-city"
result <- read_json(url)
sapply(result$data, function(x) x$title)

This gives:

 [1] "Cityplaza"                "Fashion Walk"            
 [3] "Horizon Plaza"            "Hysan Place"             
 [5] "ifc mall"                 "Island Beverley"         
 [7] "LANDMARK"                 "Lee Garden One - Six"    
 [9] "Lee Theatre and Leighton" "Lee Tung Avenue"         
[11] "Pacific Place"            "Peak Galleria"           
[13] "SOGO Causeway Bay Store"  "Times Square"            
[15] "Western Market"           "WTC"                     
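If some entries in the response lacked a title, sapply would quietly return a list instead of a character vector. A slightly more defensive variant, sketched below, uses vapply so the result is always a character vector. The helper name extract_titles and the example_data list are hypothetical and only mimic the shape of result$data, so the snippet runs without network access.

```r
# Extract the "title" field from each element of a parsed-JSON list.
# Entries without a title become NA, so the result stays a character vector.
extract_titles <- function(items) {
  vapply(
    items,
    function(x) {
      title <- x[["title"]]  # [[ ]] avoids the partial name matching of $
      if (is.null(title)) NA_character_ else as.character(title)
    },
    character(1)
  )
}

# Hypothetical stand-in for result$data; the real response has more fields.
example_data <- list(
  list(title = "Cityplaza"),
  list(title = "Fashion Walk"),
  list(description = "an entry with no title")
)

titles <- extract_titles(example_data)
titles
```

With the real response, the call would be extract_titles(result$data).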

