
Web scraping HTML with rvest and R

I would like to scrape the website https://www.askramar.com/Ponuda . First, I need to collect all the links that lead to the individual car pages. In the HTML structure, the links look like this:

[image: screenshot of the HTML structure]

I tried the following code, but I get an empty object in R:

library(rvest)

url <- "https://www.askramar.com/Ponuda"
html_document <- read_html(url)

# Select every element whose class list contains "vozilo" and read its href.
links <- html_document %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "vozilo", " "))]') %>%
  html_attr(name = "href")

Is the content loaded by JavaScript on the page? Please help! Thanks!

Yes, the page uses JavaScript to load the content you are interested in, which is why read_html() on the main URL finds nothing. However, it loads that content simply by sending an XHR GET request to https://www.askramar.com/Ajax/GetResults.cshtml . You can do the same:

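library(rvest)  # also provides %>% and read_html()

url <- "https://www.askramar.com/Ajax/GetResults.cshtml?stranica="

# Request result pages 0 to 44 and collect the car links from each one.
links <- list()
for (i in 1:45) {
  links[[i]] <- httr::GET(paste0(url, i - 1)) %>%
    read_html() %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}

# Flatten the per-page results into a single character vector.
links <- do.call("c", links)

print(links)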


# [1] "Vozilo?id=17117" "Vozilo?id=17414" "Vozilo?id=17877" "Vozilo?id=17834"
# [5] "Vozilo?id=17999" "Vozilo?id=18395" "Vozilo?id=17878" "Vozilo?id=16256"
# [9] "Vozilo?id=17465" "Vozilo?id=17560" "Vozilo?id=17912" "Vozilo?id=18150"
#[13] "Vozilo?id=18131" "Vozilo?id=17397" "Vozilo?id=18222" "Vozilo?id=17908"
#[17] "Vozilo?id=18333" "Vozilo?id=17270" "Vozilo?id=18105" "Vozilo?id=16803"
#[21] "Vozilo?id=16804" "Vozilo?id=17278" "Vozilo?id=17887" "Vozilo?id=17939"
# ...plus 1037 further elements
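
The hrefs printed above are relative, so they need to be turned into full URLs before each car page can be requested. A minimal sketch in base R, assuming the links resolve against the site root https://www.askramar.com/ (this is an assumption, and the helper name to_absolute is made up for illustration):

```r
# Hypothetical helper: prefix relative hrefs with the site root.
# Assumption: "Vozilo?id=..." is relative to https://www.askramar.com/.
to_absolute <- function(hrefs, base = "https://www.askramar.com/") {
  already_absolute <- grepl("^https?://", hrefs)
  # Drop any leading slash so the base's trailing slash is not doubled.
  cleaned <- sub("^/+", "", hrefs)
  ifelse(already_absolute, hrefs, paste0(base, cleaned))
}

example_links <- c("Vozilo?id=17117", "/Vozilo?id=17414")
car_urls <- to_absolute(example_links)
print(car_urls)
```

With the real data, to_absolute(links) would give a vector that can be passed to read_html() one element at a time.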

If you inspect the network traffic on the page, you will see that it sends GET requests with many query parameters, the most important being 'stranica' (the page number). Using the above information, I did the following:

library(rvest)

# Page numbers to request.
stranice <- 1:3

# Return the car links found on one result page.
askramar_scrap <- function(stranica) {
  # All filter parameters are left empty; only the page number is set.
  url <- paste0("https://www.askramar.com/Ajax/GetResults.cshtml?stanje=&filter=&lokacija=&",
                "pojam=&marka=&model=&godinaOd=&godinaDo=&cijenaOd=&cijenaDo=&snagaOd=&snagaDo=&",
                "karoserija=&mjenjac=&boja=&pogon4x4=&sifra=&stranica=", stranica, "&sort=")
  html_document <- read_html(url)
  html_document %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}

# Scrape every page, flatten the results, and drop duplicate links.
links <- lapply(stranice, askramar_scrap)
links <- unlist(links)
links <- unique(links)

Hope that is what you need.
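
Both answers hard-code the number of pages (45 and 3). One possible alternative is to keep requesting pages until one comes back without any links. The sketch below rests on assumptions the thread does not confirm: that page numbering starts at 0 (as in the first answer's i - 1), and that a page past the end of the listing contains no "Vozilo" links. The names fetch_links, collect_links and fake_fetch are made up for illustration.

```r
# Fetch the car links from one result page (needs network access when called).
fetch_links <- function(page) {
  url <- paste0("https://www.askramar.com/Ajax/GetResults.cshtml?stranica=", page)
  doc <- xml2::read_html(url)
  nodes <- rvest::html_nodes(doc, xpath = '//a[contains(@href, "Vozilo")]')
  rvest::html_attr(nodes, "href")
}

# `fetch` is any function mapping a page number to a character vector of
# links, so the loop can be tried out without touching the site.
collect_links <- function(fetch, first_page = 0, max_pages = 200) {
  all_links <- character(0)
  for (page in seq(first_page, length.out = max_pages)) {
    page_links <- fetch(page)
    if (length(page_links) == 0) break  # assumed end of the listing
    all_links <- c(all_links, page_links)
  }
  unique(all_links)
}

# Offline dry run with a stand-in fetcher that serves two pages, then nothing.
fake_pages <- list(c("Vozilo?id=1", "Vozilo?id=2"), c("Vozilo?id=2", "Vozilo?id=3"))
fake_fetch <- function(page) {
  if (page < length(fake_pages)) fake_pages[[page + 1]] else character(0)
}
demo_links <- collect_links(fake_fetch)
print(demo_links)
```

Against the live site the call would be collect_links(fetch_links). The max_pages cap is only a guard so the loop cannot run forever if the stop condition never triggers.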
