Web scraping html with rvest and R
I would like to web scrape this web site: https://www.askramar.com/Ponuda. First, I need to scrape all the links that lead to each car page. The extended links look like this in the html structure:

I tried the following code, but I get an empty object in R:
library(rvest)

url <- "https://www.askramar.com/Ponuda"
html_document <- read_html(url)
links <- html_document %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "vozilo", " "))]') %>%
  html_attr(name = "href")
# links comes back empty: character(0)
Is the content loaded by JavaScript on the web page? Please help! Thanks!
Yes, the page uses JavaScript to load the content you are interested in. However, it does this simply by sending an XHR GET request to https://www.askramar.com/Ajax/GetResults.cshtml. You can do the same:
library(rvest)

url <- "https://www.askramar.com/Ajax/GetResults.cshtml?stranica="

# Request each results page (pages are zero-indexed) and collect the links
links <- list()
for (i in 1:45) {
  links[[i]] <- httr::GET(paste0(url, i - 1)) %>%
    read_html() %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}
links <- do.call("c", links)
print(links)
# [1] "Vozilo?id=17117" "Vozilo?id=17414" "Vozilo?id=17877" "Vozilo?id=17834"
# [5] "Vozilo?id=17999" "Vozilo?id=18395" "Vozilo?id=17878" "Vozilo?id=16256"
# [9] "Vozilo?id=17465" "Vozilo?id=17560" "Vozilo?id=17912" "Vozilo?id=18150"
#[13] "Vozilo?id=18131" "Vozilo?id=17397" "Vozilo?id=18222" "Vozilo?id=17908"
#[17] "Vozilo?id=18333" "Vozilo?id=17270" "Vozilo?id=18105" "Vozilo?id=16803"
#[21] "Vozilo?id=16804" "Vozilo?id=17278" "Vozilo?id=17887" "Vozilo?id=17939"
# ...plus 1037 further elements
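The links returned are relative ("Vozilo?id=..."), so a likely follow-up step, not part of the original answer, is to resolve them against the site root before reading each detail page. A minimal sketch, assuming the base URL https://www.askramar.com/ (the detail-page selectors are not shown here, so parsing is left generic):

```r
library(rvest)

# A couple of the relative links returned above
links <- c("Vozilo?id=17117", "Vozilo?id=17414")

# Resolve against the assumed site root
base_url <- "https://www.askramar.com/"
full_urls <- paste0(base_url, links)
print(full_urls)

# Each detail page could then be fetched with read_html(full_urls[i])
# and parsed with html_nodes()/html_text() once its selectors are known.
```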
If you inspect the network traffic on the page, you will see it sends GET requests with many query parameters, the most important being `stranica` (the page number). Using the above information, I did the following:
library(rvest)

stranice <- 1:3

askramar_scrap <- function(stranica) {
  url <- paste0("https://www.askramar.com/Ajax/GetResults.cshtml?stanje=&filter=&lokacija=&",
                "pojam=&marka=&model=&godinaOd=&godinaDo=&cijenaOd=&cijenaDo=&snagaOd=&snagaDo=&",
                "karoserija=&mjenjac=&boja=&pogon4x4=&sifra=&stranica=", stranica, "&sort=")
  html_document <- read_html(url)
  html_document %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}
links <- lapply(stranice, askramar_scrap)
links <- unlist(links)
links <- unique(links)
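Since every link carries a numeric id, it can also be handy (again, an optional extra, not part of the original answer) to pull the ids out, for example to detect duplicates or to name output files. A small base-R sketch:

```r
# Relative links in the form returned by the scraper above
links <- c("Vozilo?id=17117", "Vozilo?id=17414", "Vozilo?id=17117")

# Strip the fixed prefix, keep the numeric id, and deduplicate
ids <- sub("^Vozilo\\?id=", "", links)
ids <- unique(as.integer(ids))
print(ids)
# 17117 17414
```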
Hope that is what you need.