Web scraping html with rvest and R
I would like to web scrape this web site: https://www.askramar.com/Ponuda. First, I need to scrape all the links that lead to each car page. The extended links look like this in the html structure:

I tried the following code, but I get an empty object in R:
library(rvest)

url <- "https://www.askramar.com/Ponuda"
html_document <- read_html(url)
links <- html_document %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "vozilo", " "))]') %>%
  html_attr(name = "href")
# links comes back empty: character(0)
Is the content loaded by JavaScript on the web page? Please help! Thanks!
Yes, the page uses JavaScript to load the content you are interested in. However, it does this simply by sending an XHR GET request to https://www.askramar.com/Ajax/GetResults.cshtml. You can do the same:
library(rvest)

url <- "https://www.askramar.com/Ajax/GetResults.cshtml?stranica="

# Request each results page (pages are zero-indexed) and collect the links
links <- list()
for (i in 1:45) {
  links[[i]] <- httr::GET(paste0(url, i - 1)) %>%
    read_html() %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}
links <- do.call("c", links)
print(links)
# [1] "Vozilo?id=17117" "Vozilo?id=17414" "Vozilo?id=17877" "Vozilo?id=17834"
# [5] "Vozilo?id=17999" "Vozilo?id=18395" "Vozilo?id=17878" "Vozilo?id=16256"
# [9] "Vozilo?id=17465" "Vozilo?id=17560" "Vozilo?id=17912" "Vozilo?id=18150"
#[13] "Vozilo?id=18131" "Vozilo?id=17397" "Vozilo?id=18222" "Vozilo?id=17908"
#[17] "Vozilo?id=18333" "Vozilo?id=17270" "Vozilo?id=18105" "Vozilo?id=16803"
#[21] "Vozilo?id=16804" "Vozilo?id=17278" "Vozilo?id=17887" "Vozilo?id=17939"
# ...plus 1037 further elements
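The links returned are relative ("Vozilo?id=..."), so a likely follow-up step, not part of the original answer, is to resolve them against the site root before reading each detail page. A minimal sketch, assuming the base URL https://www.askramar.com/ (the detail-page selectors are not shown here, so parsing is left generic):

```r
library(rvest)

# A couple of the relative links returned above
links <- c("Vozilo?id=17117", "Vozilo?id=17414")

# Resolve against the assumed site root
base_url <- "https://www.askramar.com/"
full_urls <- paste0(base_url, links)
print(full_urls)

# Each detail page could then be fetched with read_html(full_urls[i])
# and parsed with html_nodes()/html_text() once its selectors are known.
```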
If you inspect the network traffic on the page, you will see it sends GET requests with many query parameters, the most important being `stranica` (the page number). Using the above information, I did the following:
library(rvest)

stranice <- 1:3

askramar_scrap <- function(stranica) {
  url <- paste0("https://www.askramar.com/Ajax/GetResults.cshtml?stanje=&filter=&lokacija=&",
                "pojam=&marka=&model=&godinaOd=&godinaDo=&cijenaOd=&cijenaDo=&snagaOd=&snagaDo=&",
                "karoserija=&mjenjac=&boja=&pogon4x4=&sifra=&stranica=", stranica, "&sort=")
  html_document <- read_html(url)
  html_document %>%
    html_nodes(xpath = '//a[contains(@href, "Vozilo")]') %>%
    html_attr(name = "href")
}
links <- lapply(stranice, askramar_scrap)
links <- unlist(links)
links <- unique(links)
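Since every link carries a numeric id, it can also be handy (again, an optional extra, not part of the original answer) to pull the ids out, for example to detect duplicates or to name output files. A small base-R sketch:

```r
# Relative links in the form returned by the scraper above
links <- c("Vozilo?id=17117", "Vozilo?id=17414", "Vozilo?id=17117")

# Strip the fixed prefix, keep the numeric id, and deduplicate
ids <- sub("^Vozilo\\?id=", "", links)
ids <- unique(as.integer(ids))
print(ids)
# 17117 17414
```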
Hope that is what you need.