Optimize web scraping with RSelenium
I am scraping a dynamic webpage and would like to speed up the process, since it is very slow. The page displays a series of sales with their details, and more sales load as you scroll down, although the total number of sales is finite. To avoid scrolling, I increased the window size so that almost every sale loads at once. However, this takes a while because the page carries a lot of information and images. The information I am extracting is the price, the asset name, and the link associated with the asset (the one you follow when you click on the image).
My goal is to optimize this process as much as possible. One way would be to skip loading the images, since I don't need them, but I could not find a way to do that with Firefox.
Any improvement would be greatly appreciated.
library(RSelenium)
library(rvest)
url <- "https://cnft.io/marketplace?project=Boss%20Cat%20Rocket%20Club&sort=_id:-1&type=listing,offer"
# Run Firefox without a visible window
exCap <- list("moz:firefoxOptions" = list(args = list('--headless')))
rD <- rsDriver(browser = "firefox", port = as.integer(sample(4000:4700, 1)),
               verbose = FALSE, extraCapabilities = exCap)
remDr <- rD[["client"]]
# Use a huge window so that (almost) all sales load without scrolling
remDr$setWindowSize(30000, 30000)
remDr$navigate(url)
# Wait five minutes for the page to finish loading
Sys.sleep(300)
html <- remDr$getPageSource()[[1]]
remDr$close()
html <- read_html(html)
Well, after some digging through that website, I found an API for all the listings: https://api.cnft.io/market/listings. It takes a POST request and returns paginated JSON. We can use httr to send such requests. Here is a small script for your web scraping task.
api_link <- "https://api.cnft.io/market/listings"
project <- "Boss Cat Rocket Club"

# Request one page of listings and parse the JSON response
query <- function(page, url, project) {
  httr::content(httr::POST(
    url = url,
    body = list(
      search = "",
      types = c("listing", "offer"),
      project = project,
      sort = list(`_id` = -1L),
      priceMin = NULL,
      priceMax = NULL,
      page = page,
      verified = TRUE,
      nsfw = FALSE,
      sold = FALSE,
      smartContract = FALSE
    ),
    encode = "json"
  ), simplifyVector = TRUE)
}

# Request pages one by one until an empty page comes back
query_all <- function(url, project) {
  # "count" from the first page is used as an upper bound on the number of pages
  n <- query(1L, url, project)[["count"]]
  out <- vector("list", n)
  filled <- 0L  # number of non-empty pages stored so far
  for (i in seq_len(n)) {
    res <- query(i, url, project)[["results"]]
    # An empty page means there is nothing left to fetch
    if (length(res) < 1L) break
    out[[i]] <- res
    filled <- i
  }
  out[seq_len(filled)]
}

# Keep only the asset name, the price, and the link to the listing
collect_data <- function(results) {
  dplyr::tibble(
    asset_id = results[["asset"]][["assetId"]],
    price = results[["price"]],
    link = paste0("https://cnft.io/token/", results[["_id"]])
  )
}

# The native pipe |> requires R 4.1 or later
system.time(
  dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()
)
dt
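The stopping rule behind query_all (keep requesting pages until one comes back empty) can be tried without touching the network. The sketch below is only an illustration: fetch_page is a hypothetical stand-in for query() that serves five made-up listings in pages of two, and fetch_all is a hypothetical, simplified variant of query_all, not the code used above.

```r
# Hypothetical stand-in for query(): serves 5 fake listings in pages of 2.
fake_listings <- paste0("asset", 1:5)

fetch_page <- function(page, page_size = 2L) {
  start <- (page - 1L) * page_size + 1L
  # Past the last listing: return an empty page
  if (start > length(fake_listings)) return(character(0))
  end <- min(start + page_size - 1L, length(fake_listings))
  fake_listings[start:end]
}

# Request pages until an empty one is returned, then stop.
# max_pages is a safety cap (assumed value) against endless loops.
fetch_all <- function(fetch, max_pages = 1000L) {
  out <- list()
  for (page in seq_len(max_pages)) {
    res <- fetch(page)
    if (length(res) < 1L) break
    out[[page]] <- res
  }
  out
}

pages <- fetch_all(fetch_page)
length(pages)   # number of non-empty pages
unlist(pages)   # all listings, in page order
```

Stopping on the first empty page means the loop does not need to know the page size or the exact number of pages in advance.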
Output (it takes about 12 seconds to finish):
> system.time(
+   dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()
+ )
   user  system elapsed
   0.78    0.00   12.33
> dt
# A tibble: 2,161 x 3
   asset_id                     price link
   <chr>                        <dbl> <chr>
 1 BossCatRocketClub1373    222000000 https://cnft.io/token/61ce22eb4185f57d50190079
 2 BossCatRocketClub4639    380000000 https://cnft.io/token/61ce229b9163f2db80db98fe
 3 BossCatRocketClub5598    505000000 https://cnft.io/token/61ce22954185f57d5018e2ff
 4 BossCatRocketClub2673    187000000 https://cnft.io/token/61ce2281ceed93ea12ae32ec
 5 BossCatRocketClub1721    350000000 https://cnft.io/token/61ce2281398627cc52c5844c
 6 BossCatRocketClub673     300000000 https://cnft.io/token/61ce22724185f57d5018d645
 7 BossCatRocketClub5915 200000000000 https://cnft.io/token/61ce2241398627cc52c56eae
 8 BossCatRocketClub5699    350000000 https://cnft.io/token/61ce21fa398627cc52c55644
 9 BossCatRocketClub4570    350000000 https://cnft.io/token/61ce21ef4185f57d5018a9d4
10 BossCatRocketClub6125    250000000 https://cnft.io/token/61ce21e49163f2db80db58dd
# ... with 2,151 more rows
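On the original question of skipping images in Firefox: if a browser session is still needed, one option that may help is to set a Firefox preference through the prefs entry of moz:firefoxOptions. The sketch below only builds the capabilities list. start_firefox is a hypothetical helper name, and whether the site still renders its listings with images blocked is an assumption that has not been checked here.

```r
# Capabilities only; nothing is launched when this block is run.
# "permissions.default.image" = 2 is the Firefox preference value for
# blocking all images (assumption: the listings still render without them).
exCap <- list(
  "moz:firefoxOptions" = list(
    args  = list("--headless"),
    prefs = list("permissions.default.image" = 2L)
  )
)

# Hypothetical helper: starts the driver with the capabilities above.
start_firefox <- function(caps, port = as.integer(sample(4000:4700, 1))) {
  RSelenium::rsDriver(browser = "firefox", port = port,
                      verbose = FALSE, extraCapabilities = caps)
}

# Usage (requires RSelenium, Firefox and geckodriver):
# rD <- start_firefox(exCap)
# remDr <- rD[["client"]]
```

This keeps the same headless setup as in the question and adds only the preference, so it can be swapped in for the original exCap.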