
Optimize web scraping with RSelenium

I am scraping a dynamic webpage and would like to speed the process up, since it is very slow. The page displays a series of sales with their details, and more sales load as you scroll down, although the total number of sales is finite. To work around this, I increased the window size so that almost every sale loads without scrolling. However, the page then takes a long time to load because it contains a lot of information and images. The information I am extracting is the price, the asset name, and the link associated with each asset (the one you follow when you click on the image).

My goal is to optimize this process as much as possible. One option would be to skip loading the images, since I don't need them, but I could not find a way to do that with Firefox.

Any improvement would be greatly appreciated.

library(RSelenium)
library(rvest)

url <- "https://cnft.io/marketplace?project=Boss%20Cat%20Rocket%20Club&sort=_id:-1&type=listing,offer"

exCap <- list("moz:firefoxOptions" = list(args = list('--headless'))) # Hide browser --headless
rD <- rsDriver(browser = "firefox", port = as.integer(sample(4000:4700, 1)),
               verbose = FALSE, extraCapabilities = exCap)
remDr <- rD[["client"]]
remDr$setWindowSize(30000, 30000)
remDr$navigate(url)
Sys.sleep(300)
html <- remDr$getPageSource()[[1]]
remDr$close()

html <- read_html(html)
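
On the image-loading point raised in the question, here is a minimal sketch of one possible approach. It assumes that geckodriver honours a `prefs` entry inside `moz:firefoxOptions` and that the Firefox preference `permissions.default.image` set to 2 blocks image loading. The name `ex_cap` is made up for illustration, and none of this has been tried against the page in question, so treat it as a starting point rather than a verified fix.

```r
# Capabilities for a headless Firefox session that should not load images.
# Assumption: the "permissions.default.image" preference set to 2 blocks images.
ex_cap <- list(
  "moz:firefoxOptions" = list(
    args  = list("--headless"),
    prefs = list("permissions.default.image" = 2L)
  )
)

# The list is passed the same way as exCap in the question, for example:
# rD <- rsDriver(browser = "firefox", verbose = FALSE, extraCapabilities = ex_cap)
```

Even if the images are skipped, the page still has to render every sale, so the gain may be modest.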

Well, after some digging through that website, I found an API for all the listings: https://api.cnft.io/market/listings. It takes a POST request and returns paginated JSON. We can use httr to send such requests. Here is a small script for your web scraping task.

api_link <- "https://api.cnft.io/market/listings"
project <- "Boss Cat Rocket Club"

query <- function(page, url, project) {
  httr::content(httr::POST(
    url = url, 
    body = list(
      search = "", 
      types = c("listing", "offer"), 
      project = project, 
      sort = list(`_id` = -1L), 
      priceMin = NULL, 
      priceMax = NULL, 
      page = page, 
      verified = TRUE, 
      nsfw = FALSE, 
      sold = FALSE, 
      smartContract = FALSE
    ), 
    encode = "json"
  ), simplifyVector = TRUE)
}
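
To see roughly what request body `query()` sends, the same list can be serialised by hand. This is only a sketch: it assumes httr's `encode = "json"` serialises through jsonlite with `auto_unbox = TRUE` (which is my understanding of httr, not something stated above), and it leaves out the two `NULL` price fields to keep the example simple.

```r
library(jsonlite)

# Same fields as the body in query(), minus priceMin / priceMax
body <- list(
  search = "",
  types = c("listing", "offer"),
  project = "Boss Cat Rocket Club",
  sort = list(`_id` = -1L),
  page = 1L,
  verified = TRUE,
  nsfw = FALSE,
  sold = FALSE,
  smartContract = FALSE
)

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays
json <- toJSON(body, auto_unbox = TRUE)
cat(prettify(json))
```

Only `page` changes from one request to the next; the other fields mirror the filters in the marketplace URL from the question.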

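query_all <- function(url, project) {
  # "count" from the first response is used as an upper bound on the pages to request
  n <- query(1L, url, project)[["count"]]
  out <- vector("list", n)
  offset <- 0L
  for (i in seq_len(n)) {
    out[[i]] <- query(i, url, project)[["results"]]
    # An empty page means we have run past the last one: drop it and stop
    if (length(out[[i]]) < 1L) {
      offset <- -1L
      break
    }
  }
  out[seq_len(i + offset)]
}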
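
The stopping rule in `query_all()` (keep requesting pages until one comes back empty) can be tried without touching the network. The sketch below is a simplified variant, not the function above: `fetch_all_pages`, `fake_fetch`, `fake_rows`, and `max_pages` are hypothetical names, and the fake data stands in for the real API.

```r
# fetch_page is any function(page) that returns that page's results,
# or something with zero rows / zero length once the pages run out.
fetch_all_pages <- function(fetch_page, max_pages = 1000L) {
  out <- list()
  for (page in seq_len(max_pages)) {
    res <- fetch_page(page)
    if (NROW(res) < 1L) break  # empty page: nothing left to fetch
    out[[length(out) + 1L]] <- res
  }
  out
}

# Stand-in for the API: 5 fake rows served 2 per page
fake_rows <- data.frame(id = 1:5, price = c(10, 20, 30, 40, 50))
fake_fetch <- function(page, per_page = 2L) {
  start <- (page - 1L) * per_page + 1L
  if (start > nrow(fake_rows)) return(fake_rows[0, ])
  fake_rows[start:min(start + per_page - 1L, nrow(fake_rows)), ]
}

pages <- fetch_all_pages(fake_fetch)
all_rows <- do.call(rbind, pages)
```

With the real API, `fetch_page` could be something like `function(p) query(p, api_link, project)[["results"]]`, reusing the `query()` helper defined earlier.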

collect_data <- function(results) {
  dplyr::tibble(
    asset_id = results[["asset"]][["assetId"]],
    price = results[["price"]],
    link = paste0("https://cnft.io/token/", results[["_id"]])
  )
}

system.time(
  dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()  
)
dt

Output (it takes about 12 seconds to finish):

> system.time(
+   dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()  
+ )
   user  system elapsed 
   0.78    0.00   12.33 
> dt
# A tibble: 2,161 x 3
   asset_id                     price link                                          
   <chr>                        <dbl> <chr>                                         
 1 BossCatRocketClub1373    222000000 https://cnft.io/token/61ce22eb4185f57d50190079
 2 BossCatRocketClub4639    380000000 https://cnft.io/token/61ce229b9163f2db80db98fe
 3 BossCatRocketClub5598    505000000 https://cnft.io/token/61ce22954185f57d5018e2ff
 4 BossCatRocketClub2673    187000000 https://cnft.io/token/61ce2281ceed93ea12ae32ec
 5 BossCatRocketClub1721    350000000 https://cnft.io/token/61ce2281398627cc52c5844c
 6 BossCatRocketClub673     300000000 https://cnft.io/token/61ce22724185f57d5018d645
 7 BossCatRocketClub5915 200000000000 https://cnft.io/token/61ce2241398627cc52c56eae
 8 BossCatRocketClub5699    350000000 https://cnft.io/token/61ce21fa398627cc52c55644
 9 BossCatRocketClub4570    350000000 https://cnft.io/token/61ce21ef4185f57d5018a9d4
10 BossCatRocketClub6125    250000000 https://cnft.io/token/61ce21e49163f2db80db58dd
# ... with 2,151 more rows
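
The prices above are large integers. Assuming the API reports them in lovelace (1 ADA = 1,000,000 lovelace), which the response shown here does not confirm, a conversion to ADA could look like this. The variable names are illustrative and the three values are copied from the first rows of the output.

```r
# Assumption: prices are in lovelace, the smallest unit of ADA
lovelace_per_ada <- 1e6

price_lovelace <- c(222000000, 380000000, 505000000)
price_ada <- price_lovelace / lovelace_per_ada
print(price_ada)
```

Applied to the full table, the same division would be `dt$price / 1e6`.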
