How to store the results of a loop in R when web scraping with rvest

Question

I am trying to import data from the same website, but from different tabs.

# web scraping for IDH (Human Development Index)

algo <- c(1996:2017)

idh_link <- c(paste0("https://datosmacro.expansion.com/idh?anio=", 1996:2017))
final <- vector(length = length(idh_link))

for (i in seq_along(algo)) {
idh_desc <- read_html(idh_link[i])

pais <- idh_desc %>% 
  html_nodes("td:nth-child(1), .header:nth-child(1)") %>% 
  html_text()

idhaño <- idh_desc %>% 
  html_nodes("td:nth-child(2), .header:nth-child(2)") %>% 
  html_text()

final[i] <- tibble(pais, idhaño)
}

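In this case it only retrieves the information from the first link, and it does not create a tibble at the end of the loop (the idea is to inner join all of the tibbles).

I am using library(rvest) for the web scraping.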

Answer 1

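A vector cannot store data frames or tibbles. An atomic vector can only hold atomic values such as integers, strings, and so on.

To store a series of data frames, it is best to use a list.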
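To illustrate the difference, here is a minimal base R sketch using a made-up two-row data frame. The names `slots`, `holder`, and `df` are illustrative assumptions, not part of the question. Assigning a data frame into one slot of a plain vector does not raise an error: R coerces the vector to a list, issues a warning, and keeps only the first column.

```r
# Made-up data standing in for one scraped page (assumption, not real values)
df <- data.frame(pais = c("A", "B"), idh = c(0.9, 0.8))

# A plain vector, as in the question: vector(length = 2) is c(FALSE, FALSE)
slots <- vector(length = 2)

# `[<-` tries to fit a 2-column data frame into 1 slot. R warns that the
# replacement length does not match, so only the first column survives.
suppressWarnings(slots[1] <- df)

# A list stores the whole data frame as a single element via `[[<-`
holder <- list()
holder[["1996"]] <- df

str(slots)   # a list whose first element is just the `pais` column
str(holder)  # a list whose element "1996" is the full data frame
```

With `[[<-` on a list, each data frame is kept whole as one element, which is what the loop below relies on.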

library(rvest)   # read_html(), html_nodes(), html_text() and the %>% pipe
library(tibble)  # tibble()

algo <- 1996:2017

idh_link <- paste0("https://datosmacro.expansion.com/idh?anio=", algo)
# a list can hold a series of data frames
final <- list()

for (i in seq_along(algo)) {
   idh_desc <- read_html(idh_link[i])
   
   pais <- idh_desc %>% 
      html_nodes("td:nth-child(1), .header:nth-child(1)") %>% 
      html_text()
   
   idhaño <- idh_desc %>% 
      html_nodes("td:nth-child(2), .header:nth-child(2)") %>% 
      html_text()
   
   # name each list element after its year
   final[[as.character(algo[i])]] <- tibble(pais, idhaño)

   # pause between requests so as not to overload the server
   Sys.sleep(1)
}

To combine all of the data frames stored in the list, I suggest using bind_rows() or bind_cols() from the dplyr package.
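As a rough sketch of the combining step, the example below uses a small hand-made list in place of the scraped `final`, so the values and the column name `idh` are assumptions. It shows two options: stacking the years with `bind_rows()`, and the inner join on `pais` that the question asks for. Columns are renamed per year before joining so that they do not collide.

```r
library(dplyr)

# Hypothetical stand-in for the scraped list: one tibble per year, named by year
final <- list(
  "1996" = tibble(pais = c("A", "B"), idh = c("0.70", "0.80")),
  "1997" = tibble(pais = c("A", "B"), idh = c("0.71", "0.82"))
)

# Option 1: stack all years into one long table; .id stores the list names
idh_long <- bind_rows(final, .id = "anio")

# Option 2: give each year's value column a unique name, then inner join on pais
idh_by_year <- Map(
  function(df, year) setNames(df, c("pais", paste0("idh_", year))),
  final, names(final)
)
idh_wide <- Reduce(function(x, y) inner_join(x, y, by = "pais"), idh_by_year)

print(idh_long)
print(idh_wide)
```

The long format is usually easier to filter and plot, while the joined wide table gives one row per country. Since `html_text()` returns character strings, the scraped values would likely still need converting to numbers.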

Problem description

1 solution

Solution 1
0 2021-05-16 19:00:41
