
How to continue scraping web data after receiving an error within a loop in R?

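I am trying to scrape data from a website, which requires looping through multiple URLs. Some of the URLs give me an error, which is fine, but I need the loop to skip to the next one. I tried using an if statement inside the loop to detect the error, but I end up with repeated rows in my final dataframe. Any ideas on the best way to go about this? I have read about tryCatch, but I am confused about how to apply it to my specific problem. Thanks.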

Here is the code:

library(xml2)
library(rvest)
library(tibble)
library(httr)
library(stringr)

maindf = data.frame(matrix(ncol = 6, nrow = 0))
colnames(maindf) <- c("Name", "DOB", "Country", "Weight", "Height", "ID")

for(i in 1:100){
  
base_url <- paste0("https://worldrowing.com/athlete/",i,"")

r = GET(base_url)
status = status_code(r)
if(status != 500){

  base_webpage <- read_html(base_url)

  css_selector <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > header"
  css_selector2 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__dob"
  css_selector3 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__location"
  css_selector4 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__weight"
  css_selector5 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__height"

  athleteName = base_webpage %>% html_element(css = css_selector) %>% html_text()
  athleteName = str_trim(athleteName)

  athleteDOB = base_webpage %>% html_element(css = css_selector2) %>% html_text()
  athleteDOB = str_trim(athleteDOB)

  athleteCountry = base_webpage %>% html_element(css = css_selector3) %>% html_text()
  athleteCountry = str_trim(athleteCountry)

  athleteWeight = base_webpage %>% html_element(css = css_selector4) %>% html_text()
  athleteWeight = str_trim(athleteWeight)
  athleteWeight = str_remove_all(athleteWeight, "\\D+")

  athleteHeight = base_webpage %>% html_element(css = css_selector5) %>% html_text()
  athleteHeight = str_trim(athleteHeight)
  athleteHeight = str_remove_all(athleteHeight, "\\D+")

  athleteID = str_remove_all(base_url, "\\D+")


  #create dataframe
  df <- list(col1 = athleteName, col2 = athleteDOB, col3 = athleteCountry, col4 = athleteWeight, col5 = athleteHeight, col6 = athleteID)

  tempdf <- as.data.frame(df)

  maindf = rbind(maindf,tempdf)

   } else {
  
  base_url <- paste0("https://worldrowing.com/athlete/",i+1,"")
  
  base_webpage <- read_html(base_url)
  
  css_selector <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > header"
  css_selector2 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__dob" 
  css_selector3 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__location"
  css_selector4 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__weight"
  css_selector5 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__height"
  
  athleteName = base_webpage %>% html_element(css = css_selector) %>% html_text()
  athleteName = str_trim(athleteName)
  
  athleteDOB = base_webpage %>% html_element(css = css_selector2) %>% html_text()
  athleteDOB = str_trim(athleteDOB)
  
  athleteCountry = base_webpage %>% html_element(css = css_selector3) %>% html_text()
  athleteCountry = str_trim(athleteCountry)
  
  athleteWeight = base_webpage %>% html_element(css = css_selector4) %>% html_text()
  athleteWeight = str_trim(athleteWeight)
  athleteWeight = str_remove_all(athleteWeight, "\\D+")
  
  athleteHeight = base_webpage %>% html_element(css = css_selector5) %>% html_text()
  athleteHeight = str_trim(athleteHeight)
  athleteHeight = str_remove_all(athleteHeight, "\\D+")
  
  athleteID = str_remove_all(base_url, "\\D+")
  
  
  #create dataframe
  df <- list(col1 = athleteName, col2 = athleteDOB, col3 = athleteCountry, col4 = athleteWeight, col5 = athleteHeight, col6 = athleteID)
  
  tempdf <- as.data.frame(df)
 
  maindf = rbind(maindf,tempdf)
}

}
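
The repeated rows in the loop above come from the else branch: when ID i fails, it scrapes ID i + 1 and appends it, and the next iteration then scrapes i + 1 again. A minimal sketch of the tryCatch route mentioned in the question follows. It runs offline: scrape_one is a hypothetical stand-in for the real rvest code, and the failing IDs 3 and 5 are made up for illustration.

```r
# Hypothetical stand-in for the real scraper: it fails for some IDs, the way
# the site does for athletes that do not exist. Swap in the rvest code here.
scrape_one <- function(id) {
  if (id %in% c(3, 5)) stop("HTTP error 500 for ID ", id)
  data.frame(Name = paste("Athlete", id), ID = id)
}

rows <- list()
for (i in 1:6) {
  result <- tryCatch(
    scrape_one(i),
    error = function(e) {
      message("Skipping ID ", i, ": ", conditionMessage(e))
      NULL  # tell the loop that this ID failed
    }
  )
  if (is.null(result)) next  # just move on; do not scrape i + 1 here
  rows[[length(rows) + 1]] <- result
}

# Bind once at the end instead of growing the data frame inside the loop.
maindf <- do.call(rbind, rows)
print(maindf$ID)
```

Each ID is attempted exactly once, so a failure can no longer cause the following ID to be appended twice.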

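A looping approach with purrr. You may find this more readable and easier to work with. It skips IDs that do not exist by wrapping the scraper in possibly(), as demonstrated by test_index, which includes the non-existent IDs 9191919 and 8888888.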

library(tidyverse)
library(rvest)

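get_athlete <- function(id) {
  cat("Scraping ID", id, "\n")
  
  # read_html() raises an error if the page cannot be fetched (e.g. unknown ID)
  page <- paste0("https://worldrowing.com/athlete/", id) %>%
    read_html()
  
  # Return a one-row tibble; fields missing from the page come back as NA
  tibble(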
    name = page %>%
      html_element(".tf-allcaps.tf-bold") %>%
      html_text2(),
    dob = page %>%
      html_element(".biography__dob") %>%
      html_text2() %>%
      as.Date("%d/%m/%Y"),
    country = page %>%
      html_element(".biography__location") %>%
      html_text2(),
    weight = page %>%
      html_element(".biography__weight") %>%
      html_text2(),
    height = page %>%
      html_element(".biography__height") %>%
      html_text2(),
    athlete_id = id
  )
}

test_index <- c(101, 2, 9191919, 3, 8888888, 5, 6, 100, 1, 1)

df <- map_df(test_index, possibly(get_athlete, quiet = TRUE, 
                            otherwise = NULL))

# A tibble: 8 x 6
  name              dob        country       weight height athlete_id
  <chr>             <date>     <chr>         <chr>  <chr>       <dbl>
1 Constanze Ahrendt 1973-01-09 Germany       59kg   175cm         101
2 Jan Aakernes      1899-12-30 Norway        NA     NA              2
3 Ville Aaltonen    1974-11-18 Finland       NA     NA              3
4 Robert Williams   1959-04-05 Great Britain NA     NA              5
5 Soeren Aasmul     1899-12-30 Denmark       NA     NA              6
6 Bernd Ahrendt     1899-12-30 Germany       NA     NA            100
7 Jesper A. Aagard  1985-09-07 Denmark       47kg   160cm           1
8 Jesper A. Aagard  1985-09-07 Denmark       47kg   160cm           1
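
To see why the two non-existent IDs drop out, it may help to look at a rough base-R analogue of possibly(). This is only a sketch of the idea, not purrr's actual implementation. possibly_base and fake_get_athlete are hypothetical names, and the stub treats any ID above 1000 as missing so that the example runs without network access.

```r
# Rough base-R analogue of purrr::possibly(): wrap a function so that an
# error returns a fallback value instead of aborting.
possibly_base <- function(f, otherwise = NULL) {
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

# Hypothetical scraper stub: IDs above 1000 are assumed not to exist.
fake_get_athlete <- function(id) {
  if (id > 1000) stop("no such athlete: ", id)
  data.frame(athlete_id = id)
}

test_index <- c(101, 2, 9191919, 3, 8888888, 5, 6, 100, 1, 1)
results <- lapply(test_index, possibly_base(fake_get_athlete))

# Failed IDs came back as NULL; drop them before row-binding.
results <- Filter(Negate(is.null), results)
df <- do.call(rbind, results)
nrow(df)
```

Ten IDs go in and two fail, leaving eight rows, the same count as the tibble above. ID 1 still appears twice because it is listed twice in test_index, not because of the error handling.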
