How to continue scraping web data after receiving an error within a loop in R?
I am trying to scrape data from a website where I need to loop through multiple URLs. However, some URLs give me an error, which is fine, but I need to skip to the next one. I have tried using an if statement inside the loop to identify the error, but I then get a duplicate row in my final dataframe. Any ideas on the best way to go about this? I have read about tryCatch, but I am confused about how to apply it to my specific problem.
Thanks
Here is the code:
library(xml2)
library(rvest)
library(tibble)
library(httr)
library(stringr)
maindf = data.frame(matrix(ncol = 6, nrow = 0))
colnames(maindf) <- c("Name", "DOB", "Country", "Weight", "Height", "ID")
for(i in 1:100){
  base_url <- paste0("https://worldrowing.com/athlete/",i,"")
  r = GET(base_url)
  status = status_code(r)
  if(status != 500){
    base_webpage <- read_html(base_url)
    css_selector <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > header"
    css_selector2 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__dob"
    css_selector3 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__location"
    css_selector4 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__weight"
    css_selector5 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__height"
    athleteName = base_webpage %>% html_element(css = css_selector) %>% html_text()
    athleteName = str_trim(athleteName)
    athleteDOB = base_webpage %>% html_element(css = css_selector2) %>% html_text()
    athleteDOB = str_trim(athleteDOB)
    athleteCountry = base_webpage %>% html_element(css = css_selector3) %>% html_text()
    athleteCountry = str_trim(athleteCountry)
    athleteWeight = base_webpage %>% html_element(css = css_selector4) %>% html_text()
    athleteWeight = str_trim(athleteWeight)
    athleteWeight = str_remove_all(athleteWeight, "\\D+")
    athleteHeight = base_webpage %>% html_element(css = css_selector5) %>% html_text()
    athleteHeight = str_trim(athleteHeight)
    athleteHeight = str_remove_all(athleteHeight, "\\D+")
    athleteID = str_remove_all(base_url, "\\D+")
    #create dataframe
    df <- list(col1 = athleteName, col2 = athleteDOB, col3 = athleteCountry, col4 = athleteWeight, col5 = athleteHeight, col6 = athleteID)
    tempdf <- as.data.frame(df)
    maindf = rbind(maindf,tempdf)
  } else {
    base_url <- paste0("https://worldrowing.com/athlete/",i+1,"")
    base_webpage <- read_html(base_url)
    css_selector <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > header"
    css_selector2 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__dob"
    css_selector3 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__location"
    css_selector4 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__weight"
    css_selector5 <- "#app > div.athlete.bg-dark10 > div.athlete__info.my-2.max-content-width.px-1 > div.athlete__biography.athlete-box.bg-white.mb-1 > div > section > ul > li.biography__height"
    athleteName = base_webpage %>% html_element(css = css_selector) %>% html_text()
    athleteName = str_trim(athleteName)
    athleteDOB = base_webpage %>% html_element(css = css_selector2) %>% html_text()
    athleteDOB = str_trim(athleteDOB)
    athleteCountry = base_webpage %>% html_element(css = css_selector3) %>% html_text()
    athleteCountry = str_trim(athleteCountry)
    athleteWeight = base_webpage %>% html_element(css = css_selector4) %>% html_text()
    athleteWeight = str_trim(athleteWeight)
    athleteWeight = str_remove_all(athleteWeight, "\\D+")
    athleteHeight = base_webpage %>% html_element(css = css_selector5) %>% html_text()
    athleteHeight = str_trim(athleteHeight)
    athleteHeight = str_remove_all(athleteHeight, "\\D+")
    athleteID = str_remove_all(base_url, "\\D+")
    #create dataframe
    df <- list(col1 = athleteName, col2 = athleteDOB, col3 = athleteCountry, col4 = athleteWeight, col5 = athleteHeight, col6 = athleteID)
    tempdf <- as.data.frame(df)
    maindf = rbind(maindf,tempdf)
  }
}
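A note on the duplicate rows: when the status is 500, the else branch scrapes ID i + 1, and the next iteration then scrapes i + 1 again, so that athlete is appended twice. A minimal sketch of the alternative, skipping the failed ID with next, is below. To keep it runnable offline, fake_status() and fake_scrape() are hypothetical stand-ins for status_code(GET(url)) and the read_html() scraping code; they are assumptions and not part of the code above.

```r
# Hypothetical stand-ins so the sketch runs without network access.
fake_status <- function(id) if (id %in% c(2, 4)) 500L else 200L
fake_scrape <- function(id) data.frame(Name = paste("Athlete", id), ID = id)

rows <- list()
for (i in 1:5) {
  if (fake_status(i) == 500L) {
    next  # skip this ID entirely; do not scrape i + 1 here
  }
  # Collect one-row data frames in a list instead of growing maindf each time.
  rows[[length(rows) + 1]] <- fake_scrape(i)
}
maindf <- do.call(rbind, rows)
maindf$ID  # IDs 1, 3 and 5, each scraped exactly once
```

With the real site, the same shape would apply: check the status, call next on failure, and only scrape the current ID otherwise.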
Here is an approach using purrr's way of looping, which you may find more readable and easier to modify. Wrapping the scraping function in possibly() skips IDs that do not exist, as shown with 9191919 and 8888888 in test_index.
library(tidyverse)
library(rvest)

# Scrape one athlete page and return a one-row tibble.
get_athlete <- function(id) {
  cat("Scraping ID", id, "\n")
  page <- paste0("https://worldrowing.com/athlete/", id) %>%
    read_html()
  tibble(
    name = page %>%
      html_element(".tf-allcaps.tf-bold") %>%
      html_text2(),
    dob = page %>%
      html_element(".biography__dob") %>%
      html_text2() %>%
      as.Date("%d/%m/%Y"),
    country = page %>%
      html_element(".biography__location") %>%
      html_text2(),
    weight = page %>%
      html_element(".biography__weight") %>%
      html_text2(),
    height = page %>%
      html_element(".biography__height") %>%
      html_text2(),
    athlete_id = id
  )
}

test_index <- c(101, 2, 9191919, 3, 8888888, 5, 6, 100, 1, 1)
# possibly() returns NULL for IDs that raise an error, so they add no rows.
df <- map_df(test_index, possibly(get_athlete, quiet = TRUE,
                                  otherwise = NULL))
# A tibble: 8 x 6
  name              dob        country       weight height athlete_id
  <chr>             <date>     <chr>         <chr>  <chr>       <dbl>
1 Constanze Ahrendt 1973-01-09 Germany       59kg   175cm         101
2 Jan Aakernes      1899-12-30 Norway        NA     NA              2
3 Ville Aaltonen    1974-11-18 Finland       NA     NA              3
4 Robert Williams   1959-04-05 Great Britain NA     NA              5
5 Soeren Aasmul     1899-12-30 Denmark       NA     NA              6
6 Bernd Ahrendt     1899-12-30 Germany       NA     NA            100
7 Jesper A. Aagard  1985-09-07 Denmark       47kg   160cm           1
8 Jesper A. Aagard  1985-09-07 Denmark       47kg   160cm           1
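
Since the question asks about tryCatch, the same skip-on-error behaviour can also be sketched in base R. This is only a minimal sketch under stated assumptions, not the code used above: fetch_one() is a hypothetical stand-in for get_athlete() that raises an error for some IDs so the example runs offline, and safe_fetch() and df_base are names introduced here for illustration.

```r
# Hypothetical stand-in for get_athlete(): errors for IDs that do not exist.
fetch_one <- function(id) {
  if (id > 1000) stop("HTTP error 500 for ID ", id)
  data.frame(name = paste("Athlete", id), athlete_id = id)
}

# Wrap the call so an error yields NULL instead of aborting the loop.
safe_fetch <- function(id) {
  tryCatch(
    fetch_one(id),
    error = function(e) {
      message("Skipping ID ", id, ": ", conditionMessage(e))
      NULL
    }
  )
}

test_index <- c(101, 2, 9191919, 3)
results <- lapply(test_index, safe_fetch)
# Drop the NULLs left by failed IDs, then stack the remaining rows.
results <- Filter(Negate(is.null), results)
df_base <- do.call(rbind, results)
```

The error handler plays the role of otherwise = NULL in possibly(): the failed ID contributes nothing, and the loop moves on to the next one.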