Scraping with nested for loops exhibits weird behavior in R
I want to scrape hockey data for the 2000-2001, 2001-2002, and 2002-2003 seasons. Each season's table is spread across several pages. This is my scraping function (ushl_scrape):
library(rvest)  # read_html(), html_node(), html_table()
library(dplyr)  # %>%, filter(), mutate(), bind_rows()

ushl_scrape <- function(season, page) {
  # Set url of webpage
  custom_url <- paste0("https://www.eliteprospects.com/league/ushl/stats/", season, "?sort=ppg&page=", page)
  # Scrape
  url <- read_html(custom_url)
  ushl <- url %>%
    html_node(xpath = "/html/body/section[2]/div/div[1]/div[4]/div[3]/div[1]/div/div[4]/table") %>%
    html_table() %>%
    filter(Player != "") %>%
    mutate(season = season)
  # Return table
  ushl
}
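I then use this loop to run ushl_scrape for the 3 different seasons. To explain the for loop: because I don't know how many pages each season's data is spread over, I scrape pages 1:10, and as soon as I hit a page with 0 rows I move on to the next year.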
# Total years
total_years <- paste0(2000:2002, "-", 2001:2003)
# Page
page_num <- c(1:10)
final_list <- vector("list", length = length(total_years))
by_year <- vector("list")
for (ii in seq_along(total_years)) {
  # Sleep for 2 seconds to not bombard server
  Sys.sleep(2)
  for (jj in seq_along(page_num)) {
    Sys.sleep(2)
    # Scrape season[ii] and page_num[jj]
    scraped_table <- ushl_scrape(season = total_years[ii], page = page_num[jj])
    # If scraped table has no rows, exit for loop!
    if (nrow(scraped_table) == 0) {
      break
    } else {
      by_year[[jj]] <- scraped_table
    }
  }
  # Store final_df inside final_list
  final_df <- bind_rows(by_year)
  final_list[[ii]] <- final_df
}
# Finally, bind rows all the elements in list
scraped_df <- bind_rows(final_list)
In scraped_df I see data for all three seasons, but at the very end I see repeated 2001-2002 season data tacked on...
Yes, some rows are duplicated. Running the code as-is produces 46 duplicated rows.
sum(duplicated(scraped_df))
#[1] 46
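The problem is that you have to initialize by_year for every total_year inside your outer for loop. Since you don't, the by_year values from the previous iteration are never cleared. The inner loop only overwrites the slots for pages it actually scrapes, so whenever a season has fewer pages than the one before it, the higher-numbered slots still hold the previous season's tables and get bound in again. That is consistent with the repeated 2001-2002 rows showing up at the end.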
for (ii in seq_along(total_years)) {
  # Sleep for 2 seconds to not bombard server
  Sys.sleep(2)
  by_year <- vector("list") # <- Added this line
  for (jj in seq_along(page_num)) {
    Sys.sleep(2)
    # Scrape season[ii] and page_num[jj]
    scraped_table <- ushl_scrape(season = total_years[ii], page = page_num[jj])
    # If scraped table has no rows, exit for loop!
    if (nrow(scraped_table) == 0) {
      break
    } else {
      by_year[[jj]] <- scraped_table
    }
  }
  # Store final_df inside final_list
  final_df <- bind_rows(by_year)
  final_list[[ii]] <- final_df
}
scraped_df <- bind_rows(final_list)
We can now check for duplicated rows:
sum(duplicated(scraped_df))
#[1] 0
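One way to make this kind of leftover state impossible, offered only as a sketch, is to move the per-season loop into its own function so the page list is a local variable that starts empty on every call. The names scrape_season, scrape_fn, max_pages, delay, demo_pages, and demo_scrape are assumptions made for illustration. The demo uses an offline stand-in scraper and base R's rbind; with the real code you would pass ushl_scrape as scrape_fn, and bind_rows should work in place of do.call(rbind, ...).

```r
# Collect every non-empty page of one season.
# `scrape_fn` is any function(season, page) that returns a data frame.
scrape_season <- function(season, scrape_fn, max_pages = 10, delay = 0) {
  pages <- list()  # local to this call, so it is empty for every season
  for (page in seq_len(max_pages)) {
    Sys.sleep(delay)  # e.g. delay = 2 when hitting a real server
    tbl <- scrape_fn(season, page)
    if (nrow(tbl) == 0) break  # first empty page: this season is done
    pages[[length(pages) + 1]] <- tbl
  }
  do.call(rbind, pages)
}

# Offline stand-in scraper (hypothetical page counts per season)
demo_pages <- c("2000-2001" = 2, "2001-2002" = 3, "2002-2003" = 1)
demo_scrape <- function(season, page) {
  if (page > demo_pages[[season]]) {
    return(data.frame(Player = character(0), season = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(Player = paste("player", season, page), season = season,
             stringsAsFactors = FALSE)
}

demo_df <- do.call(
  rbind,
  lapply(names(demo_pages), scrape_season, scrape_fn = demo_scrape)
)

nrow(demo_df)             # 6 (2 + 3 + 1 pages, one row each)
sum(duplicated(demo_df))  # 0
```

With the real scraper the call would look like lapply(total_years, scrape_season, scrape_fn = ushl_scrape, delay = 2), which keeps the 2-second pause between requests.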