Scraping with nested for loops shows strange behavior in R

Question

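I want to scrape hockey data for the 2000-2001, 2001-2002, and 2002-2003 seasons. Within each season, the table is spread across multiple pages. Here is my scraping function (ushl_scrape):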

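library(rvest)  # read_html(), html_node(), html_table()
library(dplyr)  # %>%, filter(), mutate(), bind_rows()

ushl_scrape <- function(season, page) {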

  # Set url of webpage
  custom_url <- paste0("https://www.eliteprospects.com/league/ushl/stats/", season, "?sort=ppg&page=", page)

  # Scrape
  url <- read_html(custom_url)

  ushl <- url %>% 
    html_node(xpath = "/html/body/section[2]/div/div[1]/div[4]/div[3]/div[1]/div/div[4]/table") %>% 
    html_table() %>% 
    filter(Player != "") %>% 
    mutate(season = season)

  # Return table
  ushl
}

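Then I use this loop to run ushl_scrape for the 3 seasons. To explain the for loop: because I don't know how many pages each season's data is spread over, I request pages 1 through 10, and as soon as I hit a page with 0 rows I move on to the next season.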

# Total years
total_years <- paste0(2000:2002, "-", 2001:2003)

# Page
page_num <- c(1:10)

final_list <- vector("list", length = length(total_years))
by_year <- vector("list")


for (ii in seq_along(total_years)) {

  # Sleep for 2 seconds to not bombard server
  Sys.sleep(2)

  for (jj in seq_along(page_num)) {

    Sys.sleep(2)

    # Scrape season[ii] and page_num[jj]
    scraped_table <- ushl_scrape(season = total_years[ii], page = page_num[jj])

    # If scraped table has no rows, exit for loop!
    if (nrow(scraped_table) == 0) {
      break
    } else{
      by_year[[jj]] <- scraped_table
    }
  }

  # Store final_df inside final_list
  final_df <- bind_rows(by_year)
  final_list[[ii]] <- final_df

}

# Finally, bind rows all the elements in list
scraped_df <- bind_rows(final_list)

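In scraped_df I see the data for all three seasons, but at the end I also see extra, duplicated rows from the 2001-2002 season...

Why does my for loop append 2001-2002 data at the end?
How can I fix it?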

Answer 1

Yes, some rows are duplicated. Running the code as-is produces 46 duplicated rows.

sum(duplicated(scraped_df))
#[1] 46

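The problem is that you have to re-initialize by_year for each element of total_years inside your outer for loop. Since you don't, the by_year values from the previous iteration are never cleared. Each season overwrites only by_year[[1]], by_year[[2]], and so on up to its own last page, so if the previous season had more pages, its extra pages stay in the list and get bound in again. That is what causes the duplicates.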
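The carry-over can be reproduced offline. The following is a minimal sketch, not the original scraper: fake_scrape and pages_per_season are hypothetical stand-ins with made-up page counts, and base R's rbind is used in place of bind_rows so that no packages are needed.

```r
# Hypothetical page counts per season (made up for this demo)
pages_per_season <- c("2000-2001" = 2, "2001-2002" = 3, "2002-2003" = 2)

# Hypothetical stand-in for ushl_scrape(): one row per page, zero rows past the end
fake_scrape <- function(season, page) {
  if (page > pages_per_season[[season]]) {
    return(data.frame(Player = character(0), season = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(Player = paste0("player_", season, "_p", page), season = season,
             stringsAsFactors = FALSE)
}

# Same nested-loop shape as the question; the flag toggles the per-season reset
collect <- function(reset_each_season) {
  seasons <- names(pages_per_season)
  by_year <- vector("list")
  final_list <- vector("list", length(seasons))
  for (ii in seq_along(seasons)) {
    if (reset_each_season) by_year <- vector("list")
    for (jj in 1:10) {
      tbl <- fake_scrape(seasons[ii], jj)
      if (nrow(tbl) == 0) break
      by_year[[jj]] <- tbl
    }
    final_list[[ii]] <- do.call(rbind, by_year)
  }
  do.call(rbind, final_list)
}

buggy <- collect(reset_each_season = FALSE)
fixed <- collect(reset_each_season = TRUE)

sum(duplicated(buggy))  # 1: page 3 of 2001-2002 is bound again under 2002-2003
sum(duplicated(fixed))  # 0
```

In the original scraping loop, the same reset goes at the top of the outer loop: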

for (ii in seq_along(total_years)) {

  # Sleep for 2 seconds to not bombard server
  Sys.sleep(2)
  by_year <- vector("list") # <- Added this line

  for (jj in seq_along(page_num)) {

    Sys.sleep(2)

    # Scrape season[ii] and page_num[jj]
    scraped_table <- ushl_scrape(season = total_years[ii], page = page_num[jj])

    # If scraped table has no rows, exit for loop!
    if (nrow(scraped_table) == 0) {
      break
    } else {
      by_year[[jj]] <- scraped_table
    }
  }

  # Store final_df inside final_list
  final_df <- bind_rows(by_year)
  final_list[[ii]] <- final_df

}

scraped_df <- bind_rows(final_list)

We can now check for duplicated rows again:

sum(duplicated(scraped_df))
#[1] 0
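One way to make this kind of carry-over impossible is to keep the per-season list local to a function. The sketch below is only an illustration of that idea, not part of the original answer: scrape_season, its scrape_page, max_pages, and pause parameters, and the offline demo_scrape stand-in are all hypothetical names, and base R's rbind is used so the example runs without any packages.

```r
# Collect every page of one season.
# `scrape_page` (hypothetical parameter) is any function(season, page) that
# returns a data frame, for example ushl_scrape.
scrape_season <- function(season, scrape_page, max_pages = 10, pause = 0) {
  pages <- list()  # local to this call, so nothing leaks between seasons
  for (page in seq_len(max_pages)) {
    Sys.sleep(pause)
    tbl <- scrape_page(season, page)
    if (nrow(tbl) == 0) break          # first empty page ends the season
    pages[[length(pages) + 1]] <- tbl  # append rather than index by page
  }
  do.call(rbind, pages)
}

# Offline stand-in for the real scraper, with made-up page counts
demo_pages <- c("2000-2001" = 2, "2001-2002" = 3, "2002-2003" = 2)
demo_scrape <- function(season, page) {
  n <- if (page <= demo_pages[[season]]) 1 else 0
  data.frame(Player = rep(paste0(season, "_p", page), n),
             season = rep(season, n),
             stringsAsFactors = FALSE)
}

all_seasons <- lapply(names(demo_pages), scrape_season, scrape_page = demo_scrape)
scraped_demo <- do.call(rbind, all_seasons)

sum(duplicated(scraped_demo))  # 0
```

With the real scraper, the call would presumably become lapply(total_years, scrape_season, scrape_page = ushl_scrape, pause = 2), followed by bind_rows() on the resulting list.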

Answer 1: accepted, score 0, 2019-04-18 10:38:11
