[英]More efficient way to extract and subtract rows R in different dataframes
我正在將這個籃球比賽數據與大約50,000行的數據框游戲一起使用。 我正在嘗試比較每場比賽中每支球隊(A和B)的統計數據。
我還有另一個稱為teamStats的數據框,該數據框每個季節都有大約3000行,每個團隊都有。
到目前為止,我已經匯編了以下代碼:
for (i in 1:nrow(games)) {
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i]))) == 1) {
selectTeamA <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i])),4:45]
} else {
selectTeamA <- as.numeric(rep(NA, ncol(differences)))
}
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i]))) == 1) {
selectTeamB <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i])),4:45]
} else {
selectTeamB <- as.numeric(rep(NA, ncol(differences)))
}
differences[i,] <- selectTeamA - selectTeamB
}
基本上,此代碼在設置正確的賽季之后,為每個A和B團隊搜索正確的teamID。 由於每個賽季的每支球隊都沒有出現在teamstats中,因此我現在用NA填補了缺失的行。 “差異”數據框是一個空的數據框,該填充框將填充我和A隊在for循環中的狀態差異。
讓您對數據有所了解:
游戲-前6行
Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100
108126 2010 1107 1393 B 43 75
108127 2010 1143 1178 A 95 61
teamStats-前6行,僅前6列用於空間-整個數據幀中的許多列具有不同的統計信息。 代碼為teamID找到正確的行,然后減去stat列,例如GWL等
School Year teamID G W L
1 abilene christian 2018 1101 32 16 16
2 air force 2018 1102 31 12 19
3 akron 2018 1103 32 14 18
4 alabama a&m 2018 1105 31 3 28
5 alabama-birmingham 2018 1412 33 20 13
而關閉這個很長的帖子,我的問題。 我的for循環代碼可以正常工作並填充差異數據幀。 問題是運行此代碼需要20到30分鍾。 我對處理這么多數據不是很有經驗。 有我不知道的技術嗎? 如何以更有效的方式重寫此代碼?
一種方法是合並games
和teamStats
,以替代teamStats
迭代。
一些代碼可以復制您的設置,以創建一個最小的工作示例:
library(dplyr)
library(purrr)
set.seed(123)
n_games <- 50000
n_teams <- 400
n_years <- 10
games <- data.frame(Season = rep(2005:(2005 + n_years - 1),
each = n_games / n_years)) %>%
mutate(teamA = sample(1000:(1000 + n_teams - 1), n_games, r = TRUE),
teamB = map_int(teamA, ~sample(setdiff(1000:(1000 + n_teams - 1), .), 1)),
scoreA = as.integer(rnorm(n_games, 80, 20)),
scoreB = as.integer(rnorm(n_games, 80, 20)),
scoreB = ifelse(scoreA == scoreB, scoreA + sample(c(-1, 1), n_games, r = TRUE), scoreB),
winner = ifelse(scoreA > scoreB, "A", "B"))
gen_random_string <- function(...) {
paste(sample(c(letters, " "), rpois(1, 10), r = TRUE), collapse = "")
}
schools_ids <- data.frame(teamID = 1000:(1000 + n_teams - 1)) %>%
mutate(School = map_chr(teamID, gen_random_string))
teamStats <- data.frame(Year = rep(2005:(2005 + n_years - 1), each = 300)) %>%
mutate(teamID = as.vector(replicate(n_years, sample(schools_ids$teamID, 300))),
G = 32, W = rpois(length(teamID), 16), L = G - W) %>%
left_join(schools_ids)
我們的games
有5萬行,teamStats有3k行。 現在,我們崩潰teamStats
到由tibble Year
和teamID
:
teamStats <- teamStats %>%
group_by(Year, teamID) %>%
nest()
# # A tibble: 3,000 x 3
# Year teamID data
# <int> <int> <list>
# 1 2005 1321 <tibble [1 x 4]>
# 2 2005 1192 <tibble [1 x 4]>
# 3 2005 1074 <tibble [1 x 4]>
# <snip>
做一個小的便利函數來計算差異:
calculate_diff <- function(x, y) {
if (is.null(x) | is.null(y)) {
data.frame(G = NA, W = NA, L = NA)
} else {
x[, 1:3] - y[, 1:3]
}
}
現在,我們(1)加入(或合並) games
與teamStats
,(2)計算使用加入了數據集的差異,以及(3) unnest
(或不崩潰)的數據幀。
start <- Sys.time()
differences <- games %>%
left_join(teamStats, c("Season" = "Year", "teamA" = "teamID")) %>%
rename(teamA_stats = data) %>%
left_join(teamStats, c("Season" = "Year", "teamB" = "teamID")) %>%
rename(teamB_stats = data) %>%
mutate(diff = map2(teamA_stats, teamB_stats, calculate_diff)) %>%
select(Season, teamA, teamB, diff) %>%
unnest()
difftime(Sys.time(), start)
# Time difference of 11.27832 secs
結果
head(differences)
# Season teamA teamB G W L
# 1 2005 1115 1085 NA NA NA
# 2 2005 1315 1177 NA NA NA
# 3 2005 1163 1051 0 -9 9
# 4 2005 1353 1190 0 -4 4
# 5 2005 1376 1286 NA NA NA
# 6 2005 1018 1362 0 -1 1
這是一種使用tidyverse
軟件包的方法,我希望它比OP中的循環解決方案要快得多。 速度(我期望)來自於更多地依賴數據庫聯接操作(例如,基本merge
或dplyr的left_join
)來連接兩個表。
library(tidyverse)
# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
select(row, Season, teamA, teamB) %>%
gather(team_cat, teamID, teamA:teamB) %>%
# Join to the teamStats table to bring in the team's total stats for that year
left_join(teamStats %>% select(-row), # We don't care about this "row"
by = c("teamID", "Season" = "Year")) %>%
# Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
# all the stats into long format so that we can do the reversal on all of them, and
# then spread back out.
gather(stat, value, G:L) %>%
mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
spread(stat, value) %>%
# Get the difference in stats for each row in the original games table.
group_by(row) %>%
summarise_at(vars(G:W), sum)
# Finally, add the output to the original table
output <- games %>%
left_join(stat_differences)
為了測試這一點,我更改了給定的樣本數據,以使兩個表相互關聯:
games <- read.table(header = T, stringsAsFactors = F,
text = "row Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100")
teamStats <- read.table(header = T, stringsAsFactors = F,
text = "row School Year teamID G W L
1 abilene_christian 2010 1143 32 16 16
2 air_force 2010 1293 31 12 19
3 akron 2010 1314 32 14 18
4 alabama_a&m 2010 1198 31 3 28
5 alabama-birmingham 2010 1108 33 20 13
6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
7 made_up_team 2010 1326 60 50 10"
)
然后,我得到以下輸出,這似乎很有意義。 (我剛剛意識到,我應用的聚集/變異/擴展改變了列的順序;如果有時間,我可能會嘗試使用mutate_if來保留順序。)
> output
row Season teamA teamB winner scoreA scoreB G L W
1 108123 2010 1143 1293 A 75 70 1 -3 4
2 108124 2010 1198 1314 B 72 88 -1 10 -11
3 108125 2010 1108 1326 B 60 100 -27 3 -30
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.