Rolling conditional count in R
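I want to create a rolling function that conditionally counts occurrences across two columns in the preceding rows.
For example, I have a dataset like the one below.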
# Generate data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = TRUE))
)
Round Team Venue
1 1 Team 1 Venue B
2 2 Team 1 Venue B
3 3 Team 1 Venue A
4 4 Team 1 Venue A
5 5 Team 1 Venue B
6 1 Team 2 Venue B
7 2 Team 2 Venue B
8 3 Team 2 Venue A
9 4 Team 2 Venue A
10 5 Team 2 Venue A
11 1 Team 3 Venue B
12 2 Team 3 Venue A
13 3 Team 3 Venue B
14 4 Team 3 Venue B
15 5 Team 3 Venue B
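I want a new column that shows, for each row, how many times that row's team has played at that row's venue in the last 3 rounds.
I can do this fairly easily with a for loop.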
window <- 3
for (i in seq_len(nrow(test))) {
  # Rows to search: the current row plus up to `window` rows before it
  # (starts at row 1 when i <= window)
  index <- max(i - window, 1):i
  # Count rows in that range that match the current row's team and venue
  test$VenueCount[i] <- sum(test$Team[i] == test$Team[index] & test$Venue[i] == test$Venue[index])
}
Round Team Venue VenueCount
1 1 Team 1 Venue B 1
2 2 Team 1 Venue B 2
3 3 Team 1 Venue A 1
4 4 Team 1 Venue A 2
5 5 Team 1 Venue B 2
6 1 Team 2 Venue B 1
7 2 Team 2 Venue B 2
8 3 Team 2 Venue A 1
9 4 Team 2 Venue A 2
10 5 Team 2 Venue A 3
11 1 Team 3 Venue B 1
12 2 Team 3 Venue A 1
13 3 Team 3 Venue B 2
14 4 Team 3 Venue B 3
15 5 Team 3 Venue B 3
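However, I would like to avoid the for loop (mainly because my actual dataset is fairly large, about 30,000 rows). I think it should be possible with zoo, dplyr, purrr, or one of the apply functions, but I have not been able to work it out yet.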
Thanks
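Here is an attempt at a data.table solution, in case you are not looking only for a dplyr solution. You can roll with a window of size 4 and then count the occurrences that match the latest row.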
library(data.table)
library(zoo)
setDT(test)
winsize <- 4
test[, .(Round,
         Venue,
         VenueCount = rollapplyr(c(rep("", winsize - 1), Venue), winsize,
                                 function(x) sum(x == last(x)))),
     by = .(Team)]
Result:
# Team Round Venue VenueCount
# 1: Team 1 1 Venue B 1
# 2: Team 1 2 Venue B 2
# 3: Team 1 3 Venue A 1
# 4: Team 1 4 Venue A 2
# 5: Team 1 5 Venue B 2
# 6: Team 2 1 Venue B 1
# 7: Team 2 2 Venue B 2
# 8: Team 2 3 Venue A 1
# 9: Team 2 4 Venue A 2
# 10: Team 2 5 Venue A 3
# 11: Team 3 1 Venue B 1
# 12: Team 3 2 Venue A 1
# 13: Team 3 3 Venue B 2
# 14: Team 3 4 Venue B 3
# 15: Team 3 5 Venue B 3
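For anyone without zoo installed, the same idea (look back over a fixed number of rows within each team and count the values equal to the latest one) can be sketched in base R. The helper name rolling_match_count and the hard-coded vectors below are my own illustration, not part of the answer above; the venues are copied from the question's table for Team 1 and Team 2.

```r
# Hypothetical helper (not from any package): for each position i, count how
# many of the last `width` values, including x[i] itself, are equal to x[i].
rolling_match_count <- function(x, width) {
  vapply(seq_along(x), function(i) {
    start <- max(i - width + 1L, 1L)  # clamp the window at the first row
    sum(x[start:i] == x[i])
  }, integer(1))
}

# Venues for Team 1 and Team 2, copied from the question's table
team  <- rep(c("Team 1", "Team 2"), each = 5)
venue <- c("B", "B", "A", "A", "B",
           "B", "B", "A", "A", "A")

# ave() applies the helper within each team; passing row indices keeps the
# result an integer vector instead of coercing it to character
venue_count <- ave(seq_along(venue), team,
                   FUN = function(idx) rolling_match_count(venue[idx], width = 4))
venue_count
```

This still loops internally through vapply, so it is probably better treated as a readable reference for checking results than as a speed-up.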
I actually worked out an answer using dplyr::mutate together with rollify from the tibbletime package. I am posting it here, but other responses are still welcome!
library(dplyr)
library(tibbletime)
# Create data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = TRUE))
)
Create a custom function with rollify.
last_n_games = 3
count_games <- rollify(function(x) sum(last(x) == x), window = last_n_games)
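Now run the function using mutate. This returns NA for the first 2 rows of each team (i.e. last_n_games - 1). I can then use group_by and row_number to count the occurrences for those first rows.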
test <- test %>%
  group_by(Team) %>%
  mutate(VenueCount = count_games(Venue)) %>%
  group_by(Team, Venue) %>%
  mutate(VenueCount = ifelse(is.na(VenueCount), row_number(Team), VenueCount))
This returns the following:
# A tibble: 15 x 4
# Groups: Team, Venue [6]
Round Team Venue VenueCount
<int> <fct> <fct> <int>
1 1 Team 1 Venue B 1
2 2 Team 1 Venue B 2
3 3 Team 1 Venue A 1
4 4 Team 1 Venue A 2
5 5 Team 1 Venue B 1
6 1 Team 2 Venue B 1
7 2 Team 2 Venue B 2
8 3 Team 2 Venue A 1
9 4 Team 2 Venue A 2
10 5 Team 2 Venue A 3
11 1 Team 3 Venue B 1
12 2 Team 3 Venue A 1
13 3 Team 3 Venue B 2
14 4 Team 3 Venue B 2
15 5 Team 3 Venue B 3
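To make the NA fallback step a bit more concrete: within each (Team, Venue) group, the row number is simply a running count of how often that team has appeared at that venue so far, which is the correct value while fewer than last_n_games rows are available. A minimal base R sketch of that idea follows; the vectors team, venue and rolled are made-up illustrative values, not output from the code above.

```r
team  <- rep(c("Team 1", "Team 2"), each = 3)
venue <- c("B", "B", "A", "B", "A", "A")

# Pretend output of a 3-row rolling count: NA until 3 rows are available
rolled <- c(NA, NA, 1L, NA, NA, 2L)

# Running occurrence number within each team/venue pair
occurrence <- ave(seq_along(venue), team, venue, FUN = seq_along)

# Use the rolling count where it exists, otherwise the running occurrence
venue_count <- ifelse(is.na(rolled), occurrence, rolled)
venue_count
```

Note that the rolling window here covers 3 rows, whereas the loop in the question searches rows i - 3 through i (4 rows), which appears to be why rows 5 and 14 differ from the question's table.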
I like using data.table; it is fast and versatile.
The idea is to join twice, using two lagged copies (Round + 1) and (Round + 2), so here is what I did.
> test1<-test
> test2<-test
> test<-as.data.table(test)
> test1<-as.data.table(test1)
> test2<-as.data.table(test2)
After making the copies and converting the data.frames to data.tables, shift the rounds:
> test1[,Round:=Round+1,]
> test2[,Round:=Round+2,]
With the rounds lagged, join them together like this:
> test2[test1,on=c('Round','Team')][test,on=c('Round','Team')]
Round Team Venue i.Venue i.Venue.1
1: 1 Team 1 NA NA Venue B
2: 2 Team 1 NA Venue B Venue B
3: 3 Team 1 Venue B Venue B Venue A
4: 4 Team 1 Venue B Venue A Venue A
5: 5 Team 1 Venue A Venue A Venue B
6: 1 Team 2 NA NA Venue B
7: 2 Team 2 NA Venue B Venue B
8: 3 Team 2 Venue B Venue B Venue A
9: 4 Team 2 Venue B Venue A Venue A
10: 5 Team 2 Venue A Venue A Venue A
11: 1 Team 3 NA NA Venue B
12: 2 Team 3 NA Venue B Venue A
13: 3 Team 3 Venue B Venue A Venue B
14: 4 Team 3 Venue A Venue B Venue B
15: 5 Team 3 Venue B Venue B Venue B
Because this result contains many NAs, we use the following function from R-Cookbook.com, which was referenced in another answer:
compareNA <- function(v1,v2) {
# This function returns TRUE wherever elements are the same, including NA's,
# and false everywhere else.
same <- (v1 == v2) | (is.na(v1) & is.na(v2))
same[is.na(same)] <- FALSE
return(same)
}
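As a quick illustration of why the helper is needed, here is a small sketch comparing it with plain ==. The definition is repeated (condensed) only so the block runs on its own, and the two input vectors are made-up examples.

```r
# Same logic as compareNA above, repeated so this block is self-contained
compareNA <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  same
}

current  <- c("Venue A", NA, "Venue B", NA)
previous <- c("Venue A", NA, "Venue A", "Venue B")

plain_equal <- current == previous           # NA wherever either side is NA
na_safe     <- compareNA(current, previous)  # always TRUE or FALSE

plain_equal
na_safe
```

Because the result never contains NA, it can be added up directly when counting matches.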
We can then get the final result:
> end <-
test2[test1, on = c('Round', 'Team')][test, on = c('Round',
'Team')][, VenueCount :=
(1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue)), ]
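Explanation: test2 is right-joined with test1 on Round and Team, and that result is right-joined with test on Round and Team, which gives you:
i.Venue.1 is the team's current venue,
i.Venue is the team's venue in the previous round,
Venue is the team's venue two rounds back.
With the logic
(1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue))
you can work out how many times the team has played at that venue in the last 3 rounds.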
> end
Round Team Venue i.Venue i.Venue.1 VenueCount
1: 1 Team 1 NA NA Venue B 1
2: 2 Team 1 NA Venue B Venue B 2
3: 3 Team 1 Venue B Venue B Venue A 1
4: 4 Team 1 Venue B Venue A Venue A 2
5: 5 Team 1 Venue A Venue A Venue B 1
6: 1 Team 2 NA NA Venue B 1
7: 2 Team 2 NA Venue B Venue B 2
8: 3 Team 2 Venue B Venue B Venue A 1
9: 4 Team 2 Venue B Venue A Venue A 2
10: 5 Team 2 Venue A Venue A Venue A 3
11: 1 Team 3 NA NA Venue B 1
12: 2 Team 3 NA Venue B Venue A 1
13: 3 Team 3 Venue B Venue A Venue B 2
14: 4 Team 3 Venue A Venue B Venue B 2
15: 5 Team 3 Venue B Venue B Venue B 3
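The lag-and-compare logic above can also be sketched without any joins, for a single team, in base R. The helpers lag_by and same_as are hypothetical names of my own, and the venue vector is Team 3's column copied from the question; for several teams you would apply this per group (data.table's shift() offers similar lagging).

```r
# Hypothetical helper: shift x down by k positions, padding the start with NA
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

venue <- c("B", "A", "B", "B", "B")  # Team 3's venues from the question
prev1 <- lag_by(venue, 1)            # venue one round back
prev2 <- lag_by(venue, 2)            # venue two rounds back

# TRUE only where the lagged value exists and equals the current venue
same_as <- function(lagged) !is.na(lagged) & lagged == venue

# 1 for the current round, plus one for each matching earlier round
venue_count <- 1L + same_as(prev1) + same_as(prev2)
venue_count
```

This reproduces the Team 3 rows of the table above (1, 1, 2, 2, 3).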
Hope this helps.