
Rolling conditional count in R

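I would like to create a rolling function that conditionally counts occurrences across two columns in the preceding rows.

For example, I have a dataset like the one below.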

# Generate data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)

   Round   Team   Venue
1      1 Team 1 Venue B
2      2 Team 1 Venue B
3      3 Team 1 Venue A
4      4 Team 1 Venue A
5      5 Team 1 Venue B
6      1 Team 2 Venue B
7      2 Team 2 Venue B
8      3 Team 2 Venue A
9      4 Team 2 Venue A
10     5 Team 2 Venue A
11     1 Team 3 Venue B
12     2 Team 3 Venue A
13     3 Team 3 Venue B
14     4 Team 3 Venue B
15     5 Team 3 Venue B

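I want a new column that shows, for each row, how many times the team in that row has played at that row's venue in the past 3 rounds.

I can do this quite easily with a for loop.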

window <- 3

for (i in seq_len(nrow(test))) {
  # Create index to search (if i is less than window, start at 1)
  index <- max(i - window, 1):i

  # Count rows in the index that match both the team and the venue of the current row
  test$VenueCount[i] <- sum(test$Team[i] == test$Team[index] & test$Venue[i] == test$Venue[index])
}

   Round   Team   Venue VenueCount
1      1 Team 1 Venue B          1
2      2 Team 1 Venue B          2
3      3 Team 1 Venue A          1
4      4 Team 1 Venue A          2
5      5 Team 1 Venue B          2
6      1 Team 2 Venue B          1
7      2 Team 2 Venue B          2
8      3 Team 2 Venue A          1
9      4 Team 2 Venue A          2
10     5 Team 2 Venue A          3
11     1 Team 3 Venue B          1
12     2 Team 3 Venue A          1
13     3 Team 3 Venue B          2
14     4 Team 3 Venue B          3
15     5 Team 3 Venue B          3

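However, I would like to avoid the for loop (mainly because my actual dataset is relatively large, about 30,000 rows). I assume it should be possible with zoo, dplyr, purrr or one of the apply functions, but I have not been able to work it out.

Thanks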

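Here is an attempt at a data.table solution, in case you are not looking exclusively for a dplyr one.

You can roll over a window of size 4 and then count the occurrences that match the latest row.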

library(data.table)
library(zoo)
setDT(test)
winsize <- 4
test[, .(Round, 
        Venue, 
        VenueCount=rollapplyr(c(rep("", winsize-1), Venue), winsize, 
            function(x) sum(x==last(x)))), 
    by=.(Team)]

Result:

#       Team Round   Venue VenueCount
#  1: Team 1     1 Venue B          1
#  2: Team 1     2 Venue B          2
#  3: Team 1     3 Venue A          1
#  4: Team 1     4 Venue A          2
#  5: Team 1     5 Venue B          2
#  6: Team 2     1 Venue B          1
#  7: Team 2     2 Venue B          2
#  8: Team 2     3 Venue A          1
#  9: Team 2     4 Venue A          2
# 10: Team 2     5 Venue A          3
# 11: Team 3     1 Venue B          1
# 12: Team 3     2 Venue A          1
# 13: Team 3     3 Venue B          2
# 14: Team 3     4 Venue B          3
# 15: Team 3     5 Venue B          3
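To make the padding trick above easier to follow, here is a minimal base R sketch for a single team. It assumes the venues are already in round order and uses shortened venue labels; the variable names are illustrative only and are not part of the answer's code.

```r
# Venues for one team in round order (Team 1 from the example, labels shortened)
venue <- c("B", "B", "A", "A", "B")
winsize <- 4

# Pad the front with empty strings so that the first rows also get a full window
padded <- c(rep("", winsize - 1), venue)

# For each row, take the window ending at that row and count
# how many entries equal the last (current) entry
counts <- sapply(seq_along(venue), function(i) {
  w <- padded[i:(i + winsize - 1)]
  sum(w == w[winsize])
})

counts
# 1 2 1 2 2
```

The empty strings never match a real venue, so the early rows simply count over the rounds that exist.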

I actually found an answer using rollify from the tibbletime package together with dplyr::mutate. I am posting it here, but other replies are still welcome!

library(dplyr)
library(tibbletime)

# Create data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)

Create a custom function using rollify.

last_n_games = 3
count_games <- rollify(function(x) sum(last(x) == x), window = last_n_games)

Now run the function with mutate. This returns NA for the first 2 rows (i.e. last_n_games - 1). I can then use group_by and row_number to fill in the counts for those first occurrences.

test <- test %>%
  group_by(Team) %>%
  mutate(VenueCount = count_games(Venue)) %>%
  group_by(Team, Venue) %>%
  mutate(VenueCount = ifelse(is.na(VenueCount), row_number(Team), VenueCount))

This returns the following:

# A tibble: 15 x 4
# Groups:   Team, Venue [6]
   Round Team   Venue   VenueCount
   <int> <fct>  <fct>        <int>
 1     1 Team 1 Venue B          1
 2     2 Team 1 Venue B          2
 3     3 Team 1 Venue A          1
 4     4 Team 1 Venue A          2
 5     5 Team 1 Venue B          1
 6     1 Team 2 Venue B          1
 7     2 Team 2 Venue B          2
 8     3 Team 2 Venue A          1
 9     4 Team 2 Venue A          2
10     5 Team 2 Venue A          3
11     1 Team 3 Venue B          1
12     2 Team 3 Venue A          1
13     3 Team 3 Venue B          2
14     4 Team 3 Venue B          2
15     5 Team 3 Venue B          3
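Note that a window of 3 here covers the current round plus the two before it, whereas the loop in the question looks at four rows (the current row plus the three before it), which is why row 5 shows 1 here and 2 there. The following base R sketch, with a hypothetical helper count_last_n that is not part of tibbletime, handles the short windows at the start directly and makes the effect of the window size visible.

```r
# Hypothetical helper: for each position, count how often the current value
# occurs among the last n entries (including the current one).
# Early positions simply use however many entries exist so far.
count_last_n <- function(venue, n) {
  vapply(seq_along(venue), function(i) {
    idx <- max(1, i - n + 1):i
    sum(venue[idx] == venue[i])
  }, integer(1))
}

# Venues for one team in round order (Team 1 from the example, labels shortened)
venue <- c("B", "B", "A", "A", "B")

res3 <- count_last_n(venue, 3)  # 1 2 1 2 1, as in the tibbletime output
res4 <- count_last_n(venue, 4)  # 1 2 1 2 2, as in the question's loop
```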

So, I like to use data.table; it is fast and versatile.

The idea is to join twice, with two lags, (Round + 1) and (Round + 2), so this is what I did.

> test1<-test
> test2<-test
> test<-as.data.table(test)
> test1<-as.data.table(test1)
> test2<-as.data.table(test2)

After making the copies, I converted these data.frames to data.tables.

> test1[,Round:=Round+1,]
> test2[,Round:=Round+2,]

Lag the rounds, then join them together like this:

> test2[test1,on=c('Round','Team')][test,on=c('Round','Team')]
    Round   Team   Venue i.Venue i.Venue.1
 1:     1 Team 1      NA      NA   Venue B
 2:     2 Team 1      NA Venue B   Venue B
 3:     3 Team 1 Venue B Venue B   Venue A
 4:     4 Team 1 Venue B Venue A   Venue A
 5:     5 Team 1 Venue A Venue A   Venue B
 6:     1 Team 2      NA      NA   Venue B
 7:     2 Team 2      NA Venue B   Venue B
 8:     3 Team 2 Venue B Venue B   Venue A
 9:     4 Team 2 Venue B Venue A   Venue A
10:     5 Team 2 Venue A Venue A   Venue A
11:     1 Team 3      NA      NA   Venue B
12:     2 Team 3      NA Venue B   Venue A
13:     3 Team 3 Venue B Venue A   Venue B
14:     4 Team 3 Venue A Venue B   Venue B
15:     5 Team 3 Venue B Venue B   Venue B

Because this result contains many NAs, we use a function from R-Cookbook.com (referenced in another user's answer):

  compareNA <- function(v1,v2) {
    # This function returns TRUE wherever elements are the same, including NA's,
    # and false everywhere else.
    same <- (v1 == v2)  |  (is.na(v1) & is.na(v2))
    same[is.na(same)] <- FALSE
    return(same)
   }

We can then get the final result:

 > end <-
      test2[test1, on = c('Round', 'Team')][test, on = c('Round', 
      'Team')][, VenueCount :=
      (1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue)), ]

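Explanation: test2 is right-joined to test1 on Round and Team, and the result is right-joined to test on Round and Team, which gives you:

i.Venue.1 is the Team's current venue, i.Venue is the Team's previous venue, and Venue is the Team's venue two rounds back.

With the logic

(1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue))

you can work out how many times the team has played at that venue in its last 3 rounds.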

> end
    Round   Team   Venue i.Venue i.Venue.1 VenueCount
 1:     1 Team 1      NA      NA   Venue B          1
 2:     2 Team 1      NA Venue B   Venue B          2
 3:     3 Team 1 Venue B Venue B   Venue A          1
 4:     4 Team 1 Venue B Venue A   Venue A          2
 5:     5 Team 1 Venue A Venue A   Venue B          1
 6:     1 Team 2      NA      NA   Venue B          1
 7:     2 Team 2      NA Venue B   Venue B          2
 8:     3 Team 2 Venue B Venue B   Venue A          1
 9:     4 Team 2 Venue B Venue A   Venue A          2
10:     5 Team 2 Venue A Venue A   Venue A          3
11:     1 Team 3      NA      NA   Venue B          1
12:     2 Team 3      NA Venue B   Venue A          1
13:     3 Team 3 Venue B Venue A   Venue B          2
14:     4 Team 3 Venue A Venue B   Venue B          2
15:     5 Team 3 Venue B Venue B   Venue B          3
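The same lag-and-compare logic can be sketched in base R for a single team, without any joins. This assumes the venues are in round order; the helper same is a hypothetical NA-safe comparison that stands in for compareNA (unlike compareNA, it treats two NAs as not equal, which makes no difference here because the current venue is never NA).

```r
# Venues for one team in round order (Team 3 from the example, labels shortened)
venue <- c("B", "A", "B", "B", "B")

# Venue one round back and two rounds back, padded with NA at the start
lag1 <- c(NA, head(venue, -1))
lag2 <- c(NA, NA, head(venue, -2))

# Hypothetical NA-safe equality: FALSE whenever either side is missing
same <- function(a, b) !is.na(a) & !is.na(b) & a == b

# 1 for the current round, plus 1 for each earlier round at the same venue
venue_count <- 1 + same(venue, lag1) + same(venue, lag2)

venue_count
# 1 1 2 2 3
```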

Hope this helps.

