Identify groups of n consecutive numbers in a data.table field in a group
This data.table shows the months of the year in which each student attended.
library(data.table)
DT = data.table(
Student = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3),
Month = c(1, 2, 3, 5, 6, 7, 8, 11, 12,
2, 3, 4, 5, 7, 8, 9, 10,
1, 2, 3, 5, 6, 7, 8, 9))
DT
Student Month
1: 1 1
2: 1 2
3: 1 3
4: 1 5
5: 1 6
6: 1 7
7: 1 8
8: 1 11
9: 1 12
10: 2 2
11: 2 3
12: 2 4
13: 2 5
14: 2 7
15: 2 8
16: 2 9
17: 2 10
18: 3 1
19: 3 2
20: 3 3
21: 3 5
22: 3 6
23: 3 7
24: 3 8
25: 3 9
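I want to identify periods of three consecutive months (each period is identified by its first month). Here is a visualization of the data table and the qualifying periods.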
   1   2   3   4   5   6   7   8   9   10  11  12
1  *   *   *       *   *   *   *           *   *
   [-------]       [-------]
                       [-------]
2      *   *   *   *       *   *   *   *
       [-------]           [-------]
           [-------]           [-------]
3  *   *   *       *   *   *   *   *
   [-------]       [-------]
                       [-------]
                           [-------]
Desired output:
id First_month_in_the_period
1 1
1 5
1 6
2 2
2 3
2 7
2 8
3 1
3 5
3 6
3 7
I am looking for a data.table (or dplyr) solution.
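Use the standard approach (cumsum...diff...condition) to identify runs of consecutive values, then use the run id together with "Student" as grouping variables. Within each group, create a sequence based on the length of the run and add it to the first month.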
DT[ , .(start = if(.N >= 3) Month[1] + 0:(.N - 3)),
by = .(Student, r = cumsum(c(1L, diff(Month) > 1)))]
# Student r start
# 1: 1 1 1
# 2: 1 2 5
# 3: 1 2 6
# 4: 2 3 2
# 5: 2 3 3
# 6: 2 4 7
# 7: 2 4 8
# 8: 3 4 1
# 9: 3 5 5
# 10: 3 5 6
# 11: 3 5 7
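If the grouping expression looks opaque, here is a minimal base R sketch of the same idea for student 1's months only. The names gap, r and starts are just illustrative and do not appear in the answer above.

```r
# Months attended by student 1 (taken from the question's data)
month <- c(1, 2, 3, 5, 6, 7, 8, 11, 12)

# TRUE wherever a month does not directly follow the previous one
gap <- diff(month) > 1

# The cumulative sum turns those gap marks into a run id per element
r <- cumsum(c(1L, gap))
r
# [1] 1 1 1 2 2 2 2 3 3

# A run of length len >= 3 contains len - 2 periods,
# starting at first month + 0, 1, ..., len - 3
starts <- unname(unlist(lapply(split(month, r), function(m) {
  if (length(m) >= 3) m[1] + 0:(length(m) - 3)
})))
starts
# [1] 1 5 6
```

The last run (11, 12) has only two elements, so it contributes nothing, which is what the if(.N >= 3) guard does in the data.table call.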
An equivalent dplyr alternative:
library(dplyr)
DT %>%
group_by(Student, r = cumsum(c(1L, diff(Month) > 1))) %>%
summarise(list(data.frame(start = if(n() >= 3) Month[1] + 0:(n() - 3)))) %>%
tidyr::unnest()
# # A tibble: 11 x 3
# # Groups: Student [3]
# Student r start
# <dbl> <int> <dbl>
# 1 1 1 1
# 2 1 2 5
# 3 1 2 6
# 4 2 3 2
# 5 2 3 3
# 6 2 4 7
# 7 2 4 8
# 8 3 4 1
# 9 3 5 5
# 10 3 5 6
# 11 3 5 7
A solution using the tidyverse.
library(tidyverse)
library(data.table)
DT2 <- DT %>%
arrange(Student, Month) %>%
group_by(Student) %>%
# Create sequence of 3
mutate(Seq = map(Month, ~seq.int(.x, .x + 2L))) %>%
# Create a flag to show if the sequence match completely with the Month column
mutate(Flag = map_lgl(Seq, ~all(.x %in% Month))) %>%
# Filter the Flag for TRUE
filter(Flag) %>%
# Remove columns
select(-Seq, -Flag) %>%
ungroup()
DT2
# # A tibble: 11 x 2
# Student Month
# <dbl> <dbl>
# 1 1 1
# 2 1 5
# 3 1 6
# 4 2 2
# 5 2 3
# 6 2 7
# 7 2 8
# 8 3 1
# 9 3 5
# 10 3 6
# 11 3 7
Here is a solution that uses the grouping provided by data.table:
seqfun <- function(month) {
n <- length(month)
tmp <- data.table(a=month[1:(n-2)],b=month[2:(n-1)],c=month[3:n])
month[which(apply(tmp,1,function(x){all(c(1,1)==diff(x))}))]}
Result <- DT[,seqfun(Month), by=Student]
names(Result) <- c("Student","Month")
> Result
Student Month
1: 1 1
2: 1 5
3: 1 6
4: 2 2
5: 2 3
6: 2 7
7: 2 8
8: 3 1
9: 3 5
10: 3 6
11: 3 7
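Basically, it takes the group's month vector, builds three shifted vectors to run diff over, and checks that both differences equal 1. If they do, the original month vector is indexed at those positions and the matching months are returned.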
A little more detail. Suppose we have,
month <- c(1,2,3,5,6,7,8,11,12)
We compute the tmp data.table (note: you could also use the rollapply function from zoo to create a similar table, as I show at the very bottom)
a b c
1: 1 2 3
2: 2 3 5
3: 3 5 6
4: 5 6 7
5: 6 7 8
6: 7 8 11
7: 8 11 12
When we take diff across the rows, we get,
> apply(tmp,1,function(x){all(c(1,1)==diff(x))})
[1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE
The TRUE values mark the indexes we are interested in.
As mentioned above, using rollapply from the zoo library we can do,
> apply(c(1,1)==rollapply(month,width=3,FUN=diff),1,all)
[1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE
to get a boolean vector of the indexes we are interested in for a particular student.
Here is a base R solution that defines a function which can then be applied to the data.table:
cons3fun<-function(x,n){
consec.list<-split(x,cumsum(c(1,diff(x)!=1))) #Splits into list based on consecutive numbers
min.len.seq<-consec.list[which(sapply(consec.list,length)>(n-1))] #Selects only the list elements >= to n
seq.start<-lapply(min.len.seq,function(i) i[1:(length(i)-(n-1))]) #Extracts the first number of each sequence of n
return(as.vector(unlist(seq.start))) #Returns result as a vector
}
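Note that this function lets you change the number of consecutive values you are looking for fairly easily. Here you would use n=3. You can then apply the function with either data.table or dplyr. I will use data.table since that is what you are using.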
DT[,cons3fun(Month,3),by=.(Student)]
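To illustrate the point about changing n, here is a small usage sketch on a single vector (student 3's months from the question). The function body is copied from above; the variable names m3, starts3 and starts4 are only for illustration.

```r
# Function copied from the answer above
cons3fun <- function(x, n) {
  consec.list <- split(x, cumsum(c(1, diff(x) != 1)))                    # split into runs of consecutive numbers
  min.len.seq <- consec.list[which(sapply(consec.list, length) > (n - 1))] # keep runs of length >= n
  seq.start <- lapply(min.len.seq, function(i) i[1:(length(i) - (n - 1))]) # first element of each window of n
  return(as.vector(unlist(seq.start)))
}

# Months attended by student 3: the runs are 1-3 and 5-9
m3 <- c(1, 2, 3, 5, 6, 7, 8, 9)

starts3 <- cons3fun(m3, 3)
starts3
# [1] 1 5 6 7

starts4 <- cons3fun(m3, 4)
starts4
# [1] 5 6
```

With n = 4 the first run (1, 2, 3) is too short, so only the windows starting at 5 and 6 remain.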
Hope you find this helpful. Good luck!
Here is my approach using the tidyverse:
> as_tibble(DT) %>%
arrange(Student, Month) %>%
group_by(Student) %>%
# create an identifier for the start of the sequence
mutate(seq_id = ifelse(row_number() == 1 | Month - lag(Month) > 1,
letters[row_number()], NA)) %>%
fill(seq_id) %>%
# add another grouping level (sequence identifier)
group_by(Student, seq_id) %>%
# only keep data with attendance in 3 or more consecutive months
filter(length(seq_id) > 2) %>%
# n consecutive months => n - 2 periods
slice(1:(n() - 2)) %>%
# clean up
ungroup() %>%
select(Student, Month)
# A tibble: 11 x 2
# Student Month
# <dbl> <dbl>
#1 1 1
#2 1 5
#3 1 6
#4 2 2
#5 2 3
#6 2 7
#7 2 8
#8 3 1
#9 3 5
#10 3 6
#11 3 7
Another data.table approach...
#first, calculate the difference between months, by student.
ans <- DT[, diff := shift( Month, type = "lead" ) - Month, by = .(Student)]
#then filter rows that are at the start of 2 consecutive differences of 1
#also, drop the temporary diff-column
ans[ diff == 1 & shift( diff, type = "lead" ) == 1,][, diff := NULL][]
Voilà:
# Student Month
# 1: 1 1
# 2: 1 5
# 3: 1 6
# 4: 2 2
# 5: 2 3
# 6: 2 7
# 7: 2 8
# 8: 3 1
# 9: 3 5
# 10: 3 6
# 11: 3 7
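The same shift idea can probably be stretched to an arbitrary window length. What follows is only a sketch, not part of the answer above: it assumes months are unique within each student, and it sorts the table first. Under those assumptions, a row starts a run of n consecutive months exactly when the month n - 1 rows ahead is larger by n - 1. The names n and res are illustrative.

```r
library(data.table)

# Data from the question
DT <- data.table(
  Student = rep(c(1, 2, 3), times = c(9, 8, 8)),
  Month = c(1, 2, 3, 5, 6, 7, 8, 11, 12,
            2, 3, 4, 5, 7, 8, 9, 10,
            1, 2, 3, 5, 6, 7, 8, 9))

n <- 3L  # window length (assumption: n >= 2)

# Assumption: months are unique per student; sort so that shift() looks ahead in time
setorder(DT, Student, Month)

# Keep months whose (n - 1)-th successor within the student is exactly n - 1 larger;
# which() drops the NA comparisons at the end of each group
res <- DT[, .(Month = Month[which(shift(Month, n - 1L, type = "lead") - Month == n - 1L)]),
          by = Student]
res
```

Unlike the two-step version, this one does not add a temporary column to DT.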