Identify groups of n consecutive numbers in a data.table field in a group

This data.table shows the months of a year in which each student attended.

library(data.table)

DT = data.table(
 Student = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3),
 Month   = c(1, 2, 3, 5, 6, 7, 8, 11, 12,
             2, 3, 4, 5, 7, 8, 9, 10,
             1, 2, 3, 5, 6, 7, 8, 9))

DT
    Student Month
 1:       1     1
 2:       1     2
 3:       1     3
 4:       1     5
 5:       1     6
 6:       1     7
 7:       1     8
 8:       1    11
 9:       1    12
10:       2     2
11:       2     3
12:       2     4
13:       2     5
14:       2     7
15:       2     8
16:       2     9
17:       2    10
18:       3     1
19:       3     2
20:       3     3
21:       3     5
22:       3     6
23:       3     7
24:       3     8
25:       3     9

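I want to identify periods of three consecutive months (each period identified by its first month). Here is a visualization of the data table and the qualifying periods.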

       1   2   3   4   5   6   7   8   9   10  11  12


1      *   *   *       *   *   *   *           *   *
       [-------]       [-------]
                           [-------]                           


2          *   *   *   *       *   *   *   *
           [-------]           [-------]
               [-------]           [-------]


3      *   *   *       *   *   *   *   *      
       [-------]       [-------]
                           [-------]
                               [-------]

Desired output:

id   First_month_in_the_period 

1    1
1    5
1    6
2    2
2    3
2    7
2    8
3    1
3    5
3    6
3    7

Looking for a data.table (or dplyr) solution.

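Use the standard method (cumsum(...diff(...) condition)) to identify runs of consecutive values, then use that run id as a grouping variable together with "Student". Within each group, create a sequence based on the length of the run and add it to the first month.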

DT[ , .(start = if(.N >= 3) Month[1] + 0:(.N - 3)),
    by = .(Student, r = cumsum(c(1L, diff(Month) > 1)))]
#     Student r start
#  1:       1 1     1
#  2:       1 2     5
#  3:       1 2     6
#  4:       2 3     2
#  5:       2 3     3
#  6:       2 4     7
#  7:       2 4     8
#  8:       3 4     1
#  9:       3 5     5
# 10:       3 5     6
# 11:       3 5     7
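
A minimal base R sketch of the run-identification step above, for a single student's months (Student 1 from the question). The names `new_run`, `run_id` and `starts` are only illustrative, not part of the answer's code:

```r
# Months attended by one student (Student 1 in the question's data)
month <- c(1, 2, 3, 5, 6, 7, 8, 11, 12)

# TRUE wherever the gap to the previous month exceeds 1, i.e. a new run starts
new_run <- c(TRUE, diff(month) > 1)

# The cumulative sum turns those flags into a run identifier
run_id <- cumsum(new_run)
run_id
# [1] 1 1 1 2 2 2 2 3 3

# A run of length len >= 3 contributes len - 2 start months; shorter runs give NULL
starts <- unlist(lapply(split(month, run_id), function(run) {
  if (length(run) >= 3) run[1] + 0:(length(run) - 3)
}), use.names = FALSE)
starts
# [1] 1 5 6
```

In the data.table call, grouping by Student as well keeps a run from spilling over from one student into the next.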

An equivalent dplyr alternative:

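library(dplyr)

DT %>% 
  group_by(Student, r = cumsum(c(1L, diff(Month) > 1))) %>%
  # name the list column so that unnest() can be told which column to expand
  summarise(data = list(data.frame(start = if(n() >= 3) Month[1] + 0:(n() - 3)))) %>%
  tidyr::unnest(data)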

# # A tibble: 11 x 3
# # Groups:   Student [3]
#       Student     r start
#         <dbl> <int> <dbl>
#     1       1     1     1
#     2       1     2     5
#     3       1     2     6
#     4       2     3     2
#     5       2     3     3
#     6       2     4     7
#     7       2     4     8
#     8       3     4     1
#     9       3     5     5
#    10       3     5     6
#    11       3     5     7

A solution using the tidyverse.

library(tidyverse)
library(data.table)

DT2 <- DT %>%
  arrange(Student, Month) %>%
  group_by(Student) %>%
  # Create sequence of 3
  mutate(Seq = map(Month, ~seq.int(.x, .x + 2L))) %>%
  # Create a flag to show if the sequence match completely with the Month column 
  mutate(Flag = map_lgl(Seq, ~all(.x %in% Month))) %>%
  # Filter the Flag for TRUE
  filter(Flag) %>%
  # Remove columns
  select(-Seq, -Flag) %>%
  ungroup()

DT2
# # A tibble: 11 x 2
#    Student Month
#      <dbl> <dbl>
#  1       1     1
#  2       1     5
#  3       1     6
#  4       2     2
#  5       2     3
#  6       2     7
#  7       2     8
#  8       3     1
#  9       3     5
# 10       3     6
# 11       3     7
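
The membership test used above can also be sketched in base R for a single student (Student 2 here). This is only an illustration: `n` and `is_start` are names introduced for the example, and the check does not depend on the months being sorted:

```r
# Months attended by Student 2 in the question's data
month <- c(2, 3, 4, 5, 7, 8, 9, 10)
n <- 3  # required number of consecutive months (illustrative parameter)

# A month starts a qualifying period if it and the next n - 1 months are all present
is_start <- vapply(month,
                   function(m) all(seq(m, m + n - 1) %in% month),
                   logical(1))
month[is_start]
# [1] 2 3 7 8
```

Each month is checked against the whole vector, so the cost grows roughly quadratically with the number of months per student, which is negligible for at most 12 months.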

Here is a solution that uses the grouping provided by data.table,

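seqfun <- function(month) {
    n <- length(month)
    # Line up each month with the two months that follow it, one window per row
    tmp <- data.table(a=month[1:(n-2)],b=month[2:(n-1)],c=month[3:n])
    # Keep the months whose window has both differences equal to 1
    month[which(apply(tmp,1,function(x){all(c(1,1)==diff(x))}))]
}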

Result <- DT[,seqfun(Month), by=Student]
names(Result) <- c("Student","Month")
> Result
    Student Month
 1:       1     1
 2:       1     5
 3:       1     6
 4:       2     2
 5:       2     3
 6:       2     7
 7:       2     8
 8:       3     1
 9:       3     5
10:       3     6
11:       3     7

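Basically, it takes the month vector of each group, builds three shifted vectors so the differences can be compared, and checks whether both differences equal 1. Where they do, it returns the months of the original vector at those indices.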

A bit more detail. Suppose we have,

month <- c(1,2,3,5,6,7,8,11,12)

We compute the tmp data.table (note: you could also use the rollapply function from zoo to build a similar table, as shown at the bottom)

   a  b  c
1: 1  2  3
2: 2  3  5
3: 3  5  6
4: 5  6  7
5: 6  7  8
6: 7  8 11
7: 8 11 12

When we take diff across each row, we get,

> apply(tmp,1,function(x){all(c(1,1)==diff(x))})
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

The TRUE values mark the indices we are interested in.

As mentioned above, using rollapply from the zoo library we can run,

> apply(c(1,1)==rollapply(month,width=3,FUN=diff),1,all)
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

to get the boolean vector of the indices we are interested in for a particular student.
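
For reference, a similar sliding-window check can be sketched with base R's `embed()`, generalised to any window length. The function name `window_starts` and its `n` argument are hypothetical, and the length guard covers students with fewer than `n` months, a case the helper above does not handle explicitly:

```r
# Sketch: return the first month of every window of n consecutive months.
# Assumes `month` is sorted in increasing order.
window_starts <- function(month, n = 3) {
  # Too few months to form a single window: return an empty vector
  if (length(month) < n) return(month[0])
  # embed() lists each window newest-first, so reverse the columns
  w <- embed(month, n)[, n:1, drop = FALSE]
  # A window qualifies when every step inside it is exactly 1
  ok <- apply(w, 1, function(x) all(diff(x) == 1))
  month[which(ok)]
}

window_starts(c(1, 2, 3, 5, 6, 7, 8, 11, 12))
# [1] 1 5 6
window_starts(c(5, 6, 7, 8, 9), n = 4)
# [1] 5 6
```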

Here is a base R solution that creates a function which can be applied to a data.table

cons3fun<-function(x,n){

              consec.list<-split(x,cumsum(c(1,diff(x)!=1))) #Splits into list based on consecutive numbers

              min.len.seq<-consec.list[which(sapply(consec.list,length)>(n-1))] #Selects only the runs of length >= n

              seq.start<-lapply(min.len.seq,function(i) i[1:(length(i)-(n-1))]) #Extracts the first number of each sequence of n

              return(as.vector(unlist(seq.start))) #Returns result as a vector
}

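Note that this function makes it fairly easy to change how many consecutive numbers you are looking for; here you would use n=3. You can then apply the function with data.table or dplyr. I will use data.table since that is what you used.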

DT[,cons3fun(Month,3),by=.(Student)]

Hope you find this helpful. Good luck!

Here is my approach using the tidyverse:

> as_tibble(DT) %>%
      arrange(Student, Month) %>%
      group_by(Student) %>%
      # create an identifier for the start of the sequence
      mutate(seq_id = ifelse(row_number() == 1 | Month - lag(Month) > 1,
                             letters[row_number()], NA)) %>%
      fill(seq_id) %>%
      # add another grouping level (sequence identifier)
      group_by(Student, seq_id) %>%
      # only keep data with attendance in 3 or more consecutive months 
      filter(length(seq_id) > 2) %>%
      # n consecutive months => n - 2 periods
      slice(1:(n() - 2)) %>%
      # clean up
      ungroup() %>%
      select(Student, Month)
# A tibble: 11 x 2
#   Student Month
#    <dbl> <dbl>
#1       1     1
#2       1     5
#3       1     6
#4       2     2
#5       2     3
#6       2     7
#7       2     8
#8       3     1
#9       3     5
#10      3     6
#11      3     7

Another data.table approach...

#first, calculate the difference between months, by student.
ans <- DT[, diff := shift( Month, type = "lead" ) - Month, by = .(Student)]
#then filter rows that are at the start of 2 consecutive differences of 1
#also, drop the temporary diff-column
ans[ diff == 1 & shift( diff, type = "lead" ) == 1,][, diff := NULL][]

#    Student Month
# 1:       1     1
# 2:       1     5
# 3:       1     6
# 4:       2     2
# 5:       2     3
# 6:       2     7
# 7:       2     8
# 8:       3     1
# 9:       3     5
# 10:      3     6
# 11:      3     7
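
A possible variation on the same idea, sketched under the assumption that Month is sorted and unique within each Student: the month n - 1 rows ahead equals Month + n - 1 exactly when the n rows form a consecutive run. The names `n` and `res` are introduced here for illustration. Unlike the `:=` version above, this sketch does not add a temporary column to DT:

```r
library(data.table)

DT <- data.table(
  Student = rep(1:3, times = c(9, 8, 8)),
  Month   = c(1, 2, 3, 5, 6, 7, 8, 11, 12,
              2, 3, 4, 5, 7, 8, 9, 10,
              1, 2, 3, 5, 6, 7, 8, 9))

n <- 3L  # required number of consecutive months (illustrative parameter)

# Compare each month with the month n - 1 rows later, within each student;
# which() drops the NA comparisons at the end of each group
res <- DT[, .(Month = Month[which(shift(Month, n - 1L, type = "lead") - Month == n - 1L)]),
          by = Student]
res
```

With the question's data this should give the same 11 rows as the outputs above.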
