Using dplyr to label and count gaps between values

I have this data frame:

    df <- structure(list(
      Name = c("sub1", "sub1", "sub1", "sub1", "sub1", "sub1",
               "sub1", "sub1", "sub1", "sub1", "sub1", "sub1",
               "sub1", "sub1", "sub1", "sub1", "sub1", "sub1",
               "sub1", "sub1", "sub1", "sub1", "sub1",
               "sub2", "sub2", "sub2", "sub2", "sub2", "sub2"),
      StimulusName = c("Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11",
                       "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11",
                       "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11",
                       "Alpha12", "Alpha12", "Alpha12", "Alpha12", "Alpha12",
                       "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11"),
      FixationSeq = c(2L, 2L, 2L, 2L, NA, NA, NA, NA,
                      3L, 3L, 3L, 3L, 3L, NA, NA, NA, NA, NA,
                      1L, NA, NA, 2L, NA,
                      NA, NA, NA, NA, 2L, 2L)),
      row.names = c(NA, -29L),
      class = c("tbl_df", "tbl", "data.frame"),
      spec = structure(list(
        cols = list(
          Name = structure(list(), class = c("collector_character", "collector")),
          StimulusName = structure(list(), class = c("collector_character", "collector")),
          FixationSeq = structure(list(), class = c("collector_integer", "collector"))),
        default = structure(list(), class = c("collector_guess", "collector"))),
        class = "col_spec"))

The column FixationSeq contains fixation numbers (in my example, 2 and 3 for Name = sub1 and StimulusName = Alpha11). Between these numbers there are segments filled with NA. There is also a segment filled with NA after 3.

I would like to create a new column SaccadeCount that gives every NA segment (as a whole, i.e. potentially multiple rows) an incrementing numerical label, written to the rows that belong to that segment.

Additionally, I'd like another column called SaccadeDuration that holds the number of rows in each NA segment. So in the example df, the rows of the NA segment between 2 and 3 would be populated with the length of that segment, i.e. the total number of NA rows between 2 and 3.

I would like to accomplish this using dplyr and group the operation by the columns Name and StimulusName.

An output might look something like this:

    Name    StimulusName    FixationSeq    SaccadeCount    SaccadeDuration
    sub1    Alpha11         2
    sub1    Alpha11         2
    sub1    Alpha11         2
    sub1    Alpha11         2
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha12         1
    sub1    Alpha12         NA             1               2
    sub1    Alpha12         NA             1               2
    sub1    Alpha12         2
    sub1    Alpha12         NA             2               1
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         2
    sub2    Alpha11         2

Thank you very much for your time and help.

Using data.table

code:

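library(data.table)

# Label and measure the NA segments of FixationSeq within one group
fun1 <- function(x) {
    na.ind  = is.na(x$FixationSeq)              # TRUE where a fixation is missing
    na.vals = rleidv(rleidv(na.ind)[na.ind])    # renumber the NA runs 1, 2, ...
    x$SaccadeCount = NA_integer_                # typed NA keeps the column integer in every group
    x$SaccadeCount[na.ind] = na.vals

    na.rle = rle(na.vals)                       # length of each NA run
    x$SaccadeDuration = NA_integer_
    x$SaccadeDuration[na.ind] = rep(na.rle$lengths, na.rle$lengths)

    return(x)
}

setDT(df)[, fun1(.SD), by = .(Name, StimulusName)]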

You can also use fun1 in a dplyr fashion:

library(dplyr)
ans <- df %>%
  group_by(Name, StimulusName) %>%
  do(fun1(.))

result:

#     Name StimulusName FixationSeq SaccadeCount SaccadeDuration
#  1: sub1      Alpha11           2           NA              NA
#  2: sub1      Alpha11           2           NA              NA
#  3: sub1      Alpha11           2           NA              NA
#  4: sub1      Alpha11           2           NA              NA
#  5: sub1      Alpha11          NA            1               4
#  6: sub1      Alpha11          NA            1               4
#  7: sub1      Alpha11          NA            1               4
#  8: sub1      Alpha11          NA            1               4
#  9: sub1      Alpha11           3           NA              NA
# 10: sub1      Alpha11           3           NA              NA
# 11: sub1      Alpha11           3           NA              NA
# 12: sub1      Alpha11           3           NA              NA
# 13: sub1      Alpha11           3           NA              NA
# 14: sub1      Alpha11          NA            2               5
# 15: sub1      Alpha11          NA            2               5
# 16: sub1      Alpha11          NA            2               5
# 17: sub1      Alpha11          NA            2               5
# 18: sub1      Alpha11          NA            2               5
# 19: sub1      Alpha12           1           NA              NA
# 20: sub1      Alpha12          NA            1               2
# 21: sub1      Alpha12          NA            1               2
# 22: sub1      Alpha12           2           NA              NA
# 23: sub1      Alpha12          NA            2               1
# 24: sub2      Alpha11          NA            1               4
# 25: sub2      Alpha11          NA            1               4
# 26: sub2      Alpha11          NA            1               4
# 27: sub2      Alpha11          NA            1               4
# 28: sub2      Alpha11           2           NA              NA
# 29: sub2      Alpha11           2           NA              NA
#     Name StimulusName FixationSeq SaccadeCount SaccadeDuration

  • My approach uses a predefined function fun1 that does the job for each group.
  • The groups seem to be defined by Name and StimulusName.
  • I use two very useful functions that you should learn about: ?rle and ?rleidv.
  • I prepopulate each new column with NA values, then I fill in the new values where needed.
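To make the bullets above concrete, here is a minimal base-R sketch of the same idea on a toy vector. The vector v and the helper names run.id, na.runs and na.label are made up for illustration; the run ids are built with rle() alone, under the assumption that data.table is not loaded, and data.table::rleidv(na.ind) would give the same ids.

```r
# Toy vector (hypothetical): two fixations, a gap of two NAs,
# one fixation, then a gap of three NAs
v <- c(2L, 2L, NA, NA, 3L, NA, NA, NA)
na.ind <- is.na(v)

# rle() compresses consecutive equal values into (lengths, values) pairs
r <- rle(na.ind)

# One run id per element; data.table::rleidv(na.ind) returns the same ids
run.id <- rep(seq_along(r$lengths), r$lengths)

# Keep only the NA runs and renumber them 1, 2, ...
na.runs  <- run.id[na.ind]
na.label <- match(na.runs, unique(na.runs))

# Prepopulate with NA, then fill in only the rows that belong to a gap
count <- rep(NA_integer_, length(v))
count[na.ind] <- na.label

dur <- rep(NA_integer_, length(v))
gap.len <- r$lengths[r$values]          # lengths of the NA runs only
dur[na.ind] <- rep(gap.len, gap.len)

count
# [1] NA NA  1  1 NA  2  2  2
dur
# [1] NA NA  2  2 NA  3  3  3
```

fun1 does the same thing per group, using rleidv twice instead of rep() and match().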

This should do it. Maybe there is an easier way, though. The first mutate indicates the start of an NA segment. The group_by and the second mutate count the NAs for each segment.

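library(dplyr)
df %>%
    group_by(Name, StimulusName) %>%
    # default = 0L makes an NA segment at the very start of a group count as a start
    mutate(SaccadeCount = cumsum(is.na(FixationSeq) &
               !is.na(lag(FixationSeq, default = 0L))) * is.na(FixationSeq)) %>%
    group_by(Name, StimulusName, SaccadeCount) %>%
    # n() is the segment length; multiplying by is.na() zeroes the non-NA rows
    mutate(SaccadeDuration = n() * is.na(FixationSeq)) %>%
    ungroup()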
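The "start of an NA segment" flag can be tried out on a plain vector. This is only a base-R sketch: v is a made-up vector, and prev is a hypothetical stand-in for dplyr's lag(). Note that a vector beginning with NA needs the slot before the first element treated as non-missing, otherwise the leading gap is never flagged as a start.

```r
# Hypothetical toy vector that starts with a gap
v <- c(NA, NA, 2L, NA, 3L, 3L, NA, NA)
gap <- is.na(v)

# Previous element; the slot before the first element is filled with 0L
# (an arbitrary non-NA value) so that a leading gap still counts as a start
prev <- c(0L, head(v, -1))

start <- gap & !is.na(prev)     # TRUE on the first row of each NA segment
count <- cumsum(start) * gap    # running segment number, 0 outside gaps
count
# [1] 1 1 0 2 0 0 3 3
```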

Using dplyr:

df %>%
  group_by(Name, StimulusName) %>%
  mutate(x = is.na(FixationSeq),
         count = cumsum(c(TRUE, diff(x) != 0L) & x) * x,
         dur = NA_integer_) %>%
  group_by(Name, StimulusName, count) %>%
  mutate(dur = replace(dur, as.logical(count), n()))

Corresponding (more succinct) data.table version:

library(data.table)
setDT(df)

df[ , count := ({
  x <- is.na(FixationSeq)
  .(cumsum(c(TRUE, diff(x) != 0L) & x) * x)}), by = .(Name, StimulusName)]

df[as.logical(count), dur := .N, by = .(Name, StimulusName, count)]
#     Name StimulusName FixationSeq count dur
#  1: sub1      Alpha11           2     0  NA
#  2: sub1      Alpha11           2     0  NA
#  3: sub1      Alpha11           2     0  NA
#  4: sub1      Alpha11           2     0  NA
#  5: sub1      Alpha11          NA     1   4
#  6: sub1      Alpha11          NA     1   4
#  7: sub1      Alpha11          NA     1   4
#  8: sub1      Alpha11          NA     1   4
#  9: sub1      Alpha11           3     0  NA
# 10: sub1      Alpha11           3     0  NA
# 11: sub1      Alpha11           3     0  NA
# 12: sub1      Alpha11           3     0  NA
# 13: sub1      Alpha11           3     0  NA
# 14: sub1      Alpha11          NA     2   5
# 15: sub1      Alpha11          NA     2   5
# 16: sub1      Alpha11          NA     2   5
# 17: sub1      Alpha11          NA     2   5
# 18: sub1      Alpha11          NA     2   5
# 19: sub1      Alpha12           1     0  NA
# 20: sub1      Alpha12          NA     1   2
# 21: sub1      Alpha12          NA     1   2
# 22: sub1      Alpha12           2     0  NA
# 23: sub1      Alpha12          NA     2   1
# 24: sub2      Alpha11          NA     1   4
# 25: sub2      Alpha11          NA     1   4
# 26: sub2      Alpha11          NA     1   4
# 27: sub2      Alpha11          NA     1   4
# 28: sub2      Alpha11           2     0  NA
# 29: sub2      Alpha11           2     0  NA
#     Name StimulusName FixationSeq count dur

If desired, change count == 0 to NA:

df[count == 0, count := NA]

I would not change it to 'blank' (""), as shown in the question, because this would coerce the column to character and render the numbers useless for further analyses.
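A quick way to see the coercion, as a small base-R sketch (count, with_blank and with_na are illustrative names, not columns of df):

```r
count <- c(0L, 1L, 1L, 0L, 2L)

# Putting "" into an integer vector turns the whole vector into character
with_blank <- replace(count, count == 0L, "")
class(with_blank)
# [1] "character"

# NA keeps the integer type, so numeric summaries still work
with_na <- replace(count, count == 0L, NA)
class(with_na)
# [1] "integer"
max(with_na, na.rm = TRUE)
# [1] 2
```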


The cumsum(c(TRUE, diff(x) != 0L) & x) * x part, step by step:

v <- c(1, 1, NA, NA, 1, NA, NA, NA)
x <- is.na(v)
x
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

diff(x)
# [1]  0  1  0 -1  1  0  0

diff(x) != 0L
# [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE

c(TRUE, diff(x) != 0L) & x
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

cumsum(c(TRUE, diff(x) != 0L) & x)
# [1] 0 0 1 1 1 2 2 2

cumsum(c(TRUE, diff(x) != 0L) & x) * x
# [1] 0 0 1 1 0 2 2 2

The rest is hopefully rather straightforward.
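For completeness, the remaining step (the length of each labelled segment) can be sketched on the same toy vector in base R. Using ave() here is an assumption made for illustration; it stands in for the grouped n() and .N calls in the dplyr and data.table versions.

```r
v <- c(1, 1, NA, NA, 1, NA, NA, NA)
x <- is.na(v)
count <- cumsum(c(TRUE, diff(x) != 0L) & x) * x

# ave() applies length() within each level of count,
# much like n() or .N after grouping by count
dur <- ave(seq_along(v), count, FUN = length)
dur[count == 0] <- NA    # rows outside a gap get no duration
dur
# [1] NA NA  2  2 NA  3  3  3
```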
