Label and count gaps between values using dplyr

Question

I have this dataframe:

    df<-structure(list(Name = c("sub1", "sub1", "sub1", "sub1", "sub1", 
                            "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", 
                            "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", 
                            "sub1", "sub1", "sub2", "sub2", "sub2", "sub2", "sub2", "sub2"
), StimulusName = c("Alpha11", "Alpha11", "Alpha11", "Alpha11", 
                    "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", 
                    "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", 
                    "Alpha11", "Alpha11", "Alpha12", "Alpha12", "Alpha12", "Alpha12", 
                    "Alpha12", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", 
                    "Alpha11"), FixationSeq = c(2L, 2L, 2L, 2L, NA, NA, NA, NA, 3L, 
                                                3L, 3L, 3L, 3L, NA, NA, NA, NA, NA, 1L, NA, NA, 2L, NA, NA, NA, 
                                                NA, NA, 2L, 2L)), row.names = c(NA, -29L), class = c("tbl_df", 
                                                                                                     "tbl", "data.frame"), spec = structure(list(cols = list(Name = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                "collector")), StimulusName = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                          "collector")), FixationSeq = structure(list(), class = c("collector_integer", 
                                                                                                                                                                                                                                                                                                                   "collector"))), default = structure(list(), class = c("collector_guess", 
                                                                                                                                                                                                                                                                                                                                                                         "collector"))), class = "col_spec"))

In the column FixationSeq there are unique numbers (in my example, 2 and 3 for Name = sub1 and StimulusName = Alpha11). Between these numbers there are segments filled with NA. There is also a segment after 3 filled with NA.

I would like to create a new column, SaccadeCount, and add an incrementing numerical label to every NA segment (as a whole, i.e. potentially spanning multiple rows) in the relevant rows of SaccadeCount.

Additionally, I'd like another column called SaccadeDuration that holds the total number of rows in each NA segment. So in the example df, the rows corresponding to the NA segment between 2 and 3 would be populated with '4', since that is the number of rows between 2 and 3.

I would like to accomplish this using dplyr and group the operation by the columns Name and StimulusName.

An output might look something like this:

    Name    StimulusName    FixationSeq    SaccadeCount    SaccadeDuration
    sub1    Alpha11         2
    sub1    Alpha11         2
    sub1    Alpha11         2
    sub1    Alpha11         2
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         NA             1               4
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         3
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha11         NA             2               5
    sub1    Alpha12         1
    sub1    Alpha12         NA             1               2
    sub1    Alpha12         NA             1               2
    sub1    Alpha12         2
    sub1    Alpha12         NA             2               1
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         NA             1               4
    sub2    Alpha11         2
    sub2    Alpha11         2

Thank you very much for your time and help.

Answer 1

Using data.table

Code:

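library(data.table)

fun1 <- function(x) {
    # TRUE for rows where FixationSeq is missing
    na.ind = is.na(x$FixationSeq)
    # Run ids of the NA rows only, renumbered 1, 2, ... per NA segment
    na.vals = rleidv(rleidv(na.ind)[na.ind])
    x$SaccadeCount = NA
    x$SaccadeCount[na.ind] = na.vals

    # Length of each NA segment, repeated for every row of that segment
    na.rle = rle(na.vals)
    x$SaccadeDuration = NA
    x$SaccadeDuration[na.ind] = rep(na.rle$lengths, na.rle$lengths)

    return(x)
}

# Apply fun1 to each (Name, StimulusName) group
setDT(df)[, fun1(.SD), by = .(Name, StimulusName)]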

You can also use fun1 in a dplyr fashion:

library(dplyr)

ans <- df %>%
  group_by(Name, StimulusName) %>%
  do(fun1(.))

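Result (this printout was produced from different sample data than the df shown in the question, so the rows differ):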

 #   Name StimulusName FixationSeq SaccadeCount SaccadeDuration
 #1: sub1      Alpha11           2           NA              NA
 #2: sub1      Alpha11           2           NA              NA
 #3: sub1      Alpha11           2           NA              NA
 #4: sub1      Alpha11           2           NA              NA
 #5: sub1      Alpha11           2           NA              NA
 #6: sub1      Alpha11           2           NA              NA
 #7: sub1      Alpha11           2           NA              NA
 #8: sub1      Alpha11           2           NA              NA
 #9: sub1      Alpha11           2           NA              NA
#10: sub1      Alpha11           2           NA              NA
#11: sub1      Alpha11           2           NA              NA
#12: sub1      Alpha11           2           NA              NA
#13: sub1      Alpha11           2           NA              NA
#14: sub1      Alpha11           2           NA              NA
#15: sub1      Alpha11           2           NA              NA
#16: sub1      Alpha11           2           NA              NA
#17: sub1      Alpha11           2           NA              NA
#18: sub1      Alpha11           2           NA              NA
#19: sub1      Alpha11           2           NA              NA
#20: sub1      Alpha11           2           NA              NA
#21: sub1      Alpha11           2           NA              NA
#22: sub1      Alpha11          NA            1               5
#23: sub1      Alpha11          NA            1               5
#24: sub1      Alpha11          NA            1               5
#25: sub1      Alpha11          NA            1               5
#26: sub1      Alpha11          NA            1               5
#27: sub1       Alpha1           9           NA              NA
#28: sub1       Alpha1           9           NA              NA
#29: sub1       Alpha1           9           NA              NA
#30: sub1       Alpha1           9           NA              NA
#31: sub1       Alpha1           9           NA              NA
#32: sub1       Alpha1           9           NA              NA
#33: sub1       Alpha1           9           NA              NA
#    Name StimulusName FixationSeq SaccadeCount SaccadeDuration

My approach uses a predefined function, fun1, that does the job for each group.
The groups seem to be defined by Name and StimulusName.
It relies on two functions that are well worth learning: see ?rle and ?rleidv.
I prepopulate each new column with NA values, then fill in the new values where needed.
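To make the role of rle more concrete, here is a minimal base-R sketch on a toy vector. It is not the answer's code: the vector v and the names runs, saccade_count and saccade_duration are made up for illustration, and rep(cumsum(...)) stands in for the data.table rleidv calls.

```r
# Toy vector (an assumption for illustration, not the question's data)
v <- c(2, 2, NA, NA, NA, 3, NA)
na_ind <- is.na(v)

# rle() compresses the logical vector into runs of equal values
runs <- rle(na_ind)
runs$lengths  # length of each run
runs$values   # whether each run is an NA run

# Number the NA runs: cumsum over the runs, expanded back to one value per row
saccade_count <- rep(cumsum(runs$values), runs$lengths)
saccade_count[!na_ind] <- NA

# Length of the run each row belongs to, kept only for NA rows
saccade_duration <- rep(runs$lengths, runs$lengths)
saccade_duration[!na_ind] <- NA

print(data.frame(v, saccade_count, saccade_duration))
```

The same prepopulate-then-fill idea appears here as "compute for all rows, then blank out the non-NA rows".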

Answer 2

This should do it, though there may be an easier way. After grouping by Name and StimulusName, the first mutate flags the start of each NA segment and numbers the segments. The second group_by and mutate count the NA rows in each segment, and the final mutate resets the non-NA rows to 0.

library(dplyr)

df %>%
  group_by(Name, StimulusName) %>%
  # A segment starts at an NA row whose previous row is not NA; the default
  # of 0L makes an NA in the first row of a group count as a start too
  mutate(SaccadeCount = cumsum(is.na(FixationSeq) &
                                 !is.na(lag(FixationSeq, default = 0L))) *
           is.na(FixationSeq)) %>%
  group_by(Name, StimulusName, SaccadeCount) %>%
  mutate(SaccadeDuration = n()) %>%
  ungroup() %>%
  mutate(SaccadeDuration = SaccadeDuration * is.na(FixationSeq))

Answer 3

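Using dplyr:

library(dplyr)

df %>%
  group_by(Name, StimulusName) %>%
  mutate(x = is.na(FixationSeq),
         # Number the NA segments within each group; non-NA rows get 0
         count = cumsum(c(TRUE, diff(x) != 0L) & x) * x,
         dur = NA_integer_) %>%
  group_by(Name, StimulusName, count) %>%
  # For rows in an NA segment (count > 0), store the segment's row count
  mutate(dur = replace(dur, as.logical(count), n()))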

Corresponding (more succinct) data.table version:

library(data.table)
setDT(df)

df[ , count := ({
  x <- is.na(FixationSeq)
  .(cumsum(c(TRUE, diff(x) != 0L) & x) * x)}), by = .(Name, StimulusName)]

df[as.logical(count), dur := .N, by = .(Name, StimulusName, count)]

    Name StimulusName FixationSeq count dur
 1: sub1      Alpha11           2     0  NA
 2: sub1      Alpha11           2     0  NA
 3: sub1      Alpha11           2     0  NA
 4: sub1      Alpha11           2     0  NA
 5: sub1      Alpha11          NA     1   4
 6: sub1      Alpha11          NA     1   4
 7: sub1      Alpha11          NA     1   4
 8: sub1      Alpha11          NA     1   4
 9: sub1      Alpha11           3     0  NA
10: sub1      Alpha11           3     0  NA
11: sub1      Alpha11           3     0  NA
12: sub1      Alpha11           3     0  NA
13: sub1      Alpha11           3     0  NA
14: sub1      Alpha11          NA     2   5
15: sub1      Alpha11          NA     2   5
16: sub1      Alpha11          NA     2   5
17: sub1      Alpha11          NA     2   5
18: sub1      Alpha11          NA     2   5
19: sub1      Alpha12           1     0  NA
20: sub1      Alpha12          NA     1   2
21: sub1      Alpha12          NA     1   2
22: sub1      Alpha12           2     0  NA
23: sub1      Alpha12          NA     2   1
24: sub2      Alpha11          NA     1   4
25: sub2      Alpha11          NA     1   4
26: sub2      Alpha11          NA     1   4
27: sub2      Alpha11          NA     1   4
28: sub2      Alpha11           2     0  NA
29: sub2      Alpha11           2     0  NA
    Name StimulusName FixationSeq count dur

If desired, change count == 0 to NA:

df[count == 0, count := NA]

I would not change it to 'blank' ("") as shown in the question, because this would coerce the column to character and render the numbers useless for further analyses.
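The coercion is easy to see on a small vector. The sketch below uses made-up names (count_demo, as_blank, as_na) and plain base R rather than the data.table syntax above.

```r
# Hypothetical label vector: 0 marks rows outside any NA segment
count_demo <- c(0L, 1L, 1L, 0L, 2L)

# Replacing 0 with "" forces every element to character
as_blank <- ifelse(count_demo == 0L, "", count_demo)
class(as_blank)

# Replacing 0 with NA keeps the column integer, so arithmetic still works
as_na <- replace(count_demo, count_demo == 0L, NA)
class(as_na)
max(as_na, na.rm = TRUE)
```

With the blank version, a call such as max() would compare strings rather than numbers, which is rarely what a later analysis wants.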

The cumsum(c(TRUE, diff(x) != 0L) & x) * x part, step by step:

v <- c(1, 1, NA, NA, 1, NA, NA, NA)
x <- is.na(v)
x
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

diff(x)
# [1]  0  1  0 -1  1  0  0

diff(x) != 0L
# [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE

c(TRUE, diff(x) != 0L) & x
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

cumsum(c(TRUE, diff(x) != 0L) & x)
# [1] 0 0 1 1 1 2 2 2

cumsum(c(TRUE, diff(x) != 0L) & x) * x
# [1] 0 0 1 1 0 2 2 2

The rest is hopefully rather straightforward.
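For readers who want to see the remaining step (the duration) on the same toy vector, here is a small base-R sketch. The helper name label_na_runs is hypothetical and not part of the answer, and it uses ave() to count the rows per label, where the answer uses n() or .N within groups.

```r
# Hypothetical helper: labels NA runs in a vector and counts their lengths
label_na_runs <- function(v) {
  x <- is.na(v)
  # Same expression as in the step-by-step walk-through above
  count <- cumsum(c(TRUE, diff(x) != 0L) & x) * x
  # Number of rows sharing each label; only meaningful for NA rows
  dur <- ave(rep(1L, length(v)), count, FUN = sum)
  dur[!x] <- NA
  data.frame(count = count, dur = dur)
}

res <- label_na_runs(c(1, 1, NA, NA, 1, NA, NA, NA))
print(res)
```

Applied per group (for example inside a grouped mutate, or with by = in data.table), this would give the count and dur columns described above.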
