Using dplyr to label and count gaps between values
I have this dataframe:
df<-structure(list(Name = c("sub1", "sub1", "sub1", "sub1", "sub1",
"sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1",
"sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1", "sub1",
"sub1", "sub1", "sub2", "sub2", "sub2", "sub2", "sub2", "sub2"
), StimulusName = c("Alpha11", "Alpha11", "Alpha11", "Alpha11",
"Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11",
"Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11",
"Alpha11", "Alpha11", "Alpha12", "Alpha12", "Alpha12", "Alpha12",
"Alpha12", "Alpha11", "Alpha11", "Alpha11", "Alpha11", "Alpha11",
"Alpha11"), FixationSeq = c(2L, 2L, 2L, 2L, NA, NA, NA, NA, 3L,
3L, 3L, 3L, 3L, NA, NA, NA, NA, NA, 1L, NA, NA, 2L, NA, NA, NA,
NA, NA, 2L, 2L)), row.names = c(NA, -29L), class = c("tbl_df",
"tbl", "data.frame"), spec = structure(list(cols = list(Name = structure(list(), class = c("collector_character",
"collector")), StimulusName = structure(list(), class = c("collector_character",
"collector")), FixationSeq = structure(list(), class = c("collector_integer",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector"))), class = "col_spec"))
In the column FixationSeq there are unique numbers (in my example 2 and 3 for Name = sub1 and StimulusName = Alpha11). Between these numbers there are segments filled with NA. There is also a segment after 3 filled with NA.
I would like to be able to create a new column SaccadeCount and add an incrementing numerical label for every NA segment (as a whole, i.e. potentially multiple rows) to the relevant rows in SaccadeCount.
Additionally, I'd like to have another column called SaccadeDuration that holds the total number of rows in each NA segment. So in the example df, the rows corresponding to the NA segment between 2 and 3 would be populated with '4', since that is the total number of rows between 2 and 3.
I would like to accomplish this using dplyr and group the operation by the columns Name and StimulusName.
An output might look something like this:
Name  StimulusName  FixationSeq  SaccadeCount  SaccadeDuration
sub1  Alpha11       2
sub1  Alpha11       2
sub1  Alpha11       2
sub1  Alpha11       2
sub1  Alpha11       NA           1             4
sub1  Alpha11       NA           1             4
sub1  Alpha11       NA           1             4
sub1  Alpha11       NA           1             4
sub1  Alpha11       3
sub1  Alpha11       3
sub1  Alpha11       3
sub1  Alpha11       3
sub1  Alpha11       3
sub1  Alpha11       NA           2             5
sub1  Alpha11       NA           2             5
sub1  Alpha11       NA           2             5
sub1  Alpha11       NA           2             5
sub1  Alpha11       NA           2             5
sub1  Alpha12       1
sub1  Alpha12       NA           1             2
sub1  Alpha12       NA           1             2
sub1  Alpha12       2
sub1  Alpha12       NA           2             1
sub2  Alpha11       NA           1             4
sub2  Alpha11       NA           1             4
sub2  Alpha11       NA           1             4
sub2  Alpha11       NA           1             4
sub2  Alpha11       2
sub2  Alpha11       2
Thank you very much for your time and help.
Using data.table:
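library(data.table)
# Label each run of NA in FixationSeq and record its length, for one group
fun1 <- function(x) {
  na.ind <- is.na(x$FixationSeq)
  # run ids over all rows, restricted to the NA rows, then renumbered 1, 2, ...
  na.vals <- rleidv(rleidv(na.ind)[na.ind])
  x$SaccadeCount <- NA
  x$SaccadeCount[na.ind] <- na.vals
  # length of each NA run, repeated once per row of that run
  na.rle <- rle(na.vals)
  x$SaccadeDuration <- NA
  x$SaccadeDuration[na.ind] <- rep(na.rle$lengths, na.rle$lengths)
  return(x)
}
# apply fun1 to every Name/StimulusName group
setDT(df)[, fun1(.SD), by = .(Name, StimulusName)]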
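The duration step inside fun1 relies on base R's rle(). As a small sketch on made-up input (toy_ids is a hypothetical vector of segment labels, not taken from the answer), this shows how rep(lengths, lengths) spreads each run's length back over the rows of that run:

```r
# toy_ids is a hypothetical vector of segment labels, one entry per NA row
toy_ids <- c(1L, 1L, 1L, 1L, 2L, 2L)

runs <- rle(toy_ids)
runs$values   # one entry per run: 1 2
runs$lengths  # how long each run is: 4 2

# repeat each run length once per row belonging to that run
durations <- rep(runs$lengths, runs$lengths)
durations
# [1] 4 4 4 4 2 2
```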
You can use fun1 in a dplyr fashion:
library(dplyr)
ans <- df %>%
  group_by(Name, StimulusName) %>%
  do(fun1(.))
result:
# Name StimulusName FixationSeq SaccadeCount SaccadeDuration
#1: sub1 Alpha11 2 NA NA
#2: sub1 Alpha11 2 NA NA
#3: sub1 Alpha11 2 NA NA
#4: sub1 Alpha11 2 NA NA
#5: sub1 Alpha11 2 NA NA
#6: sub1 Alpha11 2 NA NA
#7: sub1 Alpha11 2 NA NA
#8: sub1 Alpha11 2 NA NA
#9: sub1 Alpha11 2 NA NA
#10: sub1 Alpha11 2 NA NA
#11: sub1 Alpha11 2 NA NA
#12: sub1 Alpha11 2 NA NA
#13: sub1 Alpha11 2 NA NA
#14: sub1 Alpha11 2 NA NA
#15: sub1 Alpha11 2 NA NA
#16: sub1 Alpha11 2 NA NA
#17: sub1 Alpha11 2 NA NA
#18: sub1 Alpha11 2 NA NA
#19: sub1 Alpha11 2 NA NA
#20: sub1 Alpha11 2 NA NA
#21: sub1 Alpha11 2 NA NA
#22: sub1 Alpha11 NA 1 5
#23: sub1 Alpha11 NA 1 5
#24: sub1 Alpha11 NA 1 5
#25: sub1 Alpha11 NA 1 5
#26: sub1 Alpha11 NA 1 5
#27: sub1 Alpha1 9 NA NA
#28: sub1 Alpha1 9 NA NA
#29: sub1 Alpha1 9 NA NA
#30: sub1 Alpha1 9 NA NA
#31: sub1 Alpha1 9 NA NA
#32: sub1 Alpha1 9 NA NA
#33: sub1 Alpha1 9 NA NA
# Name StimulusName FixationSeq SaccadeCount SaccadeDuration
My approach uses a predefined function fun1 that does the job for each group.
The groups seem to be defined by Name and StimulusName.
Very important functions that I use and that you should know about: ?rle, ?rleidv.
I prefill the new columns with NA-values, then I add the new values where needed.
This should do it. Maybe there is an easier way, though. The first mutate indicates the start of an NA segment. The group_by and the second mutate count the NAs for each segment.
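library(dplyr)
df %>%
  # group first so a segment never spans two Name/StimulusName groups
  group_by(Name, StimulusName) %>%
  # default = 0L lets an NA segment on the first row of a group count as a start
  mutate(SaccadeCount = cumsum(is.na(FixationSeq) &
    !is.na(lag(FixationSeq, default = 0L))) * is.na(FixationSeq)) %>%
  group_by(Name, StimulusName, SaccadeCount) %>%
  mutate(SaccadeDuration = n()) %>%
  ungroup() %>%
  # zero out the duration on rows that are not part of an NA segment
  mutate(SaccadeDuration = SaccadeDuration * is.na(FixationSeq))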
Using dplyr:
library(dplyr)
df %>%
  group_by(Name, StimulusName) %>%
  mutate(x = is.na(FixationSeq),
         count = cumsum(c(TRUE, diff(x) != 0L) & x) * x,
         dur = NA_integer_) %>%
  group_by(Name, StimulusName, count) %>%
  mutate(dur = replace(dur, as.logical(count), n()))
Corresponding (more succinct) data.table version:
library(data.table)
setDT(df)
df[ , count := ({
  x <- is.na(FixationSeq)
  .(cumsum(c(TRUE, diff(x) != 0L) & x) * x)
}), by = .(Name, StimulusName)]
df[as.logical(count), dur := .N, by = .(Name, StimulusName, count)]
    Name StimulusName FixationSeq count dur
 1: sub1      Alpha11           2     0  NA
 2: sub1      Alpha11           2     0  NA
 3: sub1      Alpha11           2     0  NA
 4: sub1      Alpha11           2     0  NA
 5: sub1      Alpha11          NA     1   4
 6: sub1      Alpha11          NA     1   4
 7: sub1      Alpha11          NA     1   4
 8: sub1      Alpha11          NA     1   4
 9: sub1      Alpha11           3     0  NA
10: sub1      Alpha11           3     0  NA
11: sub1      Alpha11           3     0  NA
12: sub1      Alpha11           3     0  NA
13: sub1      Alpha11           3     0  NA
14: sub1      Alpha11          NA     2   5
15: sub1      Alpha11          NA     2   5
16: sub1      Alpha11          NA     2   5
17: sub1      Alpha11          NA     2   5
18: sub1      Alpha11          NA     2   5
19: sub1      Alpha12           1     0  NA
20: sub1      Alpha12          NA     1   2
21: sub1      Alpha12          NA     1   2
22: sub1      Alpha12           2     0  NA
23: sub1      Alpha12          NA     2   1
24: sub2      Alpha11          NA     1   4
25: sub2      Alpha11          NA     1   4
26: sub2      Alpha11          NA     1   4
27: sub2      Alpha11          NA     1   4
28: sub2      Alpha11           2     0  NA
29: sub2      Alpha11           2     0  NA
    Name StimulusName FixationSeq count dur
If desired, change count == 0 to NA:
df[count == 0, count := NA]
I would not change it to 'blank' (""), as shown in the question, because this would coerce the column to character and render the numbers useless for further analyses.
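The coercion is easy to check on a plain vector. This is only a sketch with hypothetical toy labels (count_num is not part of the answer); it compares replacing the zeros with NA against replacing them with an empty string:

```r
# count_num holds hypothetical toy labels; 0 marks "no segment"
count_num <- c(0L, 1L, 1L, 0L, 2L)

# replacing 0 with NA keeps the vector integer
count_na <- replace(count_num, count_num == 0L, NA)
class(count_na)
# [1] "integer"
max(count_na, na.rm = TRUE)
# [1] 2

# replacing 0 with "" converts every value to a string
count_blank <- replace(count_num, count_num == 0L, "")
class(count_blank)
# [1] "character"
count_blank
# [1] ""  "1" "1" ""  "2"
```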
The cumsum(c(TRUE, diff(x) != 0L) & x) * x part step by step:
v <- c(1, 1, NA, NA, 1, NA, NA, NA)
x <- is.na(v)
x
# [1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
diff(x)
# [1] 0 1 0 -1 1 0 0
diff(x) != 0L
# [1] FALSE TRUE FALSE TRUE TRUE FALSE FALSE
c(TRUE, diff(x) != 0L) & x
# [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
cumsum(c(TRUE, diff(x) != 0L) & x)
# [1] 0 0 1 1 1 2 2 2
cumsum(c(TRUE, diff(x) != 0L) & x) * x
# [1] 0 0 1 1 0 2 2 2
The rest is hopefully rather straightforward.
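For readers who want both steps in one place, here is a base R sketch that reuses the counting expression and adds a duration per NA run. label_na_runs is a hypothetical helper name, not part of the answer, and it assumes the vector it receives belongs to a single Name/StimulusName group:

```r
# label_na_runs is a hypothetical helper; it handles one group's vector only
label_na_runs <- function(v) {
  x <- is.na(v)
  # same counting expression as in the walkthrough: 0 for non-NA, then 1, 2, ...
  count <- cumsum(c(TRUE, diff(x) != 0L) & x) * x
  # length of every run (NA or not), repeated once per element of that run
  runs <- rle(x)
  run_len <- rep(runs$lengths, runs$lengths)
  # keep the run length only on the NA rows
  dur <- ifelse(x, run_len, NA_integer_)
  data.frame(v = v, count = count, dur = dur)
}

res <- label_na_runs(c(1, 1, NA, NA, 1, NA, NA, NA))
res$count
# [1] 0 0 1 1 0 2 2 2
res$dur
# [1] NA NA  2  2 NA  3  3  3
```

In the grouped pipelines, the same logic runs once per group, which is what keeps a segment from spilling over a group boundary.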