Create a new factor/variable with levels based on "N" consecutive occurrences of original factor level
I've started a new project that involves a kind of data management I've never had to do before, and I seem to lack either the skills or the right search terms to find an example. I have a very large data set with a grouping variable and a binary event variable. It can be reduced to a working example:
library('data.table')
grp <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
v1 <- c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
test<-data.frame(grp,v1)
test
   grp v1
1    a  1
2    a  0
3    a  0
4    a  1
5    a  1
6    a  1
7    a  1
8    b  1
9    b  0
10   b  0
11   b  0
12   b  1
I've been using data.table to label streaks of "v1" within each unique level of "grp" as events, stored in a new ordinal numeric factor "event":
setDT(test)
test<-test[, .(v1 = v1, event = rleidv(v1)), by=grp]
    grp v1 event
 1:   a  1     1
 2:   a  0     2
 3:   a  0     2
 4:   a  1     3
 5:   a  1     3
 6:   a  1     3
 7:   a  1     3
 8:   b  1     1
 9:   b  0     2
10:   b  0     2
11:   b  0     2
12:   b  1     3
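To see what rleid is doing in that step, here is a minimal sketch on the group "a" values alone. The base R comparison built from rle() is my own addition for illustration and is not part of the original post.

```r
library(data.table)

# the v1 values of group "a" from the example above
v1 <- c(1, 0, 0, 1, 1, 1, 1)

# rleid() gives every run of identical consecutive values its own id
run_id <- rleid(v1)
print(run_id)
# [1] 1 2 2 3 3 3 3

# the same ids built in base R from the run lengths (illustrative only)
rl <- rle(v1)
base_id <- rep(seq_along(rl$lengths), rl$lengths)
stopifnot(identical(run_id, base_id))
```

Running rleid inside by=grp, as above, restarts the numbering at 1 for every group.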
In the actual data set some of these "grp"-specific events are very long, and I need to break them up into smaller, n-limited events, where n is variable. For example, my desired output for a new variable "sub.event" with n = 2 is:
> test
    grp v1 event sub.event
 1:   a  1     1         1
 2:   a  0     2         2
 3:   a  0     2         2
 4:   a  1     3         3
 5:   a  1     3         3
 6:   a  1     3         4
 7:   a  1     3         4
 8:   b  1     1         1
 9:   b  0     2         2
10:   b  0     2         2
11:   b  0     2         3
12:   b  1     3         4
I've been pulling my hair out trying to figure out a way to do this. It seems simple enough that I must be missing something obvious. If it helps, the original variables can be concatenated into new variables before determining the n-limited "sub.event".
Thanks in advance for all your help.
Here is a method that works with chaining.
setDT(test)[, new := rep(1:0, length.out=.N), by=.(grp, rleid(v1))][,
new := cumsum(new), by=grp]
The first link in the chain returns a vector of alternating 1s and 0s, as long as each grp-rleid run, so that a 1 marks the first row of every sub-event. The next link in the chain sums these markers up with cumsum by grp, which turns them into sub-event numbers.
This returns
test
    grp v1 new
 1:   a  1   1
 2:   a  0   2
 3:   a  0   2
 4:   a  1   3
 5:   a  1   3
 6:   a  1   4
 7:   a  1   4
 8:   b  1   1
 9:   b  0   2
10:   b  0   2
11:   b  0   3
12:   b  1   4
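The two links of the chain can also be run one at a time to inspect the intermediate 0/1 markers. This is only a sketch of the same idea: the column name marker is introduced here for illustration, whereas the chained version overwrites new in place.

```r
library(data.table)

test <- data.table(
  grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  v1  = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

# step 1: within each run of v1 (per grp), flag the first row of every pair
test[, marker := rep(1:0, length.out = .N), by = .(grp, rleid(v1))]

# step 2: a running total of the flags within grp numbers the sub-events
test[, new := cumsum(marker), by = grp]

print(test)
```

The marker column reads 1, 1, 0, 1, 0, 1, 0 for group "a", so every 1 opens a new sub-event and every 0 continues the current one.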
Note that as written, it doesn't automatically extend to n > 2. However, the piece that produces it, rep(1:0, length.out=.N), could be written rep(c(1L, rep(0L, n)), length.out=.N), where n+1 is the number of repeated values that you'd want. In other words, n here counts the zeros and is one less than the n in the question.
For example, with two zeros after each 1 (sub-events of up to three rows), the code would look like
test[, new := rep(c(1L, rep(0L, 2L)), length.out=.N), by=.(grp, rleid(v1))][,
new := cumsum(new), by=grp]
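One way to avoid the off-by-one between the two meanings of n is to wrap the chain in a small function that takes the maximum sub-event length directly. This is a rough sketch under my own assumptions: the function name add_sub_event and the argument max_len are hypothetical and do not come from the answer, and the column names grp and v1 are hard-coded to match the example.

```r
library(data.table)

# hypothetical helper: number sub-events of at most max_len rows
add_sub_event <- function(dt, max_len) {
  dt <- copy(dt)                            # leave the caller's table untouched
  pattern <- c(1L, rep(0L, max_len - 1L))   # one 1, then max_len - 1 zeros
  dt[, sub.event := rep(pattern, length.out = .N), by = .(grp, rleid(v1))]
  dt[, sub.event := cumsum(sub.event), by = grp]
  dt[]
}

test <- data.table(
  grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  v1  = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

res2 <- add_sub_event(test, 2L)  # matches the desired output in the question
res3 <- add_sub_event(test, 3L)  # sub-events of up to three rows
print(res2)
print(res3)
```

With max_len = 3 the run of four 1s in group "a" is split into a sub-event of three rows followed by one of a single row.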
Somewhat roundabout:
# make counters within v1, grp
test[, v0 := rep(1:.N, each=2, length.out=.N), by=.(rleid(grp, v1))]
# make overall counters
test[, v := .GRP, by=rleid(grp, v1, v0)]
# difference per grp
test[, v := v - first(v) + 1L, by=grp]
# drop internal counter
test[, v0 := NULL ]
    grp v1 v
 1:   a  1 1
 2:   a  0 2
 3:   a  0 2
 4:   a  1 3
 5:   a  1 3
 6:   a  1 4
 7:   a  1 4
 8:   b  1 1
 9:   b  0 2
10:   b  0 2
11:   b  0 3
12:   b  1 4
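The same counter-based approach might be parameterised as follows. This is a sketch rather than the answer's exact code: n is an assumed variable holding the maximum rows per sub-event, and the overall counter is computed with rleid(grp, v1, v0) directly instead of .GRP by that rleid, which should give the same ids because the runs are consecutive.

```r
library(data.table)

test <- data.table(
  grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  v1  = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

n <- 2L  # assumed cap on rows per sub-event

# counter within each run of (grp, v1): 1, 1, 2, 2, ... when n = 2
test[, v0 := rep(seq_len(.N), each = n, length.out = .N), by = rleid(grp, v1)]
# overall counter across the whole table
test[, v := rleid(grp, v1, v0)]
# restart the numbering at 1 within each grp
test[, v := v - first(v) + 1L, by = grp]
# drop the internal counter
test[, v0 := NULL]

print(test)
```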