
Create a new factor/variable with levels based on "N" consecutive occurrences of original factor level

I've started a new project that involves data management I've never had to do before, and I seem to lack either the skills or the right search terms to find an example. I have a very large data set with a grouping variable and a binary event variable. It can be reduced to this working example:

library('data.table')
grp <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
v1 <- c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
test<-data.frame(grp,v1)
test

   grp v1
1    a  1
2    a  0
3    a  0
4    a  1
5    a  1
6    a  1
7    a  1
8    b  1
9    b  0
10   b  0
11   b  0
12   b  1

I've been using data.table to label streaks of "v1" within each level of "grp" as events, using a new ordinal numeric factor "event":

setDT(test)
test<-test[, .(v1 = v1, event = rleidv(v1)), by=grp]

    grp v1 event
 1:   a  1     1
 2:   a  0     2
 3:   a  0     2
 4:   a  1     3
 5:   a  1     3
 6:   a  1     3
 7:   a  1     3
 8:   b  1     1
 9:   b  0     2
10:   b  0     2
11:   b  0     2
12:   b  1     3

In the actual data set, some of these "grp"-specific events are very long, and I need to break them up into smaller sub-events of at most n rows, where n can vary. For example, my desired output for a new variable "sub.event" with n = 2 is:

> test
    grp v1 event sub.event
 1:   a  1     1         1
 2:   a  0     2         2
 3:   a  0     2         2
 4:   a  1     3         3
 5:   a  1     3         3
 6:   a  1     3         4
 7:   a  1     3         4
 8:   b  1     1         1
 9:   b  0     2         2
10:   b  0     2         2
11:   b  0     2         3
12:   b  1     3         4

I've been pulling my hair out trying to figure out a way to do this. It seems simple enough that I must be missing something obvious. If it helps, the original variables can be concatenated into new variables before determining the n-limited "sub.event".

Thanks in advance for all your help.

Here is a method that works with chaining.

setDT(test)[, new := rep(1:0, length.out=.N), by=.(grp, rleid(v1))][,
              new := cumsum(new), by=grp]

The first link in the chain returns a vector of alternating 1s and 0s, recycled to the length of each grp-rleid pair. The next link accumulates that vector with cumsum by grp.

This returns

test
    grp v1 new
 1:   a  1   1
 2:   a  0   2
 3:   a  0   2
 4:   a  1   3
 5:   a  1   3
 6:   a  1   4
 7:   a  1   4
 8:   b  1   1
 9:   b  0   2
10:   b  0   2
11:   b  0   3
12:   b  1   4
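
To see why the chained call works, it can help to keep the intermediate 1/0 vector in its own column instead of overwriting it. The following is a minimal sketch on the question's example data; the column name flag is only illustrative and is not part of the original answer.

```r
library(data.table)

test <- data.table(
  grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  v1  = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

# Step 1: within each run of v1 (per grp), mark the first row of every
# chunk of two rows with 1 and the second row with 0.
# `flag` is a hypothetical helper column.
test[, flag := rep(1:0, length.out = .N), by = .(grp, rleid(v1))]

# Step 2: a running total of the flags within grp numbers the chunks.
test[, new := cumsum(flag), by = grp]

print(test)
```

Each 1 in flag starts a new sub-event, so the cumulative sum increases exactly at those rows and stays flat on the 0 rows.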

Note that, as written, it doesn't automatically extend to n > 2. However, the piece that produces the pattern, 1:0, could be written rep(c(1L, rep(0L, n)), length.out=.N), where n is the number of 0s that follow each 1, so that n+1 is the number of repeated values that you'd want.

For example, with two 0s after each 1 (sub-events of up to three rows), the code would look like

test[, new := rep(c(1L, rep(0L, 2L)), length.out=.N), by=.(grp, rleid(v1))][,
       new := cumsum(new), by=grp]
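
The same pattern can be wrapped in a small function so that the cap is a parameter. This is only a sketch: add_sub_event is a hypothetical name, it assumes its input is a data.table with columns grp and v1, and here n means the maximum number of rows per sub-event (the question's meaning), so the pattern uses n - 1 zeros.

```r
library(data.table)

test <- data.table(
  grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  v1  = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

# Hypothetical helper: split every run of v1 within grp into chunks of at
# most n rows and number the chunks per grp. Assumes columns grp and v1.
add_sub_event <- function(dt, n) {
  stopifnot(n >= 1L)
  dt <- copy(dt)  # work on a copy so the caller's table is not modified
  # One 1 followed by n - 1 zeros, recycled over each run
  dt[, sub.event := rep(c(1L, rep(0L, n - 1L)), length.out = .N),
     by = .(grp, rleid(v1))]
  # Running total of the 1s within grp gives the sub-event number
  dt[, sub.event := cumsum(sub.event), by = grp]
  dt[]
}

res2 <- add_sub_event(test, 2L)  # cap of 2 rows, as in the question
res3 <- add_sub_event(test, 3L)  # cap of 3 rows
```

Copying the table first is a design choice for the sketch; on a very large data set it may be preferable to drop the copy and update by reference.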

Somewhat roundabout:

# make counters within v1, grp
test[, v0 := rep(1:.N, each=2, length.out=.N), by=.(rleid(grp, v1))]

# make overall counters
test[, v := .GRP, by=rleid(grp, v1, v0)]

# difference per grp
test[, v := v - first(v) + 1L, by=grp]

# drop internal counter
test[, v0 := NULL ]

    grp v1 v
 1:   a  1 1
 2:   a  0 2
 3:   a  0 2
 4:   a  1 3
 5:   a  1 3
 6:   a  1 4
 7:   a  1 4
 8:   b  1 1
 9:   b  0 2
10:   b  0 2
11:   b  0 3
12:   b  1 4
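
The counter idea above can probably be condensed by using integer division on the row position within each run instead of rep(..., each=2), and then letting rleid number the (run, chunk) pairs inside each grp. This is a sketch of that variant, not the answer's own code; the names run, chunk and n are illustrative, with n taken as the maximum rows per sub-event.

```r
library(data.table)

test <- data.table(
  grp = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  v1  = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

n <- 2L  # assumed maximum number of rows per sub-event

# Run id of v1 within each grp (hypothetical helper column)
test[, run := rleid(v1), by = grp]

# Zero-based chunk index within each run: rows 1..n -> 0, n+1..2n -> 1, ...
test[, chunk := (seq_len(.N) - 1L) %/% n, by = .(grp, run)]

# A new sub-event starts whenever run or chunk changes inside a grp
test[, sub.event := rleid(run, chunk), by = grp]

# Drop the helper columns
test[, c("run", "chunk") := NULL]

print(test)
```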
