How to create groups based on conditions
I have this kind of data:
set.seed(12345)
df <- data.frame(group=rep(c("A"),26), size=c(rep(1000,5),rep(0,3),rep(1000,7),rep(0,3),rep(1000,5),rep(0,3)),
int=c(rnorm(3,5,1),rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5)),
out=c(rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5),rnorm(3,5,1)))
Here is the desired output:
group size int out id id2
1 A 1000 5.585529 0.000000 1 1
2 A 1000 5.709466 0.000000 1 1
3 A 1000 4.890697 0.000000 1 1
4 A 1000 0.000000 0.000000 1 1
5 A 1000 0.000000 0.000000 1 1
6 A 0 0.000000 4.080678 1 1
7 A 0 0.000000 4.883752 NA 1
8 A 0 0.000000 6.817312 NA 1
9 A 1000 4.546503 0.000000 2 2
10 A 1000 5.605887 0.000000 2 2
11 A 1000 3.182044 0.000000 2 2
12 A 1000 0.000000 0.000000 2 2
13 A 1000 0.000000 0.000000 2 2
14 A 1000 0.000000 0.000000 2 2
15 A 1000 0.000000 0.000000 2 2
16 A 0 0.000000 5.370628 2 2
17 A 0 0.000000 5.520216 NA 2
18 A 0 0.000000 4.249468 NA 2
19 A 1000 5.630099 0.000000 3 3
20 A 1000 4.723816 0.000000 3 3
21 A 1000 4.715840 0.000000 3 3
22 A 1000 0.000000 0.000000 3 3
23 A 1000 0.000000 0.000000 3 3
24 A 0 0.000000 5.816900 3 3
25 A 0 0.000000 4.113642 NA 3
26 A 0 0.000000 4.668422 NA 3
The new group id is created based on the data above. I believe the rle function is the way to go, but I cannot work out the rest.
A variation on @ycw's answer:
library(data.table)
setDT(df)
df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]
group size int out g
1: A 1000 5.585529 0.000000 1
2: A 1000 5.709466 0.000000 1
3: A 1000 4.890697 0.000000 1
4: A 1000 0.000000 0.000000 1
5: A 1000 0.000000 0.000000 1
6: A 0 0.000000 4.080678 1
7: A 0 0.000000 4.883752 NA
8: A 0 0.000000 6.817312 NA
9: A 2000 4.546503 0.000000 3
10: A 2000 5.605887 0.000000 3
11: A 2000 3.182044 0.000000 3
12: A 2000 0.000000 0.000000 3
13: A 2000 0.000000 0.000000 3
14: A 2000 0.000000 0.000000 3
15: A 2000 0.000000 0.000000 3
16: A 0 0.000000 5.370628 3
17: A 0 0.000000 5.520216 NA
18: A 0 0.000000 4.249468 NA
19: A 5000 5.630099 0.000000 5
20: A 5000 4.723816 0.000000 5
21: A 5000 4.715840 0.000000 5
22: A 5000 0.000000 0.000000 5
23: A 5000 0.000000 0.000000 5
24: A 0 0.000000 5.816900 5
25: A 0 0.000000 4.113642 NA
26: A 0 0.000000 4.668422 NA
group size int out g
(@ycw suggested I make it a separate answer. Also, the NA^x trick is borrowed from @akrun.)
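To illustrate the NA^x trick in isolation, here is a minimal base R sketch. The vectors keep and ids are made-up illustration data, not taken from the question. It relies on R's rule that anything raised to the power 0 is 1, so NA^0 is 1 while NA^1 stays NA.

```r
# Hypothetical illustration data: ids to mask, and a flag for which to keep.
ids  <- c(1, 1, 2, 3)
keep <- c(TRUE, TRUE, FALSE, TRUE)

# !keep is FALSE (0) where we keep and TRUE (1) where we drop,
# so NA^(!keep) is 1 for kept positions and NA for dropped ones.
mask <- NA^(!keep)

# Multiplying by the mask leaves kept ids unchanged and turns the rest into NA.
masked <- ids * mask
print(masked)
```

In the one-liner above, z plays the role of keep and the rleid() result plays the role of ids.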
For the OP's group numbers, this extra step works:
df[, g := match(g, unique(na.omit(g)))]
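As a small sketch of why this renumbering works, consider a toy vector shaped like the g column above (the values here are illustrative only). match() returns the position of each value within the vector of distinct non-NA ids, which gives consecutive numbers, and NA entries stay NA because they have no match.

```r
# Toy ids with gaps (1, 3, 5) and NA rows, mimicking the g column above.
g <- c(1, 1, NA, 3, 3, NA, 5)

# Distinct non-NA ids in order of appearance: 1, 3, 5.
ids <- unique(na.omit(g))

# Position of each g within ids: 1 -> 1, 3 -> 2, 5 -> 3; NA is unmatched.
renumbered <- match(g, ids)
print(renumbered)
```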
For the extension the OP added ("id2"):
w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]
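The two lines above can be sketched on a toy table to show what each piece does. The table dt and its values are made up for illustration, and the lookup values are taken from dt$g explicitly here. The join with which=TRUE and mult="first" returns the row number of the first row of each group, and the cumulative sum then increases by one at each of those rows, so the NA rows inherit the id of the group before them.

```r
library(data.table)

# Toy table: two groups, each followed by an NA row (illustrative values).
dt <- data.table(g = c(1, 1, NA, 2, 2, NA))

# Row number of the first occurrence of each non-NA group id.
w <- dt[.(unique(na.omit(dt$g))), on = .(g), which = TRUE, mult = "first"]

# .I is the row index; the running count steps up at each group start.
dt[, g2 := cumsum(.I %in% w)]
print(dt)
```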
So in the end we have...
group size int out g g2
1: A 1000 5.585529 0.000000 1 1
2: A 1000 5.709466 0.000000 1 1
3: A 1000 4.890697 0.000000 1 1
4: A 1000 0.000000 0.000000 1 1
5: A 1000 0.000000 0.000000 1 1
6: A 0 0.000000 4.080678 1 1
7: A 0 0.000000 4.883752 NA 1
8: A 0 0.000000 6.817312 NA 1
9: A 2000 4.546503 0.000000 2 2
10: A 2000 5.605887 0.000000 2 2
11: A 2000 3.182044 0.000000 2 2
12: A 2000 0.000000 0.000000 2 2
13: A 2000 0.000000 0.000000 2 2
14: A 2000 0.000000 0.000000 2 2
15: A 2000 0.000000 0.000000 2 2
16: A 0 0.000000 5.370628 2 2
17: A 0 0.000000 5.520216 NA 2
18: A 0 0.000000 4.249468 NA 2
19: A 5000 5.630099 0.000000 3 3
20: A 5000 4.723816 0.000000 3 3
21: A 5000 4.715840 0.000000 3 3
22: A 5000 0.000000 0.000000 3 3
23: A 5000 0.000000 0.000000 3 3
24: A 0 0.000000 5.816900 3 3
25: A 0 0.000000 4.113642 NA 3
26: A 0 0.000000 4.668422 NA 3
group size int out g g2
For base R analogues, there is an SO Q&A on how to make rleid without data.table; shift can be constructed manually (it's just a lag operator); and there are other ways to find w (maybe tapply?).
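One possible base R version of the whole approach is sketched below. The helper names lag1 and run_id are hypothetical stand-ins for shift and rleid, and the tapply call is just one guess at how w could be found. It assumes the out column contains no NA values.

```r
# Rebuild the question's example data.
set.seed(12345)
df <- data.frame(
  group = rep("A", 26),
  size  = c(rep(1000, 5), rep(0, 3), rep(1000, 7), rep(0, 3), rep(1000, 5), rep(0, 3)),
  int   = c(rnorm(3, 5, 1), rep(0, 5), rnorm(3, 5, 1), rep(0, 7), rnorm(3, 5, 1), rep(0, 5)),
  out   = c(rep(0, 5), rnorm(3, 5, 1), rep(0, 7), rnorm(3, 5, 1), rep(0, 5), rnorm(3, 5, 1))
)

# Hypothetical stand-in for shift(): lag by one position, padding with NA.
lag1 <- function(x) c(NA, head(x, -1))

# Hypothetical stand-in for rleid(): counter over runs of equal values.
run_id <- function(x) with(rle(x), rep(seq_along(lengths), lengths))

# TRUE for rows where out is zero, plus the first non-zero row after them.
z <- df$out == 0 | lag1(df$out == 0)

# Run ids, blanked where z is FALSE, then renumbered consecutively.
g <- run_id(z) * NA^(!z)
g <- match(g, unique(na.omit(g)))

# First row of each non-NA group (tapply ignores NA groups), then a running count.
w  <- tapply(seq_along(g), g, min)
g2 <- cumsum(seq_along(g) %in% w)

df$id  <- g
df$id2 <- g2
```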
Here is an option using dplyr and the rleid function from the data.table package. df2 is the final output.
library(dplyr)
library(data.table)
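df2 <- df %>%
  # Flag rows where size is non-zero
  mutate(non_zero = ifelse(size != 0, 1, 0)) %>%
  # Number each run of consecutive equal flags
  mutate(runID = rleid(non_zero)) %>%
  # Pair each non-zero run with the zero run that follows it
  mutate(runID = ifelse(runID %% 2 != 0, (runID + 1)/2, runID/2)) %>%
  group_by(runID) %>%
  # Set id to NA for the last two rows of each paired group
  mutate(id = ifelse(row_number() %in% n():(n() - 1), NA, runID)) %>%
  ungroup() %>%
  select(group, size, int, out, id, id2 = runID)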
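The step that halves runID can be checked on its own with a short base R sketch. It assumes, as in the example data, that the runs strictly alternate between non-zero and zero size, starting with a non-zero run. Under that assumption the odd/even arithmetic merges runs 1 and 2, runs 3 and 4, and so on, which is the same as ceiling(runID / 2).

```r
# Run ids as rleid() would produce for alternating non-zero / zero runs.
runID <- 1:6

# Odd runs map to (runID + 1) / 2, even runs to runID / 2.
paired <- ifelse(runID %% 2 != 0, (runID + 1) / 2, runID / 2)
print(paired)

# Equivalent shorthand under the same alternation assumption.
paired_alt <- ceiling(runID / 2)
```

Note that the id column in the pipeline sets the last two rows of every paired group to NA, so it depends on each zero-size run being exactly three rows long, as it is in the example data.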