How to create groups based on a condition

Question

I have this kind of data:

set.seed(12345)

df <- data.frame(group=rep(c("A"),26), size=c(rep(1000,5),rep(0,3),rep(1000,7),rep(0,3),rep(1000,5),rep(0,3)),
             int=c(rnorm(3,5,1),rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5)),
             out=c(rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5),rnorm(3,5,1)))

Here is the desired output:

   group size      int      out  id  id2
1      A 1000 5.585529 0.000000   1    1
2      A 1000 5.709466 0.000000   1    1
3      A 1000 4.890697 0.000000   1    1
4      A 1000 0.000000 0.000000   1    1
5      A 1000 0.000000 0.000000   1    1
6      A    0 0.000000 4.080678   1    1
7      A    0 0.000000 4.883752   NA   1
8      A    0 0.000000 6.817312   NA   1
9      A 1000 4.546503 0.000000   2    2
10     A 1000 5.605887 0.000000   2    2
11     A 1000 3.182044 0.000000   2    2
12     A 1000 0.000000 0.000000   2    2
13     A 1000 0.000000 0.000000   2    2
14     A 1000 0.000000 0.000000   2    2
15     A 1000 0.000000 0.000000   2    2
16     A    0 0.000000 5.370628   2    2
17     A    0 0.000000 5.520216   NA   2
18     A    0 0.000000 4.249468   NA   2
19     A 1000 5.630099 0.000000   3    3 
20     A 1000 4.723816 0.000000   3    3
21     A 1000 4.715840 0.000000   3    3
22     A 1000 0.000000 0.000000   3    3
23     A 1000 0.000000 0.000000   3    3
24     A    0 0.000000 5.816900   3    3
25     A    0 0.000000 4.113642   NA   3
26     A    0 0.000000 4.668422   NA   3

The new group ids (id and id2) are created based on the data above. I believe the rle function is the way to go, but I cannot work out the full solution.

Answer 1

A variation on @ycw's answer:

library(data.table)
setDT(df)

df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]

    group size      int      out  g
 1:     A 1000 5.585529 0.000000  1
 2:     A 1000 5.709466 0.000000  1
 3:     A 1000 4.890697 0.000000  1
 4:     A 1000 0.000000 0.000000  1
 5:     A 1000 0.000000 0.000000  1
 6:     A    0 0.000000 4.080678  1
 7:     A    0 0.000000 4.883752 NA
 8:     A    0 0.000000 6.817312 NA
 9:     A 2000 4.546503 0.000000  3
10:     A 2000 5.605887 0.000000  3
11:     A 2000 3.182044 0.000000  3
12:     A 2000 0.000000 0.000000  3
13:     A 2000 0.000000 0.000000  3
14:     A 2000 0.000000 0.000000  3
15:     A 2000 0.000000 0.000000  3
16:     A    0 0.000000 5.370628  3
17:     A    0 0.000000 5.520216 NA
18:     A    0 0.000000 4.249468 NA
19:     A 5000 5.630099 0.000000  5
20:     A 5000 4.723816 0.000000  5
21:     A 5000 4.715840 0.000000  5
22:     A 5000 0.000000 0.000000  5
23:     A 5000 0.000000 0.000000  5
24:     A    0 0.000000 5.816900  5
25:     A    0 0.000000 4.113642 NA
26:     A    0 0.000000 4.668422 NA
    group size      int      out  g

(@ycw suggested I make it a separate answer. Also, the NA^x trick is borrowed from @akrun.)
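The one-liner packs several steps into a single expression. A minimal sketch of the same steps, one at a time, on a toy vector may make it easier to follow. The values of out and the intermediate names (is_zero, prev_zero, run, mask) are made up for illustration and are not part of the original answer.

```r
library(data.table)

# Toy stand-in for df$out (hypothetical values): zero runs followed by non-zero runs
out <- c(0, 0, 0, 4.1, 4.9, 6.8, 0, 0, 5.4, 5.5, 4.2)

is_zero   <- out == 0        # TRUE where out is zero
prev_zero <- shift(is_zero)  # same flag lagged by one row; the first element is NA
z <- is_zero | prev_zero     # zero run plus the first non-zero row after it
                             # (TRUE | NA is TRUE, so row 1 is kept)
run  <- rleid(z)             # run counter over z: 1 1 1 1 2 2 3 3 3 4 4
mask <- NA^(!z)              # NA^0 is 1 and NA^1 is NA: 1 where z is TRUE, NA elsewhere
g <- run * mask
g
# 1 1 1 1 NA NA 3 3 3 NA NA
```

The kept runs carry ids 1, 3, 5, ... because the masked-out runs in between still consume a run number.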

To get the OP's group numbers (1, 2, 3 rather than 1, 3, 5), this extra step works:

df[, g := match(g, unique(na.omit(g)))]
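As a small illustration of why match renumbers the groups, here is the same call on a plain vector. The vector g below is a made-up example and ids is a hypothetical intermediate name.

```r
# Hypothetical g: kept runs carry odd run ids, dropped rows are NA
g <- c(1, 1, NA, 3, 3, NA, 5, 5, NA)

ids   <- unique(na.omit(g))  # distinct non-NA ids in order of appearance: 1 3 5
g_new <- match(g, ids)       # position of each value in ids; NA has no match and stays NA
g_new
# 1 1 NA 2 2 NA 3 3 NA
```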

For the extension the OP added ("id2"):

w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]

So in the end we have...

    group size      int      out  g g2
 1:     A 1000 5.585529 0.000000  1  1
 2:     A 1000 5.709466 0.000000  1  1
 3:     A 1000 4.890697 0.000000  1  1
 4:     A 1000 0.000000 0.000000  1  1
 5:     A 1000 0.000000 0.000000  1  1
 6:     A    0 0.000000 4.080678  1  1
 7:     A    0 0.000000 4.883752 NA  1
 8:     A    0 0.000000 6.817312 NA  1
 9:     A 2000 4.546503 0.000000  2  2
10:     A 2000 5.605887 0.000000  2  2
11:     A 2000 3.182044 0.000000  2  2
12:     A 2000 0.000000 0.000000  2  2
13:     A 2000 0.000000 0.000000  2  2
14:     A 2000 0.000000 0.000000  2  2
15:     A 2000 0.000000 0.000000  2  2
16:     A    0 0.000000 5.370628  2  2
17:     A    0 0.000000 5.520216 NA  2
18:     A    0 0.000000 4.249468 NA  2
19:     A 5000 5.630099 0.000000  3  3
20:     A 5000 4.723816 0.000000  3  3
21:     A 5000 4.715840 0.000000  3  3
22:     A 5000 0.000000 0.000000  3  3
23:     A 5000 0.000000 0.000000  3  3
24:     A    0 0.000000 5.816900  3  3
25:     A    0 0.000000 4.113642 NA  3
26:     A    0 0.000000 4.668422 NA  3
    group size      int      out  g g2

For base R analogues, there is an SO Q&A on how to make rleid without data.table; shift can be constructed manually (it's just a lag operator); and there are other ways to find w (maybe tapply?).
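A rough base R sketch of those three pieces, under some assumptions: rleid_base and lag1 are hypothetical helper names, the rle-based construction is one common way to imitate rleid (not necessarily the one in the linked Q&A), and w is found with match rather than tapply. The toy out vector is made up and starts with a zero, as the OP's data does.

```r
# Toy stand-in for df$out (hypothetical values)
out <- c(0, 0, 0, 4.1, 4.9, 6.8, 0, 0, 5.4, 5.5, 4.2)

# rleid analogue: expand the run lengths reported by rle() into run numbers
rleid_base <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# shift analogue: lag by one position, padding the front with NA
lag1 <- function(x) c(NA, head(x, -1))

z <- out == 0 | lag1(out == 0)
g <- rleid_base(z) * NA^(!z)
g <- match(g, unique(na.omit(g)))   # renumber kept runs as 1, 2, ...

# w analogue: index of the first row of each non-NA group
w  <- match(unique(na.omit(g)), g)
g2 <- cumsum(seq_along(g) %in% w)   # carry the group number through the NA rows

g   # 1 1 1 1 NA NA 2 2 2 NA NA
g2  # 1 1 1 1 1 1 2 2 2 2 2
```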

Answer 2

Here is an option using dplyr and the rleid function from the data.table package. df2 is the final output.

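library(dplyr)
library(data.table)  # for rleid()

df2 <- df %>%
  # flag rows with a non-zero size
  mutate(non_zero = ifelse(size != 0, 1, 0)) %>%
  # number consecutive runs of the flag: 1, 2, 3, ...
  mutate(runID = rleid(non_zero)) %>%
  # pair each non-zero run with the zero run after it: (1, 2) -> 1, (3, 4) -> 2, ...
  mutate(runID = ifelse(runID %% 2 != 0, (runID + 1)/2, runID/2)) %>%
  group_by(runID) %>%
  # id is NA on the last two rows of each group
  mutate(id = ifelse(row_number() %in% n():(n() - 1), NA, runID)) %>%
  ungroup() %>%
  select(group, size, int, out, id, id2 = runID)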
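The odd/even arithmetic on runID maps run numbers 1, 2 to group 1, run numbers 3, 4 to group 2, and so on. This relies on the runs alternating and on the data starting with a non-zero size run, as it does here. As a side note, the same mapping can be written with ceiling; the small check below uses a made-up runID vector, and paired and compact are hypothetical names.

```r
runID <- 1:6  # run numbers for alternating non-zero / zero runs

paired  <- ifelse(runID %% 2 != 0, (runID + 1) / 2, runID / 2)  # as in the pipeline above
compact <- ceiling(runID / 2)                                   # same mapping in one call

paired   # 1 1 2 2 3 3
compact  # 1 1 2 2 3 3
```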

Answer 1: score 3, accepted, 2017-08-21 18:48:35
Answer 2: score 2, 2017-08-21 18:29:02