How to create groups based on conditions

I have this kind of data:

set.seed(12345)

df <- data.frame(
  group = rep("A", 26),
  size  = c(rep(1000, 5), rep(0, 3), rep(1000, 7), rep(0, 3), rep(1000, 5), rep(0, 3)),
  int   = c(rnorm(3, 5, 1), rep(0, 5), rnorm(3, 5, 1), rep(0, 7), rnorm(3, 5, 1), rep(0, 5)),
  out   = c(rep(0, 5), rnorm(3, 5, 1), rep(0, 7), rnorm(3, 5, 1), rep(0, 5), rnorm(3, 5, 1))
)

Here is the desired output:

   group size      int      out  id  id2
1      A 1000 5.585529 0.000000   1    1
2      A 1000 5.709466 0.000000   1    1
3      A 1000 4.890697 0.000000   1    1
4      A 1000 0.000000 0.000000   1    1
5      A 1000 0.000000 0.000000   1    1
6      A    0 0.000000 4.080678   1    1
7      A    0 0.000000 4.883752   NA   1
8      A    0 0.000000 6.817312   NA   1
9      A 1000 4.546503 0.000000   2    2
10     A 1000 5.605887 0.000000   2    2
11     A 1000 3.182044 0.000000   2    2
12     A 1000 0.000000 0.000000   2    2
13     A 1000 0.000000 0.000000   2    2
14     A 1000 0.000000 0.000000   2    2
15     A 1000 0.000000 0.000000   2    2
16     A    0 0.000000 5.370628   2    2
17     A    0 0.000000 5.520216   NA   2
18     A    0 0.000000 4.249468   NA   2
19     A 1000 5.630099 0.000000   3    3 
20     A 1000 4.723816 0.000000   3    3
21     A 1000 4.715840 0.000000   3    3
22     A 1000 0.000000 0.000000   3    3
23     A 1000 0.000000 0.000000   3    3
24     A    0 0.000000 5.816900   3    3
25     A    0 0.000000 4.113642   NA   3
26     A    0 0.000000 4.668422   NA   3

The new group id is created based on the data above. I believe the rle function is the way to go, but I could not work it out to the end.

A variation on @ycw's answer:

library(data.table)
setDT(df)

df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]

    group size      int      out  g
 1:     A 1000 5.585529 0.000000  1
 2:     A 1000 5.709466 0.000000  1
 3:     A 1000 4.890697 0.000000  1
 4:     A 1000 0.000000 0.000000  1
 5:     A 1000 0.000000 0.000000  1
 6:     A    0 0.000000 4.080678  1
 7:     A    0 0.000000 4.883752 NA
 8:     A    0 0.000000 6.817312 NA
 9:     A 2000 4.546503 0.000000  3
10:     A 2000 5.605887 0.000000  3
11:     A 2000 3.182044 0.000000  3
12:     A 2000 0.000000 0.000000  3
13:     A 2000 0.000000 0.000000  3
14:     A 2000 0.000000 0.000000  3
15:     A 2000 0.000000 0.000000  3
16:     A    0 0.000000 5.370628  3
17:     A    0 0.000000 5.520216 NA
18:     A    0 0.000000 4.249468 NA
19:     A 5000 5.630099 0.000000  5
20:     A 5000 4.723816 0.000000  5
21:     A 5000 4.715840 0.000000  5
22:     A 5000 0.000000 0.000000  5
23:     A 5000 0.000000 0.000000  5
24:     A    0 0.000000 5.816900  5
25:     A    0 0.000000 4.113642 NA
26:     A    0 0.000000 4.668422 NA
    group size      int      out  g

(@ycw suggested I make it a separate answer. Also, the NA^x trick is borrowed from @akrun.)
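To illustrate the NA^x trick: in R, NA^0 evaluates to 1 while NA^1 stays NA, so NA^(!z) acts as a mask that keeps values where z is TRUE and blanks them where z is FALSE. A minimal sketch on toy vectors (z and run_id here are made-up values, not the question's data):

```r
# Toy condition and toy run ids (illustrative values only)
z      <- c(TRUE, TRUE, FALSE, TRUE)
run_id <- c(1, 1, 2, 3)

# !z is coerced to 0/1 by ^, giving NA^0 = 1 where z is TRUE
# and NA^1 = NA where z is FALSE
mask <- NA^(!z)

# Multiplying by the mask keeps ids where z is TRUE and sets the rest to NA
masked_id <- run_id * mask
print(masked_id)
```

An equivalent and perhaps more readable spelling is ifelse(z, run_id, NA); the multiplication form is just more compact inside a data.table expression.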

For the OP's group numbers, this extra step works:

df[, g := match(g, unique(na.omit(g)))]

For the extension the OP added ("id2"):

w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]

So in the end we have...

    group size      int      out  g g2
 1:     A 1000 5.585529 0.000000  1  1
 2:     A 1000 5.709466 0.000000  1  1
 3:     A 1000 4.890697 0.000000  1  1
 4:     A 1000 0.000000 0.000000  1  1
 5:     A 1000 0.000000 0.000000  1  1
 6:     A    0 0.000000 4.080678  1  1
 7:     A    0 0.000000 4.883752 NA  1
 8:     A    0 0.000000 6.817312 NA  1
 9:     A 2000 4.546503 0.000000  2  2
10:     A 2000 5.605887 0.000000  2  2
11:     A 2000 3.182044 0.000000  2  2
12:     A 2000 0.000000 0.000000  2  2
13:     A 2000 0.000000 0.000000  2  2
14:     A 2000 0.000000 0.000000  2  2
15:     A 2000 0.000000 0.000000  2  2
16:     A    0 0.000000 5.370628  2  2
17:     A    0 0.000000 5.520216 NA  2
18:     A    0 0.000000 4.249468 NA  2
19:     A 5000 5.630099 0.000000  3  3
20:     A 5000 4.723816 0.000000  3  3
21:     A 5000 4.715840 0.000000  3  3
22:     A 5000 0.000000 0.000000  3  3
23:     A 5000 0.000000 0.000000  3  3
24:     A    0 0.000000 5.816900  3  3
25:     A    0 0.000000 4.113642 NA  3
26:     A    0 0.000000 4.668422 NA  3
    group size      int      out  g g2
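The renumbering and "id2" steps may be easier to follow on a small vector. The sketch below uses a made-up g (not the question's data) and plain base R calls in place of the data.table join, so it only mirrors the idea: match() against the unique non-NA values renumbers the groups consecutively, and a cumulative sum over "is this row the first of a group" fills the NA rows with the id of the group before them.

```r
# Toy run ids with gaps, as produced by the rleid step (illustrative values)
g <- c(1, 1, NA, 3, 3, 3, NA, 5, NA)

# Renumber 1, 3, 5 as 1, 2, 3; NA stays NA because it has no match
ids <- unique(na.omit(g))
g_new <- match(g, ids)

# Position of the first row of each group
w <- match(ids, g)

# Count how many group starts have been seen so far
g2 <- cumsum(seq_along(g) %in% w)

print(data.frame(g, g_new, g2))
```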

For base R analogues, there is an SO Q&A on how to make rleid without data.table; shift can be constructed manually (it's just a lag operator); and there are other ways to find w (maybe tapply?).
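One way those base R pieces could fit together is sketched below. This is an assumption-laden sketch rather than a tested drop-in replacement: lag1 and run_id are hypothetical helper names standing in for shift and rleid, and tapply is used to find the first row of each group.

```r
# Same example data as in the question
set.seed(12345)
df <- data.frame(
  group = rep("A", 26),
  size  = c(rep(1000, 5), rep(0, 3), rep(1000, 7), rep(0, 3), rep(1000, 5), rep(0, 3)),
  int   = c(rnorm(3, 5, 1), rep(0, 5), rnorm(3, 5, 1), rep(0, 7), rnorm(3, 5, 1), rep(0, 5)),
  out   = c(rep(0, 5), rnorm(3, 5, 1), rep(0, 7), rnorm(3, 5, 1), rep(0, 5), rnorm(3, 5, 1))
)

# Hypothetical helper: lag by one position, padding with NA (stand-in for shift)
lag1 <- function(x) c(NA, x[-length(x)])

# Hypothetical helper: run-length ids built from rle() (stand-in for rleid)
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# TRUE when out is zero in this row or in the previous row
z <- df$out == 0 | lag1(df$out == 0)

# Run ids where z is TRUE, NA elsewhere, then renumbered consecutively
g <- ifelse(z, run_id(z), NA)
df$id <- match(g, unique(na.omit(g)))

# First row of each non-NA group; tapply drops the NA ids
w <- tapply(seq_along(df$id), df$id, min)

# Count group starts seen so far, which carries each id over the NA rows
df$id2 <- cumsum(seq_len(nrow(df)) %in% w)

print(df)
```

Note that rle() treats every NA as its own run, so run_id only behaves like rleid here because z contains no NA values (TRUE | NA is TRUE for the first row).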

Here is an option using dplyr and the rleid function from the data.table package. df2 is the final output.

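library(dplyr)
library(data.table)  # only needed for rleid()

df2 <- df %>%
  # flag rows with a non-zero size
  mutate(non_zero = ifelse(size != 0, 1, 0)) %>%
  # number the consecutive runs of the flag: 1, 2, 3, ...
  mutate(runID = rleid(non_zero)) %>%
  # merge each non-zero run with the zero run that follows it
  mutate(runID = ifelse(runID %% 2 != 0, (runID + 1)/2, runID/2)) %>%
  group_by(runID) %>%
  # set id to NA for the last two rows of every merged group
  mutate(id = ifelse(row_number() %in% n():(n() - 1), NA, runID)) %>%
  ungroup() %>%
  select(group, size, int, out, id, id2 = runID)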
