
Create sequential subgroup_ID within each group_ID depending on a column

I am struggling to find a solution to a very simple task that needs to run over 10 million records.

Assuming the following data set:

mydf <- structure(list(group_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 
4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 
7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 
9), element_index= c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L), value= c(8045762L, 259L, 155L, 167L, 
110L, 175L, 135L, 0L, 0L, 0L, 0L, 150L, 0L, 0L, 115L, 0L, 0L, 
396L, 11175L, 0L, 0L, 0L, 261L, 0L, 170L, 0L, 576L, 5807L, 0L, 
280L, 48663L, 0L, 0L, 497L, 7298L, 0L, 441L, 160725L, 0L, 0L, 
0L, 0L, 335L, 0L, 0L, 0L, 0L, 0L, 0L, 356L, 35462L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 265L, 0L, 0L, 360L, 780L, 0L, 0L, 0L, 371L, 48394L, 
0L, 0L, 0L, 341L, 0L, 0L, 386L)), .Names = c("group_ID", "element_index", 
"value"), class = "data.frame", row.names = c(NA, 75L))

Basically, the main concepts are:
1. the first element of each group_ID always has subgroup_ID == 1;
2. elements with value == 0 must not increase the subgroup_ID;
3. the subgroup_ID starts from 1 at the second element with value != 0 and increases by 1 each time there is another element with value != 0;
4. elements with value == 0 are associated with the next element with value != 0. In the picture, this means that elements 2 and 3 are assigned to the subgroup_ID of element 4.
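
One way to read the four rules above is that the ID steps up on the row right after each non-zero element, not counting the first element of the group. A minimal base R sketch of that reading follows; the helper name subgroup_from_rules is made up here for illustration and is not part of the original post. It handles the value vector of a single group.

```r
# Hypothetical helper (not from the original post): subgroup IDs for ONE group,
# given its `value` vector in element_index order.
subgroup_from_rules <- function(value) {
  counted <- value != 0                          # non-zero elements can close a subgroup
  counted[1] <- FALSE                            # rule 1: the first element is never counted
  after_counted <- c(FALSE, head(counted, -1))   # TRUE on the row right after a counted element
  cumsum(after_counted) + 1                      # IDs start at 1 and step up on those rows
}

# value column of group_ID == 2 from the question's data
v2 <- c(11175, 0, 0, 0, 261, 0, 170, 0, 576)
ids2 <- subgroup_from_rules(v2)
print(ids2)
```

For group_ID == 2 this reproduces the posted subgroup_ID values 1 1 1 1 1 2 2 3 3. Assuming the rows are already ordered by element_index within each group, it could be applied per group with something like ave(mydf$value, mydf$group_ID, FUN = subgroup_from_rules).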

The solution is the following:

subgroup_ID = c(1,1,2,3,4,5,6,7,7,7,7,7,8,8,8,9,9,9,1,1,1,1,1,2,2,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,1,1,2,2,2)
solution_df <- data.frame(mydf, subgroup_ID)

The objective of this question is to assign a subgroup_ID that divides each group into segments, where the rule for creating the subgroup_ID is the following:
- the subgroup_ID of the first element of each group_ID is always 1
- the subgroup_ID increases by 1 each time there is an element with value != 0

(Picture: example for group_ID == 1)

I hope the question is clear. Please do not hesitate to ask for clarification.

Here we are assuming that the rule for any group is to replace the second non-zero element of value with 0 and then form the result by starting with 1 and incrementing by 1 each time we encounter a subsequent non-zero.

Since, according to the comment, the first element of value in each group is always non-zero, we can find the second non-zero by temporarily replacing the first element with zero and then searching for the first non-zero in what is left.

No packages are used.

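# Compute the sub-group ids for the value vector x of a single group
Seq <- function(x) {
     # zero out the second non-zero element (the first element is masked while searching)
     x[head(which(replace(x, 1, 0) != 0), 1)] <- 0
     # running count of the remaining non-zero elements
     cumsum(x != 0)
}
# apply Seq within each group_ID and store the result as subid
transform(mydf, subid = ave(value, group_ID, FUN = Seq))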

giving the same answer as shown in the question (the output here is for the 12-row sample data defined further down):

   group_ID element_index value subid
1         1             1   123     1
2         1             2     0     1
3         1             3     0     1
4         1             4   456     1
5         1             5   214     2
6         2             1    20     1
7         2             2     0     1
8         2             3    30     1
9         3             1    10     1
10        3             2     0     1
11        3             3    10     1
12        3             4    20     2
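
The two lines inside Seq can be traced by hand on group_ID 1 of the table above (values 123, 0, 0, 456, 214). This is only a sketch; the intermediate names masked and second_nz are introduced here for illustration and do not appear in the answer.

```r
x <- c(123, 0, 0, 456, 214)               # value column for group_ID 1 in the table above

masked    <- replace(x, 1, 0)             # hide the first element: 0 0 0 456 214
second_nz <- head(which(masked != 0), 1)  # position of the second non-zero: 4
x[second_nz] <- 0                         # zero it out: 123 0 0 0 214
subid <- cumsum(x != 0)                   # count the remaining non-zeros: 1 1 1 1 2
print(subid)
```

If a group has no second non-zero element, which() returns an empty index, so the assignment changes nothing and every row of that group gets 1.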

You can also try a tidyverse solution:

library(tidyverse)
mydf %>% 
  group_by(group_ID) %>%
  mutate(value2=ifelse(row_number() == 1, 0, value)) %>% 
  mutate(subgroup_ID=lag(value2, default = 0) > 0) %>% 
  mutate(subgroup_ID=cumsum(subgroup_ID)+1) %>% 
  select(-value2)
# A tibble: 12 x 4
# Groups:   group_ID [3]
   group_ID element_index value subgroup_ID
      <dbl>         <dbl> <dbl>       <dbl>
 1        1             1   123           1
 2        1             2     0           1
 3        1             3     0           1
 4        1             4   456           1
 5        1             5   214           2
 6        2             1    20           1
 7        2             2     0           1
 8        2             3    30           1
 9        3             1    10           1
10        3             2     0           1
11        3             3    10           1
12        3             4    20           2

Sample data:

group_ID <- c(1,1,1,1,1,2,2,2,3,3,3,3)
element_index <- c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4)  # the elements are ordered within each group_ID
value <- c(123, 0, 0, 456, 214, 20, 0, 30, 10, 0, 10, 20)
mydf <- data.frame(group_ID, element_index, value)


library(dplyr)

mydf %>%
  group_by(group_ID) %>%
  mutate(v_upd = cumsum(ifelse(value * lag(value, default = 0) != 0, 1, 0)) + 1) %>%
  ungroup()

# # A tibble: 12 x 4
#   group_ID element_index value v_upd
#      <dbl>         <dbl> <dbl> <dbl>
# 1        1             1   123     1
# 2        1             2     0     1
# 3        1             3     0     1
# 4        1             4   456     1
# 5        1             5   214     2
# 6        2             1    20     1
# 7        2             2     0     1
# 8        2             3    30     1
# 9        3             1    10     1
# 10       3             2     0     1
# 11       3             3    10     1
# 12       3             4    20     2

To better understand the process, check this similar version, which stores each step as a variable:

mydf %>%
  group_by(group_ID) %>%                             # for each group ID
  mutate(lag1_value = lag(value, default = 0)) %>%   # get the previous value of "value"
  mutate(v = ifelse(value * lag1_value != 0, 1, 0),  # flag as 1 when both the current and the previous value are non-zero
         v_upd = cumsum(v)+1) %>%                    # get the cumulative sum of the flags and add 1
  ungroup()                                          # forget the grouping

# # A tibble: 12 x 6
#   group_ID element_index value lag1_value     v v_upd
#      <dbl>         <dbl> <dbl>      <dbl> <dbl> <dbl>
# 1        1             1   123          0     0     1
# 2        1             2     0        123     0     1
# 3        1             3     0          0     0     1
# 4        1             4   456          0     0     1
# 5        1             5   214        456     1     2
# 6        2             1    20          0     0     1
# 7        2             2     0         20     0     1
# 8        2             3    30          0     0     1
# 9        3             1    10          0     0     1
# 10       3             2     0         10     0     1
# 11       3             3    10          0     0     1
# 12       3             4    20         10     1     2
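
The product test value * lag1_value != 0 is true only when both factors are non-zero, so it can also be written as a logical AND. The base R sketch below checks that on group_ID 3 of the sample data; prev is a stand-in for lag(value, default = 0), and the variable names are chosen here for illustration.

```r
value <- c(10, 0, 10, 20)                            # group_ID 3 in the sample data
prev  <- c(0, head(value, -1))                       # base R stand-in for lag(value, default = 0)

flag_product <- as.integer(value * prev != 0)        # the product test used above
flag_logical <- as.integer(value != 0 & prev != 0)   # same condition without multiplying

v_upd <- cumsum(flag_logical) + 1                    # 1 1 1 2, as in the table above
print(v_upd)
```

The logical form may be the safer choice when value is stored as an integer column with large entries, because the product of two R integers that exceeds the integer range becomes NA (with a warning).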


 