I am struggling to find a solution to a very simple task that needs to run over 10 million records.
Assume the following data set:
mydf <- structure(list(group_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9,
9), element_index= c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L), value= c(8045762L, 259L, 155L, 167L,
110L, 175L, 135L, 0L, 0L, 0L, 0L, 150L, 0L, 0L, 115L, 0L, 0L,
396L, 11175L, 0L, 0L, 0L, 261L, 0L, 170L, 0L, 576L, 5807L, 0L,
280L, 48663L, 0L, 0L, 497L, 7298L, 0L, 441L, 160725L, 0L, 0L,
0L, 0L, 335L, 0L, 0L, 0L, 0L, 0L, 0L, 356L, 35462L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 265L, 0L, 0L, 360L, 780L, 0L, 0L, 0L, 371L, 48394L,
0L, 0L, 0L, 341L, 0L, 0L, 386L)), .Names = c("group_ID", "element_index",
"value"), class = "data.frame", row.names = c(NA, 75L))
Basically, the main concepts are:
1. the first element of each group_ID always belongs to subgroup_ID == 1;
2. elements with value == 0 must not be counted when increasing the subgroup_ID;
3. the subgroup_ID starts from 1 at the second element with value != 0 and increases by 1 each time there is another value != 0;
4. elements with value == 0 are associated with the next element with value != 0. In the picture, this means that elements 2 and 3 are assigned to the subgroup_ID of element 4.
The expected result is the following:
subgroup_ID = c(1,1,2,3,4,5,6,7,7,7,7,7,8,8,8,9,9,9,1,1,1,1,1,2,2,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,1,1,2,2,2)
solution_df <- data.frame(mydf, subgroup_ID)
The objective of this question is to assign a subgroup_ID that divides each group into segments, where the rule to create the subgroup_ID is the following:
- the first element of each group_ID is always 1
- the subgroup_ID increases by 1 each time there is an element with value != 0
I hope the question is clear; please do not hesitate to ask for clarification.
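The numbered rules above can be made concrete with a plain reference loop. This is only a sketch for checking small inputs, not a fast solution for 10 million rows. The helper name subgroup_ref is hypothetical, and the two groups are copied from group_ID 1 and 2 of the data set above.

```r
# Hypothetical helper: walk one group in order and label each element.
subgroup_ref <- function(value) {
  out <- integer(length(value))
  current <- 1L
  for (i in seq_along(value)) {
    out[i] <- current
    # A non-zero element (other than the first) closes the current segment,
    # so the element after it starts the next subgroup.
    if (i > 1 && value[i] != 0) current <- current + 1L
  }
  out
}

# Rows for group_ID 1 and 2, copied from the question's data.
group_ID <- c(rep(1, 18), rep(2, 9))
value <- c(8045762, 259, 155, 167, 110, 175, 135, 0, 0, 0, 0, 150, 0, 0, 115, 0, 0, 396,
           11175, 0, 0, 0, 261, 0, 170, 0, 576)

# Apply the helper within each group.
subgroup_ID <- ave(value, group_ID, FUN = subgroup_ref)
print(subgroup_ID)
```

For these two groups the result should agree with the first 27 entries of the expected subgroup_ID vector, which makes the loop a convenient yardstick for any vectorised version.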
Here we assume that the rule for any group is to replace the second non-zero element of value with 0, and then to form the result by starting at 1 and incrementing by 1 each time we encounter a subsequent non-zero element.
Since the first element of value in each group is always non-zero according to the comment, we can find the second non-zero element by temporarily replacing the first element with zero and then searching for the first non-zero element in what is left.
No packages are used.
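Seq <- function(x) {
  # zero out the second non-zero element (found by masking the first element)
  x[head(which(replace(x, 1, 0) != 0), 1)] <- 0
  # running count of the remaining non-zero elements
  cumsum(x != 0)
}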
transform(mydf, subid = ave(value, group_ID, FUN = Seq))
giving the following on the 12-row example data:
   group_ID element_index value subid
1         1             1   123     1
2         1             2     0     1
3         1             3     0     1
4         1             4   456     1
5         1             5   214     2
6         2             1    20     1
7         2             2     0     1
8         2             3    30     1
9         3             1    10     1
10        3             2     0     1
11        3             3    10     1
12        3             4    20     2
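To make the one-liner inside Seq easier to follow, here is a minimal sketch that unrolls it on a single group (the value column for group_ID 1 in the 12-row example). The intermediate names masked, second_nz and y are introduced only for illustration.

```r
# One group from the 12-row example: value for group_ID == 1.
x <- c(123, 0, 0, 456, 214)

# Step 1: hide the first element so the search skips it.
masked <- replace(x, 1, 0)

# Step 2: position of the first remaining non-zero,
# i.e. the second non-zero element of x.
second_nz <- head(which(masked != 0), 1)

# Step 3: zero out that element in a copy of x.
y <- x
y[second_nz] <- 0

# Step 4: running count of the non-zero elements that are left.
subid <- cumsum(y != 0)
print(subid)
```

If a group has no second non-zero element, which() returns an empty vector and the assignment in step 3 changes nothing, so the whole group stays in subgroup 1.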
You can also try a tidyverse solution:
library(tidyverse)
mydf %>%
  group_by(group_ID) %>%
  mutate(value2=ifelse(row_number() == 1, 0, value)) %>%
  mutate(subgroup_ID=lag(value2, default = 0) > 0) %>%
  mutate(subgroup_ID=cumsum(subgroup_ID)+1) %>%
  select(-value2)
# A tibble: 12 x 4
# Groups:   group_ID [3]
   group_ID element_index value subgroup_ID
      <dbl>         <dbl> <dbl>       <dbl>
 1        1             1   123           1
 2        1             2     0           1
 3        1             3     0           1
 4        1             4   456           1
 5        1             5   214           2
 6        2             1    20           1
 7        2             2     0           1
 8        2             3    30           1
 9        3             1    10           1
10        3             2     0           1
11        3             3    10           1
12        3             4    20           2
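The lag trick in the pipeline can be traced in base R on one group. The sketch below is not the dplyr code itself: it imitates lag(value2, default = 0) with c(0, head(value2, -1)), and the name lagged is introduced only for illustration. The input is the value column for group_ID 1 of the 75-row data in the question, and the check against the expected result assumes, as in that data, that values are never negative (the test is > 0, not != 0).

```r
# value for group_ID == 1 in the 75-row data from the question.
value <- c(8045762, 259, 155, 167, 110, 175, 135,
           0, 0, 0, 0, 150, 0, 0, 115, 0, 0, 396)

# Ignore the first element of the group, as the pipeline does.
value2 <- replace(value, 1, 0)

# Base-R stand-in for lag(value2, default = 0): shift right by one.
lagged <- c(0, head(value2, -1))

# A new subgroup starts right after each counted non-zero element.
subgroup_ID <- cumsum(lagged > 0) + 1

# First 18 entries of the expected subgroup_ID given in the question.
expected <- c(1, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9)
print(all(subgroup_ID == expected))
```

Shifting by one is what assigns a run of zeros to the same subgroup as the non-zero element that follows it.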
group_ID <- c(1,1,1,1,1,2,2,2,3,3,3,3)
element_index <- c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4) # the elements are ordered within each group_ID
value <- c(123, 0, 0, 456, 214, 20, 0, 30, 10, 0, 10, 20)
mydf <- data.frame(group_ID, element_index, value)
library(dplyr)
mydf %>%
  group_by(group_ID) %>%
  mutate(v_upd = cumsum(ifelse(value * lag(value, default = 0) != 0, 1, 0)) + 1) %>%
  ungroup()
# # A tibble: 12 x 4
#    group_ID element_index value v_upd
#       <dbl>         <dbl> <dbl> <dbl>
#  1        1             1   123     1
#  2        1             2     0     1
#  3        1             3     0     1
#  4        1             4   456     1
#  5        1             5   214     2
#  6        2             1    20     1
#  7        2             2     0     1
#  8        2             3    30     1
#  9        3             1    10     1
# 10        3             2     0     1
# 11        3             3    10     1
# 12        3             4    20     2
To better understand the process, check this similar version, which stores each step as a variable:
mydf %>%
  group_by(group_ID) %>%                              # for each group ID
  mutate(lag1_value = lag(value, default = 0)) %>%    # get the previous value of "value"
  mutate(v = ifelse(value * lag1_value != 0, 1, 0),   # flag as 1 when both the current and the previous value are non-zero
         v_upd = cumsum(v)+1) %>%                     # get the cumulative sum of the flags and add 1
  ungroup()                                           # drop the grouping
# # A tibble: 12 x 6
#    group_ID element_index value lag1_value     v v_upd
#       <dbl>         <dbl> <dbl>      <dbl> <dbl> <dbl>
#  1        1             1   123          0     0     1
#  2        1             2     0        123     0     1
#  3        1             3     0          0     0     1
#  4        1             4   456          0     0     1
#  5        1             5   214        456     1     2
#  6        2             1    20          0     0     1
#  7        2             2     0         20     0     1
#  8        2             3    30          0     0     1
#  9        3             1    10          0     0     1
# 10        3             2     0         10     0     1
# 11        3             3    10          0     0     1
# 12        3             4    20         10     1     2