简体   繁体   中英

Repeating variable in group by category in R

I have data for which I would like to create a new variable: flag
Data is set in a longitudinal format with repetition of id and have associated dates.

The other two important variables are category and company .
Category: for each id there will be at least one category "a" and "b" , but most of the times there will be multiple "a" and "b". Company: There could be multiple company for the same ids. Sometime category "b" would have the same company as category "a" for a particular id. Here for ease I have included only three company as x, y, z.

I want to create a flag. So that when group_by id

  1. if there is at least one instance of same company launching product in category "b" and "a". Then flag the "a" with same product as "rp" (repeating product)*.
  2. If not, than flag all the corresponding "a" as "nr" (no repeating product in b).
  3. For the "b" if there is a corresponding "rp". I want to sequence all the "b" with same company as "a" based on date such as p1, p2, p3,... (if the date of product for same then it could be p1, p1, p2,..), and for the remaining "b" with no same company as "p0"
  4. For the "b" with corresponding "a" as "nr" we can again call them as "p0"

Below is the data frame with the flag variable(expected output)

id<- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,5,5)
date<- as.Date(c("2001-01-04", "2007-09-23", "2008-11-14",
                 "2009-11-13", "2012-07-21", "2014-09-15",
                 "2000-04-01", "2008-07-14", "2008-07-14", 
                 "2001-03-21", "2019-05-23", "2019-05-08", 
                 "2004-07-06", "2007-08-12", "2011-09-20", 
                 "2011-09-20", "2014-08-15", "2014-08-15"))
category<- c("a", "b", "b", "a", "b", "b", "a", "b", "b",
           "a", "b", "b", "a", "a", "b", "b", "b", "b")
company<-c("x", "x", "x", "x", "y", "y", "x", "x", "x",
           "x", "y", "z", "x", "x", "x", "x",  "x", "y")
flag<-c ("rp","p1", "p2", "nr", "p0", "p0", "rp", "p1",
         "p1", "nr", "p0", "p0", "rp", "rp", "p1", "p1", 
         "p2", "p0")
dfx <- data.frame(id, date, category, company, flag)

One possible approach with tidyverse , if I understand the logic correctly. After grouping by both id and company , you can see if both categories "a" and "b" are present; if so, mark those rows where category is "a" with "rp".

A more convoluted case_when can consider your different rules, but leave as missing NA situations where you need "p" with a sequence of numbers. A temporary column including a counter can be made based on these missing values to give you "p1", "p2", etc.

library(tidyverse)

dfx %>%
  group_by(id, company) %>%
  mutate(new_flag = case_when(
    all(c("a", "b") %in% category) & category == "a" ~ "rp",
    category == "a" ~ "nr",
    TRUE ~ NA_character_)) %>%
  group_by(id) %>%
  mutate(new_flag = case_when(
    category == "b" & new_flag[category == "a"][1] == "nr" ~ "p0", 
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company == company[category == "a"][1] ~ NA_character_,
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company != company[category == "a"][1] ~ "p0",
    TRUE ~ new_flag)) %>%
  group_by(id, company) %>%
  mutate(ctr = cumsum(is.na(new_flag) & date != lag(date, default = first(date[is.na(new_flag)])))) %>%
  mutate(new_flag = ifelse(is.na(new_flag), paste0("p", ctr), new_flag)) %>%
  select(-ctr)

Output

      id date       category company flag  new_flag
   <dbl> <date>     <chr>    <chr>   <chr> <chr>   
 1     1 2001-01-04 a        x       rp    rp      
 2     1 2007-09-23 b        x       p1    p1      
 3     1 2008-11-14 b        x       p2    p2      
 4     2 2009-11-13 a        x       nr    nr      
 5     2 2012-07-21 b        y       p0    p0      
 6     2 2014-09-15 b        y       p0    p0      
 7     3 2000-04-01 a        x       rp    rp      
 8     3 2008-07-14 b        x       p1    p1      
 9     3 2008-07-14 b        x       p1    p1      
10     4 2001-03-21 a        x       nr    nr      
11     4 2019-05-23 b        y       p0    p0      
12     4 2019-05-08 b        z       p0    p0      
13     5 2004-07-06 a        x       rp    rp      
14     5 2007-08-12 a        x       rp    rp      
15     5 2011-09-20 b        x       p1    p1      
16     5 2011-09-20 b        x       p1    p1      
17     5 2014-08-15 b        x       p2    p2      
18     5 2014-08-15 b        y       p0    p0 

The key is to write a function to correctly flag the categories based on your conditions. For each group of id and company , your conditions simplify to three mutually exclusive ones:

  • The company has both a and b ; code all a s "rp" and b s "p1-pn" in chronological order.
  • The company only has a ; code all a s "np".
  • The company only has b ; code all b s "p0".

Hence, consider the following function

flag_category <- function(x, date) {
  out <- character(length(x))
  a <- which(x == "a")
  b <- which(x == "b")
  if (length(a) > 0L && length(b) > 0L) {
    out[a] <- "rp"
    dateb <- date[b]    # get the date where category is "b"
    udateb <- unique(dateb)   # get the unique dates
    out[b] <- paste0("p", rank(udateb)[match(dateb, udateb)])    # `rank` finds the order for each unique date; use `match` to get the positions in `dateb` to which those ranks belong
    return(out)
  }
  if (length(a) > 0L) {
    out[] <- "nr"
    return(out)
  }
  out[] <- "p0"
  out
}

Then you can just apply it to each group of id and company .

dfx %>% group_by(id, company) %>% mutate(flag2 = flag_category(category, date)) 

Output

# A tibble: 18 x 6
# Groups:   id, company [9]
      id date       category company flag  flag2
   <dbl> <date>     <chr>    <chr>   <chr> <chr>
 1     1 2001-01-04 a        x       rp    rp   
 2     1 2007-09-23 b        x       p1    p1   
 3     1 2008-11-14 b        x       p2    p2   
 4     2 2009-11-13 a        x       nr    nr   
 5     2 2012-07-21 b        y       p0    p0   
 6     2 2014-09-15 b        y       p0    p0   
 7     3 2000-04-01 a        x       rp    rp   
 8     3 2008-07-14 b        x       p1    p1   
 9     3 2008-07-14 b        x       p1    p1   
10     4 2001-03-21 a        x       nr    nr   
11     4 2019-05-23 b        y       p0    p0   
12     4 2019-05-08 b        z       p0    p0   
13     5 2004-07-06 a        x       rp    rp   
14     5 2007-08-12 a        x       rp    rp   
15     5 2011-09-20 b        x       p1    p1   
16     5 2011-09-20 b        x       p1    p1   
17     5 2014-08-15 b        x       p2    p2   
18     5 2014-08-15 b        y       p0    p0 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM