
Repeating variable in group by category in R

I have data for which I would like to create a new variable, flag.
The data is in a longitudinal format: each id repeats, and every row has an associated date.

The other two important variables are category and company.

Category: each id has at least one row of category "a" and one of category "b", but most of the time there are several of each.

Company: the same id can have multiple companies. Sometimes a category "b" row has the same company as a category "a" row for a particular id. For simplicity I have included only three companies here: x, y and z.

I want to create a flag so that, after grouping by id:

  1. If there is at least one instance of the same company launching a product in both category "b" and category "a", flag the "a" rows of that company as "rp" (repeating product).
  2. If not, then flag all the corresponding "a" rows as "nr" (no repeating product in b).
  3. For the "b" rows with a corresponding "rp": number all "b" rows that share the company of that "a" by date, as p1, p2, p3, ... (rows with the same date share a number, so the sequence can be p1, p1, p2, ...), and flag the remaining "b" rows, whose company does not match, as "p0".
  4. For the "b" rows whose corresponding "a" is "nr", again use "p0".

Below is the data frame with the flag variable (expected output):

id<- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,5,5)
date<- as.Date(c("2001-01-04", "2007-09-23", "2008-11-14",
                 "2009-11-13", "2012-07-21", "2014-09-15",
                 "2000-04-01", "2008-07-14", "2008-07-14", 
                 "2001-03-21", "2019-05-23", "2019-05-08", 
                 "2004-07-06", "2007-08-12", "2011-09-20", 
                 "2011-09-20", "2014-08-15", "2014-08-15"))
category<- c("a", "b", "b", "a", "b", "b", "a", "b", "b",
           "a", "b", "b", "a", "a", "b", "b", "b", "b")
company<-c("x", "x", "x", "x", "y", "y", "x", "x", "x",
           "x", "y", "z", "x", "x", "x", "x",  "x", "y")
flag<-c ("rp","p1", "p2", "nr", "p0", "p0", "rp", "p1",
         "p1", "nr", "p0", "p0", "rp", "rp", "p1", "p1", 
         "p2", "p0")
dfx <- data.frame(id, date, category, company, flag)

One possible approach with tidyverse, if I understand the logic correctly. After grouping by both id and company, you can check whether both categories "a" and "b" are present; if so, mark the rows where category is "a" with "rp".

A more convoluted case_when can then cover your other rules, leaving NA wherever a "p" followed by a sequence number is needed. A temporary counter column built from those missing values then gives you "p1", "p2", etc.

library(tidyverse)

dfx %>%
  group_by(id, company) %>%
  mutate(new_flag = case_when(
    all(c("a", "b") %in% category) & category == "a" ~ "rp",
    category == "a" ~ "nr",
    TRUE ~ NA_character_)) %>%
  group_by(id) %>%
  mutate(new_flag = case_when(
    category == "b" & new_flag[category == "a"][1] == "nr" ~ "p0", 
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company == company[category == "a"][1] ~ NA_character_,
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company != company[category == "a"][1] ~ "p0",
    TRUE ~ new_flag)) %>%
  group_by(id, company) %>%
  mutate(ctr = cumsum(is.na(new_flag) & date != lag(date, default = first(date[is.na(new_flag)])))) %>%
  mutate(new_flag = ifelse(is.na(new_flag), paste0("p", ctr), new_flag)) %>%
  select(-ctr)

Output

      id date       category company flag  new_flag
   <dbl> <date>     <chr>    <chr>   <chr> <chr>   
 1     1 2001-01-04 a        x       rp    rp      
 2     1 2007-09-23 b        x       p1    p1      
 3     1 2008-11-14 b        x       p2    p2      
 4     2 2009-11-13 a        x       nr    nr      
 5     2 2012-07-21 b        y       p0    p0      
 6     2 2014-09-15 b        y       p0    p0      
 7     3 2000-04-01 a        x       rp    rp      
 8     3 2008-07-14 b        x       p1    p1      
 9     3 2008-07-14 b        x       p1    p1      
10     4 2001-03-21 a        x       nr    nr      
11     4 2019-05-23 b        y       p0    p0      
12     4 2019-05-08 b        z       p0    p0      
13     5 2004-07-06 a        x       rp    rp      
14     5 2007-08-12 a        x       rp    rp      
15     5 2011-09-20 b        x       p1    p1      
16     5 2011-09-20 b        x       p1    p1      
17     5 2014-08-15 b        x       p2    p2      
18     5 2014-08-15 b        y       p0    p0 
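
The counter step in the pipeline above can be hard to read in one line, so here is a minimal base R sketch of the same idea on a single toy group. The vectors below are made up for illustration, and `prev_date` is a hypothetical helper standing in for `lag(date, default = ...)`. It assumes the rows are already sorted by date within the group, because the counter only compares each row with the one before it.

```r
# Toy group: one "a" row already flagged "rp", then three "b" rows
# that still need a sequence number (NA).
new_flag <- c("rp", NA, NA, NA)
date <- as.Date(c("2004-07-06", "2011-09-20", "2011-09-20", "2014-08-15"))

# Date of the previous row. The first row is compared with the first
# unflagged date, mirroring lag(date, default = first(date[is.na(new_flag)])).
prev_date <- c(date[is.na(new_flag)][1], head(date, -1))

# Step the counter whenever an unflagged row has a different date
# from the row before it; rows sharing a date keep the same count.
ctr <- cumsum(is.na(new_flag) & date != prev_date)

# Fill only the missing flags with "p" plus the counter.
new_flag <- ifelse(is.na(new_flag), paste0("p", ctr), new_flag)
new_flag
# "rp" "p1" "p1" "p2"
```

If the data might not be sorted, arranging by id and date before this step would be the safer choice.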

The key is to write a function that flags the categories correctly based on your conditions. For each group of id and company, your conditions simplify to three mutually exclusive cases:

  • The company has both a and b; code all a's as "rp" and all b's as "p1", ..., "pn" in chronological order.
  • The company only has a; code all a's as "nr".
  • The company only has b; code all b's as "p0".

Hence, consider the following function:

flag_category <- function(x, date) {
  out <- character(length(x))
  a <- which(x == "a")
  b <- which(x == "b")
  if (length(a) > 0L && length(b) > 0L) {
    out[a] <- "rp"
    dateb <- date[b]    # get the date where category is "b"
    udateb <- unique(dateb)   # get the unique dates
    out[b] <- paste0("p", rank(udateb)[match(dateb, udateb)])    # `rank` finds the order for each unique date; use `match` to get the positions in `dateb` to which those ranks belong
    return(out)
  }
  if (length(a) > 0L) {
    out[] <- "nr"
    return(out)
  }
  out[] <- "p0"
  out
}
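
The `rank(udateb)[match(dateb, udateb)]` line is the least obvious part of the function, so the sketch below runs it on a small made-up date vector that is deliberately out of order and contains a tie. The `alt` line is only an assumed equivalent shown for comparison, not something the answer uses.

```r
# Three "b" dates: the latest one first, then a repeated earlier date.
dateb <- as.Date(c("2014-08-15", "2011-09-20", "2011-09-20"))

udateb <- unique(dateb)            # the two distinct dates, in order of appearance
r <- rank(udateb)                  # chronological position of each distinct date
dense <- r[match(dateb, udateb)]   # map every row back to its date's position

paste0("p", dense)
# "p2" "p1" "p1"

# Assumed equivalent: look each date up in the sorted distinct dates.
alt <- match(dateb, sort(udateb))
all(dense == alt)
# TRUE
```

Because ties are removed before ranking, equal dates always receive the same number and the numbering has no gaps.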

Then you can just apply it to each group of id and company.

dfx %>% group_by(id, company) %>% mutate(flag2 = flag_category(category, date)) 

Output

# A tibble: 18 x 6
# Groups:   id, company [9]
      id date       category company flag  flag2
   <dbl> <date>     <chr>    <chr>   <chr> <chr>
 1     1 2001-01-04 a        x       rp    rp   
 2     1 2007-09-23 b        x       p1    p1   
 3     1 2008-11-14 b        x       p2    p2   
 4     2 2009-11-13 a        x       nr    nr   
 5     2 2012-07-21 b        y       p0    p0   
 6     2 2014-09-15 b        y       p0    p0   
 7     3 2000-04-01 a        x       rp    rp   
 8     3 2008-07-14 b        x       p1    p1   
 9     3 2008-07-14 b        x       p1    p1   
10     4 2001-03-21 a        x       nr    nr   
11     4 2019-05-23 b        y       p0    p0   
12     4 2019-05-08 b        z       p0    p0   
13     5 2004-07-06 a        x       rp    rp   
14     5 2007-08-12 a        x       rp    rp   
15     5 2011-09-20 b        x       p1    p1   
16     5 2011-09-20 b        x       p1    p1   
17     5 2014-08-15 b        x       p2    p2   
18     5 2014-08-15 b        y       p0    p0 
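
As a cross-check, the same per-group application can be sketched in base R and compared with the expected flag column. The data and `flag_category` are repeated from above so the block runs on its own; the `split()` loop and the names `groups` and `rows` are assumptions of this sketch, not part of the answer.

```r
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,5,5)
date <- as.Date(c("2001-01-04", "2007-09-23", "2008-11-14",
                  "2009-11-13", "2012-07-21", "2014-09-15",
                  "2000-04-01", "2008-07-14", "2008-07-14",
                  "2001-03-21", "2019-05-23", "2019-05-08",
                  "2004-07-06", "2007-08-12", "2011-09-20",
                  "2011-09-20", "2014-08-15", "2014-08-15"))
category <- c("a", "b", "b", "a", "b", "b", "a", "b", "b",
              "a", "b", "b", "a", "a", "b", "b", "b", "b")
company <- c("x", "x", "x", "x", "y", "y", "x", "x", "x",
             "x", "y", "z", "x", "x", "x", "x", "x", "y")
flag <- c("rp", "p1", "p2", "nr", "p0", "p0", "rp", "p1", "p1",
          "nr", "p0", "p0", "rp", "rp", "p1", "p1", "p2", "p0")
dfx <- data.frame(id, date, category, company, flag)

flag_category <- function(x, date) {
  out <- character(length(x))
  a <- which(x == "a")
  b <- which(x == "b")
  if (length(a) > 0L && length(b) > 0L) {   # company has both a and b
    out[a] <- "rp"
    dateb <- date[b]
    udateb <- unique(dateb)
    out[b] <- paste0("p", rank(udateb)[match(dateb, udateb)])
    return(out)
  }
  if (length(a) > 0L) {                     # company only has a
    out[] <- "nr"
    return(out)
  }
  out[] <- "p0"                             # company only has b
  out
}

# Row indices of every (id, company) combination that actually occurs.
groups <- split(seq_len(nrow(dfx)), list(dfx$id, dfx$company), drop = TRUE)

flag2 <- character(nrow(dfx))
for (rows in groups) {
  flag2[rows] <- flag_category(dfx$category[rows], dfx$date[rows])
}

all(flag2 == as.character(dfx$flag))
# TRUE
```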
