Repeated variables by category in R

Question

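I have data for which I want to create a new variable, flag.
The data are in long format, with repeated ids and an associated date.

Two other important variables are category and company.
Category: each id has at least one category "a" row and one category "b" row, but most of the time there are several of each.
Company: the same id may have several companies. Sometimes a category "b" row has the same company as a category "a" row for a given id. For convenience I have included only three companies: x, y and z.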

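I want to create a flag such that, when grouped by id:

If at least one company has a product in both category "b" and category "a", flag the "a" rows for that company as "rp" (repeated product).
If not, flag all the corresponding "a" rows as "nr" (no repeated product in b).
For "b" rows that have a corresponding "rp", I want to number them by date (p1, p2, p3, ...); the remaining "b" rows, whose company does not match, get "p0".
For "b" rows whose corresponding "a" is "nr", we can again call them "p0".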

Below is the data frame including the flag variable (the expected output):

id<- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,5,5)
date<- as.Date(c("2001-01-04", "2007-09-23", "2008-11-14",
                 "2009-11-13", "2012-07-21", "2014-09-15",
                 "2000-04-01", "2008-07-14", "2008-07-14", 
                 "2001-03-21", "2019-05-23", "2019-05-08", 
                 "2004-07-06", "2007-08-12", "2011-09-20", 
                 "2011-09-20", "2014-08-15", "2014-08-15"))
category<- c("a", "b", "b", "a", "b", "b", "a", "b", "b",
           "a", "b", "b", "a", "a", "b", "b", "b", "b")
company<-c("x", "x", "x", "x", "y", "y", "x", "x", "x",
           "x", "y", "z", "x", "x", "x", "x",  "x", "y")
flag<-c ("rp","p1", "p2", "nr", "p0", "p0", "rp", "p1",
         "p1", "nr", "p0", "p0", "rp", "rp", "p1", "p1", 
         "p2", "p0")
dfx <- data.frame(id, date, category, company, flag)

Answer 1

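If I understand the logic correctly, one possible approach uses the tidyverse. After grouping by both id and company, you can check whether both categories "a" and "b" are present; if so, flag the rows with category "a" as "rp".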
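The group-wise check can be tried on its own before reading the full pipeline. This is a minimal base R sketch, not the answer's code: the toy data frame and the names toy, grp, both_by_group, has_both and flag_a are assumptions made for illustration.

```r
# Hypothetical toy data, not the asker's dfx
toy <- data.frame(
  id       = c(1, 1, 2, 2),
  category = c("a", "b", "a", "b"),
  company  = c("x", "x", "x", "y")
)

# One group per (id, company) combination that actually occurs
grp <- interaction(toy$id, toy$company, drop = TRUE)

# For each group: TRUE when both "a" and "b" are present
both_by_group <- tapply(toy$category, grp,
                        function(ct) all(c("a", "b") %in% ct))

# Spread the per-group answer back over the rows
has_both <- as.vector(both_by_group[as.character(grp)])

# "a" rows become "rp" or "nr"; "b" rows stay NA for the later steps
flag_a <- ifelse(toy$category == "a",
                 ifelse(has_both, "rp", "nr"),
                 NA_character_)
print(data.frame(toy, has_both, flag_a))
```

Here id 1 has company x in both categories, so its "a" row is flagged "rp", while id 2 has x only in "a" and y only in "b", so its "a" row is flagged "nr".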

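A more involved case_when can then handle your other rules, leaving NA in the cases where you need "p" followed by a running number. From those missing values you can build a temporary counter column that gives you "p1", "p2", and so on.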
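The counter idea can be illustrated in isolation. The sketch below uses base R instead of dplyr's lag(), and it assumes the dates are already sorted in ascending order within one id and company group; the values and the names d, new_date, ctr and labels are hypothetical.

```r
# Dates of the rows still flagged NA within one (id, company) group,
# assumed sorted in ascending order (hypothetical values)
d <- as.Date(c("2011-09-20", "2011-09-20", "2014-08-15"))

# TRUE for the first row and wherever the date differs from the previous row
new_date <- c(TRUE, d[-1] != d[-length(d)])

# The running count of distinct dates becomes the numeric suffix,
# so rows that share a date share a label
ctr <- cumsum(new_date)
labels <- paste0("p", ctr)
print(labels)
```

Because the count only increases when the date changes, the two rows dated 2011-09-20 receive the same label.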

library(tidyverse)

dfx %>%
  group_by(id, company) %>%
  mutate(new_flag = case_when(
    all(c("a", "b") %in% category) & category == "a" ~ "rp",
    category == "a" ~ "nr",
    TRUE ~ NA_character_)) %>%
  group_by(id) %>%
  mutate(new_flag = case_when(
    category == "b" & new_flag[category == "a"][1] == "nr" ~ "p0", 
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company == company[category == "a"][1] ~ NA_character_,
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company != company[category == "a"][1] ~ "p0",
    TRUE ~ new_flag)) %>%
  group_by(id, company) %>%
  mutate(ctr = cumsum(is.na(new_flag) & date != lag(date, default = first(date[is.na(new_flag)])))) %>%
  mutate(new_flag = ifelse(is.na(new_flag), paste0("p", ctr), new_flag)) %>%
  select(-ctr)

Output

      id date       category company flag  new_flag
   <dbl> <date>     <chr>    <chr>   <chr> <chr>   
 1     1 2001-01-04 a        x       rp    rp      
 2     1 2007-09-23 b        x       p1    p1      
 3     1 2008-11-14 b        x       p2    p2      
 4     2 2009-11-13 a        x       nr    nr      
 5     2 2012-07-21 b        y       p0    p0      
 6     2 2014-09-15 b        y       p0    p0      
 7     3 2000-04-01 a        x       rp    rp      
 8     3 2008-07-14 b        x       p1    p1      
 9     3 2008-07-14 b        x       p1    p1      
10     4 2001-03-21 a        x       nr    nr      
11     4 2019-05-23 b        y       p0    p0      
12     4 2019-05-08 b        z       p0    p0      
13     5 2004-07-06 a        x       rp    rp      
14     5 2007-08-12 a        x       rp    rp      
15     5 2011-09-20 b        x       p1    p1      
16     5 2011-09-20 b        x       p1    p1      
17     5 2014-08-15 b        x       p2    p2      
18     5 2014-08-15 b        y       p0    p0

Answer 2

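The key is to write a function that flags the categories according to your conditions. For each group of id and company, your conditions reduce to three mutually exclusive cases:

The company has both a and b; code all the a's as "rp" and the b's as "p1" to "pn" in chronological order.
The company has only a; code them all as "nr".
The company has only b; code all the b's as "p0".

So consider the following function: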

flag_category <- function(x, date) {
  out <- character(length(x))
  a <- which(x == "a")
  b <- which(x == "b")
  if (length(a) > 0L && length(b) > 0L) {
    out[a] <- "rp"
    dateb <- date[b]          # dates of the "b" rows
    udateb <- unique(dateb)   # distinct "b" dates
    # rank() orders the distinct dates; match() maps each "b" row back to
    # the rank of its date, so rows sharing a date share a label
    out[b] <- paste0("p", rank(udateb)[match(dateb, udateb)])
    return(out)
  }
  if (length(a) > 0L) {
    out[] <- "nr"
    return(out)
  }
  out[] <- "p0"
  out
}
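To see why the rank and match combination does not require the rows to be sorted, the step can be run by itself on a few dates. This is only an illustrative sketch; the date values and the name labels are made up.

```r
# Hypothetical "b" dates, deliberately out of chronological order
dateb <- as.Date(c("2014-08-15", "2011-09-20", "2011-09-20"))

# Distinct dates, in order of first appearance
udateb <- unique(dateb)

# rank(udateb) gives the chronological position of each distinct date;
# match(dateb, udateb) looks up which distinct date each row corresponds to
labels <- paste0("p", rank(udateb)[match(dateb, udateb)])
print(labels)
```

The later date is labelled "p2" even though it comes first in the vector, and both rows dated 2011-09-20 are labelled "p1".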

You can then apply it to each group of id and company.

library(dplyr)  # provides %>%, group_by() and mutate()
dfx %>% group_by(id, company) %>% mutate(flag2 = flag_category(category, date))

Output

# A tibble: 18 x 6
# Groups:   id, company [9]
      id date       category company flag  flag2
   <dbl> <date>     <chr>    <chr>   <chr> <chr>
 1     1 2001-01-04 a        x       rp    rp   
 2     1 2007-09-23 b        x       p1    p1   
 3     1 2008-11-14 b        x       p2    p2   
 4     2 2009-11-13 a        x       nr    nr   
 5     2 2012-07-21 b        y       p0    p0   
 6     2 2014-09-15 b        y       p0    p0   
 7     3 2000-04-01 a        x       rp    rp   
 8     3 2008-07-14 b        x       p1    p1   
 9     3 2008-07-14 b        x       p1    p1   
10     4 2001-03-21 a        x       nr    nr   
11     4 2019-05-23 b        y       p0    p0   
12     4 2019-05-08 b        z       p0    p0   
13     5 2004-07-06 a        x       rp    rp   
14     5 2007-08-12 a        x       rp    rp   
15     5 2011-09-20 b        x       p1    p1   
16     5 2011-09-20 b        x       p1    p1   
17     5 2014-08-15 b        x       p2    p2   
18     5 2014-08-15 b        y       p0    p0
