Repeating variable in group by category in R
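I have data for which I want to create a new variable: flag.
The data are in long format, with repeated id values and a date associated with each row.
Two other important variables are category and company.
Category: each id has at least one category "a" and one category "b", but most of the time there are several of each. Company: the same id may have more than one company. Sometimes a category "b" row has the same company as a category "a" row for a given id. For convenience I have included only three companies: x, y and z.
I want to create a flag, so that when I group_by id
Below is the data frame with the flag variable (the expected output):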
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5)
date <- as.Date(c("2001-01-04", "2007-09-23", "2008-11-14",
                  "2009-11-13", "2012-07-21", "2014-09-15",
                  "2000-04-01", "2008-07-14", "2008-07-14",
                  "2001-03-21", "2019-05-23", "2019-05-08",
                  "2004-07-06", "2007-08-12", "2011-09-20",
                  "2011-09-20", "2014-08-15", "2014-08-15"))
category <- c("a", "b", "b", "a", "b", "b", "a", "b", "b",
              "a", "b", "b", "a", "a", "b", "b", "b", "b")
company <- c("x", "x", "x", "x", "y", "y", "x", "x", "x",
             "x", "y", "z", "x", "x", "x", "x", "x", "y")
flag <- c("rp", "p1", "p2", "nr", "p0", "p0", "rp", "p1",
          "p1", "nr", "p0", "p0", "rp", "rp", "p1", "p1",
          "p2", "p0")
dfx <- data.frame(id, date, category, company, flag)
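If I understand the logic correctly, one possible approach uses tidyverse. After grouping by both id and company, you can check whether categories "a" and "b" are both present; if they are, mark the rows with category "a" as "rp".
A more involved case_when can then handle your other rules, but it leaves NA in the rows where you need "p" followed by a sequence number. From those missing values you can build a temporary column holding a counter, which gives you "p1", "p2", and so on.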
library(tidyverse)
dfx %>%
  # Step 1: within each id/company pair, flag the "a" rows
  group_by(id, company) %>%
  mutate(new_flag = case_when(
    all(c("a", "b") %in% category) & category == "a" ~ "rp",
    category == "a" ~ "nr",
    TRUE ~ NA_character_)) %>%
  # Step 2: within each id, flag the "b" rows that get "p0";
  # "b" rows sharing the company of an "rp" row stay NA for now
  group_by(id) %>%
  mutate(new_flag = case_when(
    category == "b" & new_flag[category == "a"][1] == "nr" ~ "p0",
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company == company[category == "a"][1] ~ NA_character_,
    category == "b" & new_flag[category == "a"][1] == "rp" &
      company != company[category == "a"][1] ~ "p0",
    TRUE ~ new_flag)) %>%
  # Step 3: number the remaining NA rows, incrementing when the date changes
  group_by(id, company) %>%
  mutate(ctr = cumsum(is.na(new_flag) & date != lag(date, default = first(date[is.na(new_flag)])))) %>%
  mutate(new_flag = ifelse(is.na(new_flag), paste0("p", ctr), new_flag)) %>%
  select(-ctr)
Output
      id date       category company flag  new_flag
   <dbl> <date>     <chr>    <chr>   <chr> <chr>
 1     1 2001-01-04 a        x       rp    rp
 2     1 2007-09-23 b        x       p1    p1
 3     1 2008-11-14 b        x       p2    p2
 4     2 2009-11-13 a        x       nr    nr
 5     2 2012-07-21 b        y       p0    p0
 6     2 2014-09-15 b        y       p0    p0
 7     3 2000-04-01 a        x       rp    rp
 8     3 2008-07-14 b        x       p1    p1
 9     3 2008-07-14 b        x       p1    p1
10     4 2001-03-21 a        x       nr    nr
11     4 2019-05-23 b        y       p0    p0
12     4 2019-05-08 b        z       p0    p0
13     5 2004-07-06 a        x       rp    rp
14     5 2007-08-12 a        x       rp    rp
15     5 2011-09-20 b        x       p1    p1
16     5 2011-09-20 b        x       p1    p1
17     5 2014-08-15 b        x       p2    p2
18     5 2014-08-15 b        y       p0    p0
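The counter step at the end of that pipeline may be easier to follow on a single group. The sketch below is a base R rendering of the same idea on a small hand-made group (one "a" row already flagged, three "b" rows still NA). The name prev_date and the use of head() in place of dplyr's lag() are illustrative choices and are not part of the answer above.

```r
# One (id, company) group after the case_when steps: the "a" row is already
# flagged, the "b" rows are still NA and need "p1", "p2", ...
new_flag <- c("rp", NA, NA, NA)
date <- as.Date(c("2004-07-06", "2011-09-20", "2011-09-20", "2014-08-15"))

# Previous row's date; the first row falls back to the first unflagged date,
# mirroring lag(date, default = first(date[is.na(new_flag)])).
prev_date <- c(date[is.na(new_flag)][1], head(date, -1))

# The counter increases only on unflagged rows whose date differs from the
# previous row, so rows sharing a date share a number.
ctr <- cumsum(is.na(new_flag) & date != prev_date)

result <- ifelse(is.na(new_flag), paste0("p", ctr), new_flag)
print(result)
```

Because the counter compares each row only with the row before it, this approach assumes that rows with the same date sit next to each other within the group.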
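The key is to write a function that flags the categories correctly according to your conditions. Within each group of id and company, your conditions reduce to three mutually exclusive cases, which are the ones the function below encodes:
1. the group contains both "a" and "b": the "a" rows get "rp", and each "b" row gets "p" followed by the rank of its date among the group's distinct "b" dates;
2. the group contains "a" but no "b": every row gets "nr";
3. the group contains no "a": every row gets "p0".
So consider the following function: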
flag_category <- function(x, date) {
  out <- character(length(x))
  a <- which(x == "a")
  b <- which(x == "b")
  if (length(a) > 0L && length(b) > 0L) {
    out[a] <- "rp"
    dateb <- date[b]          # get the dates where category is "b"
    udateb <- unique(dateb)   # get the unique dates
    # `rank` finds the order of each unique date; `match` gives the positions
    # in `dateb` to which those ranks belong
    out[b] <- paste0("p", rank(udateb)[match(dateb, udateb)])
    return(out)
  }
  if (length(a) > 0L) {
    out[] <- "nr"
    return(out)
  }
  out[] <- "p0"
  out
}
You can then apply it to each group of id and company.
dfx %>% group_by(id, company) %>% mutate(flag2 = flag_category(category, date))
Output
# A tibble: 18 x 6
# Groups:   id, company [9]
      id date       category company flag  flag2
   <dbl> <date>     <chr>    <chr>   <chr> <chr>
 1     1 2001-01-04 a        x       rp    rp
 2     1 2007-09-23 b        x       p1    p1
 3     1 2008-11-14 b        x       p2    p2
 4     2 2009-11-13 a        x       nr    nr
 5     2 2012-07-21 b        y       p0    p0
 6     2 2014-09-15 b        y       p0    p0
 7     3 2000-04-01 a        x       rp    rp
 8     3 2008-07-14 b        x       p1    p1
 9     3 2008-07-14 b        x       p1    p1
10     4 2001-03-21 a        x       nr    nr
11     4 2019-05-23 b        y       p0    p0
12     4 2019-05-08 b        z       p0    p0
13     5 2004-07-06 a        x       rp    rp
14     5 2007-08-12 a        x       rp    rp
15     5 2011-09-20 b        x       p1    p1
16     5 2011-09-20 b        x       p1    p1
17     5 2014-08-15 b        x       p2    p2
18     5 2014-08-15 b        y       p0    p0
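The rank/match line inside flag_category is the least obvious part, so here is a minimal base R sketch of it in isolation. The helper name date_flags is hypothetical and exists only for this illustration; the dates are taken from the "b" rows of id 5, company x.

```r
# Hypothetical helper: number rows by the rank of their date among the
# distinct dates, so rows sharing a date share a number.
date_flags <- function(dateb) {
  udateb <- unique(dateb)        # distinct dates, in order of first appearance
  ranks <- rank(udateb)          # chronological rank of each distinct date
  pos <- match(dateb, udateb)    # position of each row's date within udateb
  paste0("p", ranks[pos])
}

# Rows already in date order
sorted_dates <- as.Date(c("2011-09-20", "2011-09-20", "2014-08-15"))
flags_sorted <- date_flags(sorted_dates)
print(flags_sorted)

# Same dates with the latest one first: numbering follows the dates,
# not the row order
shuffled_dates <- as.Date(c("2014-08-15", "2011-09-20", "2011-09-20"))
flags_shuffled <- date_flags(shuffled_dates)
print(flags_shuffled)
```

Since the numbers come from ranking the dates themselves, this version does not depend on the rows being sorted by date within a group.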