R - Iterating through each group and dynamically assigning values
I have the following dataset:

id | row_name | start_date | end_date | rows_overlap_period |
---|---|---|---|---|
person_1 | 1 | 2010-04-23 | 2010-06-22 | 2,3,4,5,6 |
person_1 | 2 | 2010-04-25 | 2010-06-24 | 3,4,5,6 |
person_1 | 3 | 2010-04-27 | 2010-06-26 | 4,5,6,7 |
person_1 | 4 | 2010-04-29 | 2010-06-28 | 5,6,7,8 |
person_1 | 5 | 2010-04-30 | 2010-06-29 | 6,7,8 |
person_1 | 6 | 2010-05-08 | 2010-07-07 | 7,8 |
person_1 | 7 | 2010-06-26 | 2010-08-25 | 8 |
person_1 | 8 | 2010-06-28 | 2010-08-27 | |
person_2 | 9 | 2010-07-30 | 2010-09-28 | 10 |
person_2 | 10 | 2010-08-02 | 2010-10-01 | |
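The "rows_overlap_period" column lists the other records of the same id whose "start_date" falls between the "start_date" and "end_date" of the current row.

However, I would like to iterate within each group to arrive at the following result: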

id | row_name | start_date | end_date | rows_overlap_period |
---|---|---|---|---|
person_1 | 1 | 2010-04-23 | 2010-06-22 | 2,3,4,5,6 |
person_1 | 2 | 2010-04-25 | 2010-06-24 | |
person_1 | 3 | 2010-04-27 | 2010-06-26 | |
person_1 | 4 | 2010-04-29 | 2010-06-28 | |
person_1 | 5 | 2010-04-30 | 2010-06-29 | |
person_1 | 6 | 2010-05-08 | 2010-07-07 | |
person_1 | 7 | 2010-06-26 | 2010-08-25 | 8 |
person_1 | 8 | 2010-06-28 | 2010-08-27 | |
person_2 | 9 | 2010-07-30 | 2010-09-28 | 10 |
person_2 | 10 | 2010-08-02 | 2010-10-01 | |

This output would be the result of the following algorithm:

For each group:

Reproducible example (what I have so far):

```r
# Input data
data.frame(id = c("person_1", "person_1", "person_1", "person_1", "person_1",
                  "person_1", "person_1", "person_1", "person_2",
                  "person_2"),
           row_name = rep(1:10),
           start_date = as.Date(c("2010-04-23", "2010-04-25", "2010-04-27",
                                  "2010-04-29", "2010-04-30", "2010-05-08",
                                  "2010-06-26", "2010-06-28", "2010-07-30",
                                  "2010-08-02")),
           end_date = as.Date(c("2010-06-22", "2010-06-24", "2010-06-26",
                                "2010-06-28", "2010-06-29", "2010-07-07",
                                "2010-08-25", "2010-08-27", "2010-09-28",
                                "2010-10-01"))) -> data

# Find overlaps (column rows_overlap_period)
sqldf::sqldf("select a.*,
              coalesce(group_concat(b.row_name), ' ') as rows_overlap_period
              from data a
              left join data b on
                a.id = b.id and
                not a.row_name = b.row_name and
                (b.start_date between
                 a.start_date and a.end_date)
              group by a.rowid
              order by a.rowid") -> data
```
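
For readers without sqldf, the overlap rule described above (same id, different row, `start_date` inside the current row's period) can be sketched in base R. This is only an illustration under assumptions: the helper name `overlaps_for` and the data frame name `df` are hypothetical, and rows with no overlaps get an empty string rather than the single space produced by `coalesce(..., ' ')`. It still compares every row against every other row, so it is not a performance fix.

```r
# Same sample data as in the question, under the hypothetical name df
df <- data.frame(
  id = c(rep("person_1", 8), rep("person_2", 2)),
  row_name = 1:10,
  start_date = as.Date(c("2010-04-23", "2010-04-25", "2010-04-27", "2010-04-29",
                         "2010-04-30", "2010-05-08", "2010-06-26", "2010-06-28",
                         "2010-07-30", "2010-08-02")),
  end_date = as.Date(c("2010-06-22", "2010-06-24", "2010-06-26", "2010-06-28",
                       "2010-06-29", "2010-07-07", "2010-08-25", "2010-08-27",
                       "2010-09-28", "2010-10-01"))
)

# Hypothetical helper: for row i, collect the row_name of every other row of
# the same id whose start_date lies in [start_date[i], end_date[i]]
# (both ends inclusive, like SQL BETWEEN).
overlaps_for <- function(i, d) {
  hit <- d$id == d$id[i] &
    d$row_name != d$row_name[i] &
    d$start_date >= d$start_date[i] &
    d$start_date <= d$end_date[i]
  paste(d$row_name[hit], collapse = ",")  # "" when nothing matches
}

df$rows_overlap_period <- vapply(seq_len(nrow(df)), overlaps_for,
                                 character(1), d = df)
```
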
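
I have tried hard to find a solution directly with dplyr, data.table, or sqldf, but I cannot find one that avoids a "loop within a loop", which would hurt performance considerably.

Does anyone have a suggestion for how I could achieve this?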

In addition to the "id" column, we can create a second grouping column to do this:
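
```r
library(dplyr)
data %>%
  group_by(id) %>%
  # grp increases on the row just before a row with no overlaps
  mutate(grp = cumsum(lead(!nzchar(trimws(rows_overlap_period)),
                           default = FALSE))) %>%
  group_by(grp, .add = TRUE) %>%
  # keep the overlap list only on the first row of each (id, grp) block
  mutate(rows_overlap_period = case_when(row_number() == 1 ~
                                           rows_overlap_period, TRUE ~ "")) %>%
  ungroup %>%
  select(-grp)
```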

-output

```
# A tibble: 10 × 5
   id       row_name start_date end_date   rows_overlap_period
   <chr>       <int> <date>     <date>     <chr>
 1 person_1        1 2010-04-23 2010-06-22 "2,3,4,5,6"
 2 person_1        2 2010-04-25 2010-06-24 ""
 3 person_1        3 2010-04-27 2010-06-26 ""
 4 person_1        4 2010-04-29 2010-06-28 ""
 5 person_1        5 2010-04-30 2010-06-29 ""
 6 person_1        6 2010-05-08 2010-07-07 ""
 7 person_1        7 2010-06-26 2010-08-25 "8"
 8 person_1        8 2010-06-28 2010-08-27 ""
 9 person_2        9 2010-07-30 2010-09-28 "10"
10 person_2       10 2010-08-02 2010-10-01 ""
```
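
To see how the extra grouping key behaves, the steps can be traced on the `person_1` values alone. The sketch below uses base R only, so `c(is_empty[-1], FALSE)` is a stand-in for `dplyr::lead(..., default = FALSE)`, and the variable names are made up for illustration. It matches the sample data. Whether the same key gives the intended result for other overlap patterns is not established here.

```r
# rows_overlap_period for person_1 as produced by the sqldf step
# (the last row holds a single space because of coalesce(..., ' '))
overlap <- c("2,3,4,5,6", "3,4,5,6", "4,5,6,7", "5,6,7,8",
             "6,7,8", "7,8", "8", " ")

# TRUE where a row has no overlaps once whitespace is trimmed
is_empty <- !nzchar(trimws(overlap))

# Base R stand-in for lead(is_empty, default = FALSE): shift left by one
next_is_empty <- c(is_empty[-1], FALSE)

# Running count that increases on the row just before an empty row
grp <- cumsum(next_is_empty)

# Keep the overlap list only on the first row of each run of equal grp
result <- ifelse(!duplicated(grp), overlap, "")
```

Here `grp` is 0 for rows 1 to 6 and 1 for rows 7 and 8, so only rows 1 and 7 keep their overlap lists.
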