R: Iterating through each group and dynamically assigning values
I have the following dataset:
id | row_name | start_date | end_date | rows_overlap_period |
---|---|---|---|---|
person_1 | 1 | 2010-04-23 | 2010-06-22 | 2,3,4,5,6 |
person_1 | 2 | 2010-04-25 | 2010-06-24 | 3,4,5,6 |
person_1 | 3 | 2010-04-27 | 2010-06-26 | 4,5,6,7 |
person_1 | 4 | 2010-04-29 | 2010-06-28 | 5,6,7,8 |
person_1 | 5 | 2010-04-30 | 2010-06-29 | 6,7,8 |
person_1 | 6 | 2010-05-08 | 2010-07-07 | 7,8 |
person_1 | 7 | 2010-06-26 | 2010-08-25 | 8 |
person_1 | 8 | 2010-06-28 | 2010-08-27 | |
person_2 | 9 | 2010-07-30 | 2010-09-28 | 10 |
person_2 | 10 | 2010-08-02 | 2010-10-01 | |
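The "rows_overlap_period" column lists, for each row, the other records of the same id whose "start_date" falls between that row's "start_date" and "end_date".
However, I want to iterate within each group to arrive at the following result: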
id | row_name | start_date | end_date | rows_overlap_period |
---|---|---|---|---|
person_1 | 1 | 2010-04-23 | 2010-06-22 | 2,3,4,5,6 |
person_1 | 2 | 2010-04-25 | 2010-06-24 | |
person_1 | 3 | 2010-04-27 | 2010-06-26 | |
person_1 | 4 | 2010-04-29 | 2010-06-28 | |
person_1 | 5 | 2010-04-30 | 2010-06-29 | |
person_1 | 6 | 2010-05-08 | 2010-07-07 | |
person_1 | 7 | 2010-06-26 | 2010-08-25 | 8 |
person_1 | 8 | 2010-06-28 | 2010-08-27 | |
person_2 | 9 | 2010-07-30 | 2010-09-28 | 10 |
person_2 | 10 | 2010-08-02 | 2010-10-01 | |
This output would be the result of the following algorithm:
For each group:
Reproducible example (what I have so far):
    # Input data
    data.frame(id = c("person_1", "person_1", "person_1", "person_1", "person_1",
                      "person_1", "person_1", "person_1", "person_2",
                      "person_2"),
               row_name = rep(1:10),
               start_date = as.Date(c("2010-04-23", "2010-04-25", "2010-04-27",
                                      "2010-04-29", "2010-04-30", "2010-05-08",
                                      "2010-06-26", "2010-06-28", "2010-07-30",
                                      "2010-08-02")),
               end_date = as.Date(c("2010-06-22", "2010-06-24", "2010-06-26",
                                    "2010-06-28", "2010-06-29", "2010-07-07",
                                    "2010-08-25", "2010-08-27", "2010-09-28",
                                    "2010-10-01"))) -> data

    # Find overlaps (column rows_overlap_period)
    sqldf::sqldf("select a.*,
                         coalesce(group_concat(b.row_name), ' ') as rows_overlap_period
                  from data a
                  left join data b on
                       a.id = b.id and
                       not a.row_name = b.row_name and
                       (b.start_date between
                        a.start_date and a.end_date)
                  group by a.rowid
                  order by a.rowid") -> data
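As a cross-check of what the query computes, the same overlap rule can be sketched in base R without sqldf. This is only an illustration of the rule stated above, not a faster replacement: `overlaps_in_group()` and the `df` name are hypothetical, and rows without overlaps come out as an empty string here rather than the single space that `coalesce(..., ' ')` produces.

```r
# Same input as above, rebuilt so this block runs on its own
df <- data.frame(
  id = rep(c("person_1", "person_2"), times = c(8, 2)),
  row_name = 1:10,
  start_date = as.Date(c("2010-04-23", "2010-04-25", "2010-04-27",
                         "2010-04-29", "2010-04-30", "2010-05-08",
                         "2010-06-26", "2010-06-28", "2010-07-30",
                         "2010-08-02")),
  end_date = as.Date(c("2010-06-22", "2010-06-24", "2010-06-26",
                       "2010-06-28", "2010-06-29", "2010-07-07",
                       "2010-08-25", "2010-08-27", "2010-09-28",
                       "2010-10-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package): for every row of one id,
# collect the row_name of each OTHER row whose start_date lies within
# that row's [start_date, end_date] interval.
overlaps_in_group <- function(g) {
  vapply(seq_len(nrow(g)), function(i) {
    hit <- g$row_name != g$row_name[i] &
      g$start_date >= g$start_date[i] &
      g$start_date <= g$end_date[i]
    paste(g$row_name[hit], collapse = ",")
  }, character(1))
}

# split() works per id; unsplit() puts the results back in the original row order
df$rows_overlap_period <- unsplit(lapply(split(df, df$id), overlaps_in_group), df$id)
print(df)
```

The comparison inside the helper is still quadratic within each id, so it is meant for checking small inputs against the SQL result.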
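I really tried to find a solution using dplyr, data.table, or sqldf directly, but I could not find a way that avoids a loop within a loop, which would hurt performance considerably.
Does anyone have any suggestions on how I can achieve this?
In addition to the "id" column, we can create a second grouping column to do this: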
    library(dplyr)

    data %>%
      group_by(id) %>%
      # start a new block at the row just before each blank overlap entry
      mutate(grp = cumsum(lead(!nzchar(trimws(rows_overlap_period)),
                               default = FALSE))) %>%
      group_by(grp, .add = TRUE) %>%
      # keep the overlap list only on the first row of each block
      mutate(rows_overlap_period = case_when(row_number() == 1 ~ rows_overlap_period,
                                             TRUE ~ "")) %>%
      ungroup() %>%
      select(-grp)
-output

    # A tibble: 10 × 5
       id       row_name start_date end_date   rows_overlap_period
       <chr>       <int> <date>     <date>     <chr>
     1 person_1        1 2010-04-23 2010-06-22 "2,3,4,5,6"
     2 person_1        2 2010-04-25 2010-06-24 ""
     3 person_1        3 2010-04-27 2010-06-26 ""
     4 person_1        4 2010-04-29 2010-06-28 ""
     5 person_1        5 2010-04-30 2010-06-29 ""
     6 person_1        6 2010-05-08 2010-07-07 ""
     7 person_1        7 2010-06-26 2010-08-25 "8"
     8 person_1        8 2010-06-28 2010-08-27 ""
     9 person_2        9 2010-07-30 2010-09-28 "10"
    10 person_2       10 2010-08-02 2010-10-01 ""
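To see why the grouping column yields this output, the intermediate `grp` values can be reproduced in base R. The sketch below is an illustration only: `block_index()` is a hypothetical helper that mimics the `cumsum(lead(...))` expression for one id, and the `overlap` vector is copied by hand from the question's first table, with blanks written as a single space as the sqldf query returns them.

```r
# Values copied from the rows_overlap_period column of the question
id <- rep(c("person_1", "person_2"), times = c(8, 2))
overlap <- c("2,3,4,5,6", "3,4,5,6", "4,5,6,7", "5,6,7,8",
             "6,7,8", "7,8", "8", " ", "10", " ")

# Hypothetical helper: block counter for the rows of a single id
block_index <- function(x) {
  is_blank <- !nzchar(trimws(x))
  # base R stand-in for dplyr::lead(is_blank, default = FALSE)
  next_is_blank <- c(is_blank[-1], FALSE)
  # the counter increases on the row just before each blank entry
  cumsum(next_is_blank)
}

# Compute the counter per id and restore the original row order
grp <- unsplit(lapply(split(overlap, id), block_index), id)

# Keep the overlap list only on the first row of each (id, grp) block
first_in_block <- !duplicated(data.frame(id, grp))
result <- ifelse(first_in_block, overlap, "")

print(data.frame(id, grp, first_in_block, result))
```

For person_1 the counter is 0 for rows 1 to 6 and 1 for rows 7 and 8, so only rows 1 and 7 keep their lists. Note that the block boundaries come solely from where blank entries occur, so the approach depends on a blank entry closing each run of overlapping rows, as it does in this sample data.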