
R: Iterating through each group and dynamically assigning values

I have the following dataset:

id        row_name  start_date  end_date    rows_overlap_period
person_1  1         2010-04-23  2010-06-22  2,3,4,5,6
person_1  2         2010-04-25  2010-06-24  3,4,5,6
person_1  3         2010-04-27  2010-06-26  4,5,6,7
person_1  4         2010-04-29  2010-06-28  5,6,7,8
person_1  5         2010-04-30  2010-06-29  6,7,8
person_1  6         2010-05-08  2010-07-07  7,8
person_1  7         2010-06-26  2010-08-25  8
person_1  8         2010-06-28  2010-08-27
person_2  9         2010-07-30  2010-09-28  10
person_2  10        2010-08-02  2010-10-01

The "rows_overlap_period" column lists, for each record, the other records of the same id whose 'start_date' falls between that record's 'start_date' and 'end_date'.

However, I would like to iterate within each group to arrive at the following result:

id        row_name  start_date  end_date    rows_overlap_period
person_1  1         2010-04-23  2010-06-22  2,3,4,5,6
person_1  2         2010-04-25  2010-06-24
person_1  3         2010-04-27  2010-06-26
person_1  4         2010-04-29  2010-06-28
person_1  5         2010-04-30  2010-06-29
person_1  6         2010-05-08  2010-07-07
person_1  7         2010-06-26  2010-08-25  8
person_1  8         2010-06-28  2010-08-27
person_2  9         2010-07-30  2010-09-28  10
person_2  10        2010-08-02  2010-10-01

This output would be the result of the following algorithm:

For each group:

  1. Get the first row for which the 'rows_overlap_period' column is not empty (e.g. row_name = 1).
  2. For the selected row, take its list of overlap values (e.g. "2,3,4,5,6") and assign ' ' to 'rows_overlap_period' for each of those row_names (in this case, replacing the values "3,4,5,6", "4,5,6,7", "5,6,7,8", "6,7,8" and "7,8" with " ").
  3. Within the same group, look for the next row whose 'rows_overlap_period' is not empty and repeat steps 1 and 2. If there is none, move on to the next group.

Reproducible example (what I have so far):

# Input data
data.frame(id = c("person_1", "person_1", "person_1", "person_1", "person_1",
                     "person_1", "person_1", "person_1", "person_2",
                     "person_2"),
           row_name = rep(1:10),
           start_date = as.Date(c("2010-04-23", "2010-04-25", "2010-04-27",
                                  "2010-04-29", "2010-04-30", "2010-05-08",
                                  "2010-06-26", "2010-06-28", "2010-07-30",
                                  "2010-08-02")),
           end_date = as.Date(c("2010-06-22", "2010-06-24", "2010-06-26",
                                "2010-06-28", "2010-06-29", "2010-07-07",
                                "2010-08-25", "2010-08-27", "2010-09-28",
                                "2010-10-01"))) -> data


# Find overlaps (column rows_overlap_period)
sqldf::sqldf("select a.*,
                     coalesce(group_concat(b.row_name), ' ') as rows_overlap_period
             from data a
             left join data b on
                       a.id = b.id and
                       not a.row_name = b.row_name and
                       (b.start_date between
                        a.start_date and a.end_date) 
                    group by a.rowid
                    order by a.rowid") -> data
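
For readers without sqldf, the same overlap rule can be sketched in base R. This is a minimal illustration on a small, made-up data frame (`toy` and the helper `find_overlaps` are hypothetical names, not part of the original post). It compares every row against every other row, so it is quadratic in the number of rows and is meant only to make the rule explicit.

```r
# Toy data (hypothetical; smaller than the example above)
toy <- data.frame(
  id         = c("a", "a", "a", "b"),
  row_name   = 1:4,
  start_date = as.Date(c("2010-01-01", "2010-01-10", "2010-03-01", "2010-01-05")),
  end_date   = as.Date(c("2010-02-01", "2010-03-15", "2010-04-01", "2010-02-05"))
)

# For each row, collect the row_names of the other rows with the same id
# whose start_date lies within [start_date, end_date] of the current row
# (inclusive on both ends, like SQL's BETWEEN).
find_overlaps <- function(d) {
  vapply(seq_len(nrow(d)), function(i) {
    hit <- d$id == d$id[i] &
      d$row_name != d$row_name[i] &
      d$start_date >= d$start_date[i] &
      d$start_date <= d$end_date[i]
    # Use " " for "no overlap", mirroring the coalesce(..., ' ') above
    if (any(hit)) paste(d$row_name[hit], collapse = ",") else " "
  }, character(1))
}

toy$rows_overlap_period <- find_overlaps(toy)
print(toy)
```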

I have been trying to find a solution using dplyr, data.table or sqldf directly, but I can't find a way to avoid 'loops within loops', which would degrade performance a lot.

Does anyone have any suggestions on how I can achieve this?

We could create a grouping column, in addition to the 'id' column, to do this:

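library(dplyr)

data %>%
  group_by(id) %>%
  # grp increases at every row whose next row has a blank rows_overlap_period
  mutate(grp = cumsum(lead(!nzchar(trimws(rows_overlap_period)),
                           default = FALSE))) %>%
  group_by(grp, .add = TRUE) %>%
  # keep the value only on the first row of each (id, grp) sub-group
  mutate(rows_overlap_period = case_when(row_number() == 1 ~
      rows_overlap_period, TRUE ~ "")) %>%
  ungroup() %>%
  select(-grp)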

-output

# A tibble: 10 × 5
   id       row_name start_date end_date   rows_overlap_period
   <chr>       <int> <date>     <date>     <chr>              
 1 person_1        1 2010-04-23 2010-06-22 "2,3,4,5,6"        
 2 person_1        2 2010-04-25 2010-06-24 ""                 
 3 person_1        3 2010-04-27 2010-06-26 ""                 
 4 person_1        4 2010-04-29 2010-06-28 ""                 
 5 person_1        5 2010-04-30 2010-06-29 ""                 
 6 person_1        6 2010-05-08 2010-07-07 ""                 
 7 person_1        7 2010-06-26 2010-08-25 "8"                
 8 person_1        8 2010-06-28 2010-08-27 ""                 
 9 person_2        9 2010-07-30 2010-09-28 "10"               
10 person_2       10 2010-08-02 2010-10-01 ""      
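
For cross-checking on other inputs, the three steps from the question can also be written literally as a per-group loop. The following is a minimal base R sketch, not an optimized solution: `prune_overlaps` and `example` are hypothetical names, the date columns are left out because the pruning only needs `id`, `row_name` and `rows_overlap_period`, and cleared cells are set to "" (as in the output above) rather than " ".

```r
# Sample rows with rows_overlap_period already computed (dates omitted)
example <- data.frame(
  id       = rep(c("person_1", "person_2"), times = c(8, 2)),
  row_name = 1:10,
  rows_overlap_period = c("2,3,4,5,6", "3,4,5,6", "4,5,6,7", "5,6,7,8",
                          "6,7,8", "7,8", "8", " ", "10", " ")
)

# Literal translation of the three steps: within each id, walk the rows in
# order; each row that is still non-blank keeps its list and blanks the
# rows it names, so those rows are skipped when the loop reaches them.
prune_overlaps <- function(d) {
  out <- d$rows_overlap_period
  for (rows in split(seq_len(nrow(d)), d$id)) {
    for (i in rows) {
      if (!nzchar(trimws(out[i]))) next  # blank (or already cleared): skip
      named <- trimws(strsplit(out[i], ",", fixed = TRUE)[[1]])
      # positions, within this id, of the rows named by row i
      out[rows[as.character(d$row_name[rows]) %in% named]] <- ""
    }
  }
  d$rows_overlap_period <- out
  d
}

print(prune_overlaps(example))
```

On the sample data this gives the same rows_overlap_period column as the output above. It makes a single pass over the rows of each id, so it can serve as a baseline when checking a vectorised approach against data with a different pattern of blank rows.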
