R For 循环和 If-else data.table

Question

I'm stuck on a for-loop that I'm trying to create.我被困在我试图创建的 for 循环中。 The example dataset is below:示例数据集如下：

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18278, 18278, 18278, 18278), class = "Date"), 
    serv_to_dt = structure(c(18262, 18262, 18262, 18265, 18282, 
    18282, 18299, 18299), class = "Date"), new_pos = c("IP", 
    "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
    0, 0, 3, 4, 4, 21, 21)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

I'm trying to create a new column, called start_date.我正在尝试创建一个名为 start_date 的新列。 This column would be created based off of the serv_from_dt and serv_to_dt dates of each person_id.此列将根据每个 person_id 的 serv_from_dt 和 serv_to_dt 日期创建。 The way I'm doing this so far is as follows:到目前为止，我这样做的方式如下：

find the unique serv_from_dt's by each person_id where the date differences between serv_from_dt and serv_to_dt are greater than 0 (let's just call this diff_date);通过每个 person_id 找到唯一的 serv_from_dt，其中 serv_from_dt 和 serv_to_dt 之间的日期差异大于 0（我们称之为 diff_date）； if, by row, the serv_frm_dt is >= the person_id's MAX unique diff_date, and the serv_to_dt <= person_id's MAX unique diff_date, then label as that unique diff_date.如果按行，serv_frm_dt >= person_id 的 MAX 唯一 diff_date，并且 serv_to_dt <= person_id 的 MAX 唯一 diff_date，则标记为该唯一 diff_date。 I have this so far:到目前为止我有这个：

 values=ex[,.(uniqueN(sort(unique(serv_to_dt[ex$days_diff>0]), TRUE))), person_id]
    n = as.numeric(values[,1])
    m = as.numeric(values[,2])

for (i in m){
  ex[,`:=`(min_start = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] & 
                             serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]), 
                           sort(unique(serv_from_dt[ex$days_diff>0]))[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] & 
                                                                                     serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]), 
                                                                                  sort(unique(serv_from_dt[ex$days_diff>0]))[i], serv_from_dt)),
           max_end = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] & 
                                  serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]), 
                               sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] & 
                                                                                         serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]), 
                                                                                      sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i], serv_from_dt))), prs_nat_key]
}

the above code gives me exactly what I want, but I don't know how to scale this for a larger dataset with multiple person_ids and multiple day_diffs.上面的代码正是我想要的，但我不知道如何为具有多个 person_ids 和多个 day_diffs 的更大数据集扩展它。 I would like the code to be such that if the serv_frm/serv_to_dts do not hold true of being between the max unique diff_date, to loop to the next unique diff_date.我希望代码是这样的，如果 serv_frm/serv_to_dts 在最大唯一 diff_date 之间不成立，则循环到下一个唯一 diff_date。 In this case, both person_id's only have 1 unique diff_date (so m = 1), but I would like to update the code to hold true in instances where m > 1. I've also tried doing it using base R but keep getting errors:在这种情况下，两个 person_id 都只有 1 个唯一的 diff_date（所以 m = 1），但我想更新代码以在 m > 1 的情况下保持正确。我也尝试使用 base R 来做，但不断收到错误：

for(j in 1:m){

    
    ex[, min_start := if((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[j] & 
                          serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[j])) sort(unique(serv_from_dt[ex$days_diff>0]))[j]]
  j = j+ 1
  
}

any help would be greatly appreciated.任何帮助将不胜感激。

Answer 1

My end goal was to create two new columns called min_start and max_end.我的最终目标是创建两个名为 min_start 和 max_end 的新列。 I realized instead of doing ifelse statements, I could do a join.我意识到我可以做一个连接而不是做 ifelse 语句。 Here are my steps using a slightly larger example dataset:以下是我使用稍大的示例数据集的步骤：

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18275, 18275, 18275, 18275, 18278, 18278, 
18278, 18278), class = "Date"), serv_to_dt = structure(c(18262, 
18262, 18262, 18265, 18275, 18278, 18278, 18278, 18282, 18282, 
18299, 18299), class = "Date"), new_pos = c("IP", "IP", "IP", 
"IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
0, 0, 3, 0, 3, 3, 3, 4, 4, 21, 21)), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"))

create a new dataframe that has only the unique start/end dates for each person:创建一个新的数据框，其中每个人只有唯一的开始/结束日期：

date_period <- ex[, .(unique_start = unique(serv_from_dt[days_diff>0]),
                      unique_end = unique(serv_to_dt[days_diff>0])), prs_nat_key][order(prs_nat_key,unique_start,-unique_end),]

date_period %<>% distinct(prs_nat_key, unique_start, .keep_all = TRUE) %>% setDT()

do a left join on this condition: if date_period$prs_nat_key = ex$prs_nat_key & ex$serv_from_dt >= date_period$unique_start & ex$serv_from_dt <= date_period$unique_end & ex$serv_to_dt >= date_period$unique_start & ex$serv_to_dt <= date_period$unique_end在此条件下进行左连接：如果 date_period$prs_nat_key = ex$prs_nat_key & ex$serv_from_dt >= date_period$unique_start & ex$serv_from_dt <= date_period$unique_end & ex$serv_to_dt >= date_period$unique_start & ex$serv_to_dt < $unique_end

ex[, c("start_date", "end_date") := 
             date_period[ex, # join
                 .(unique_start, unique_end),
                 on = .(unique_start < serv_from_dt,
                        unique_start < serv_to_dt,
                        unique_end > serv_to_dt,
                        unique_end > serv_from_dt,
                        prs_nat_key = prs_nat_key)]]

which I found from this question --> Conditional join in data.table?我从这个问题中发现的 --> 数据表中的条件连接？

Answer 2

Not sure what your final outcome should be, but it looks overcomplicated.不确定你的最终结果应该是什么，但它看起来过于复杂。 For example the date_period table you created could be done like this:例如，您创建的 date_period 表可以这样完成：

ex[, .(unique_start = first(serv_from_dt), unique_end = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#      prs_nat_key serv_from_dt unique_start unique_end
# 1: 8240588160001   2020-01-01   2020-01-01 2020-01-04
# 2: 8240588160001   2020-01-14   2020-01-14 2020-01-17
# 3:     106705689   2020-01-17   2020-01-17 2020-02-07

As it seems that you try to rejoin it back to the original table than perhaps this is what you want.由于您似乎试图将其重新加入到原始表中，这也许正是您想要的。 Yes this is all that is needed from the original table you posted.是的，这就是您发布的原始表格所需的全部内容。

ex[, `:=` (start_date = first(serv_from_dt), end_date = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#                                person_id   prs_nat_key serv_from_dt serv_to_dt new_pos days_diff start_date   end_date
#  1: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  2: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  3: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  4: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-04      IP         3 2020-01-01 2020-01-04
#  5: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-14      IP         0 2020-01-14 2020-01-17
#  6: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  7: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  8: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  9: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 10: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 11: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07
# 12: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07

R For 循环和 If-else data.table

问题描述

2 个解决方案

解决方案1
0 2021-10-26 01:19:48

解决方案2
0 2021-10-27 12:54:29

R For 循环和 If-else data.table

问题描述

2 个解决方案

解决方案1 0 2021-10-26 01:19:48

解决方案2 0 2021-10-27 12:54:29

解决方案1
0 2021-10-26 01:19:48

解决方案2
0 2021-10-27 12:54:29