
R For Loop and If-else data.table

I'm stuck on a for loop I'm trying to create. A sample dataset is below:

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18278, 18278, 18278, 18278), class = "Date"), 
    serv_to_dt = structure(c(18262, 18262, 18262, 18265, 18282, 
    18282, 18299, 18299), class = "Date"), new_pos = c("IP", 
    "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
    0, 0, 3, 4, 4, 21, 21)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

I'm trying to create a new column called start_date. The column is derived from each person_id's serv_from_dt and serv_to_dt dates. The way I've approached it so far is as follows:

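For each person_id, find the unique serv_from_dt values where the date difference between serv_from_dt and serv_to_dt is greater than 0 (let's call this diff_date). Then, row by row, if serv_from_dt >= the MAX unique diff_date for the person_id and serv_to_dt <= the MAX unique diff_date for the person_id, flag the row with that unique diff_date. I have this so far: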

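# Count the distinct end dates per person among rows with days_diff > 0
# (days_diff is referenced directly so it stays aligned with each group)
values <- ex[, .(n_dates = uniqueN(serv_to_dt[days_diff > 0])), by = person_id]
# Largest number of distinct periods found for any one person
m <- max(values$n_dates)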

for (i in m) {
  ex[, c("min_start", "max_end") := {
    # Unique period bounds for this person, from rows with days_diff > 0 only
    starts <- sort(unique(serv_from_dt[days_diff > 0]))
    ends   <- sort(unique(serv_to_dt[days_diff > 0]), decreasing = TRUE)
    # Does the row fall inside the first period, or inside the i-th one?
    in_first <- serv_to_dt <= ends[1] & serv_from_dt >= starts[1]
    in_ith   <- serv_to_dt <= ends[i] & serv_from_dt >= starts[i]
    list(
      fifelse(in_first, starts[1], fifelse(in_ith, starts[i], serv_from_dt)),
      fifelse(in_first, ends[1],   fifelse(in_ith, ends[i],   serv_from_dt))
    )
  }, by = prs_nat_key]
}

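The code above does exactly what I want, but I can't figure out how to extend it to a larger dataset with multiple person_ids and multiple days_diff values. I'd like the code to move on to the next unique diff_date whenever the serv_from_dt/serv_to_dt pair does not fall within the max unique diff_date. In this case both person_ids have only 1 unique diff_date (so m = 1), but I want to update the code so it stays correct when m > 1. I also tried doing it with base R's if, but I keep getting errors: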

for (j in 1:m) {
  ex[, min_start := if ((serv_to_dt <= sort(unique(serv_to_dt[ex$days_diff > 0]), TRUE)[j] &
                         serv_from_dt >= sort(unique(serv_from_dt[ex$days_diff > 0]))[j]))
                      sort(unique(serv_from_dt[ex$days_diff > 0]))[j]]
  j = j + 1
}

Any help would be greatly appreciated.
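
The errors from the `if` attempt are most likely because `if` expects a single TRUE/FALSE, while a comparison between whole columns produces one value per row. A minimal sketch of the vectorized alternative, using a hypothetical toy table (`toy`, `cutoff` and the column names are illustrative, not taken from the question):

```r
library(data.table)

# Hypothetical toy data, not the dataset from the question
toy <- data.table(
  id      = c("a", "a", "b"),
  from_dt = as.Date(c("2020-01-01", "2020-01-14", "2020-01-17")),
  to_dt   = as.Date(c("2020-01-04", "2020-01-17", "2020-02-07"))
)

# Illustrative threshold date
cutoff <- as.Date("2020-01-10")

# fifelse() evaluates the test element by element, so a column-wide
# comparison is fine here; rows failing the test get a missing Date
toy[, flag_dt := fifelse(from_dt >= cutoff, from_dt, as.Date(NA))]
```

Rows that fail the test end up as NA rather than raising an error, which is usually easier to extend to several conditions than a row-wise `if`.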

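My end goal is to create two new columns called min_start and max_end. I realized I could do a join instead of the ifelse statements. Here are my steps, using a slightly larger example dataset: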

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18275, 18275, 18275, 18275, 18278, 18278, 
18278, 18278), class = "Date"), serv_to_dt = structure(c(18262, 
18262, 18262, 18265, 18275, 18278, 18278, 18278, 18282, 18282, 
18299, 18299), class = "Date"), new_pos = c("IP", "IP", "IP", 
"IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
0, 0, 3, 0, 3, 3, 3, 4, 4, 21, 21)), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"))

Create a new data table holding only the unique start/end dates for each person:

date_period <- ex[, .(unique_start = unique(serv_from_dt[days_diff>0]),
                      unique_end = unique(serv_to_dt[days_diff>0])), prs_nat_key][order(prs_nat_key,unique_start,-unique_end),]

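# %<>% comes from magrittr and distinct() from dplyr. After the ordering above,
# the first row per (prs_nat_key, unique_start) holds the latest unique_end.
library(magrittr)
library(dplyr)
date_period %<>% distinct(prs_nat_key, unique_start, .keep_all = TRUE) %>% setDT()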
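
The "keep one row per start date" step can also be sketched in data.table alone, without magrittr or dplyr. This is an alternative under assumptions, not the method used above; `periods`, `periods_one` and the column names are hypothetical:

```r
library(data.table)

# Hypothetical toy table with two candidate end dates for the same start
periods <- data.table(
  key_id       = c("p1", "p1", "p2", "p2"),
  period_start = as.Date(c("2020-01-01", "2020-01-14", "2020-01-17", "2020-01-17")),
  period_end   = as.Date(c("2020-01-04", "2020-01-17", "2020-01-21", "2020-02-07"))
)

# Sort so the latest end date comes first within each (key_id, period_start)
setorder(periods, key_id, period_start, -period_end)

# unique() with by= keeps the first row of each group, i.e. the latest end
periods_one <- unique(periods, by = c("key_id", "period_start"))
```

Because the table is sorted by descending end date first, the retained row for each start date is the one with the widest period.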

Then do a left join on these conditions: date_period$prs_nat_key == ex$prs_nat_key & ex$serv_from_dt >= date_period$unique_start & ex$serv_from_dt <= date_period$unique_end & ex$serv_to_dt >= date_period$unique_start & ex$serv_to_dt <= date_period$unique_end

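ex[, c("start_date", "end_date") :=
     date_period[ex, # non-equi join: one lookup per row of ex
                 # the x. prefix returns date_period's own values; unprefixed
                 # join columns would carry the values from ex instead
                 .(x.unique_start, x.unique_end),
                 # inclusive bounds, so rows starting or ending exactly on a
                 # period boundary still match
                 on = .(prs_nat_key,
                        unique_start <= serv_from_dt,
                        unique_start <= serv_to_dt,
                        unique_end >= serv_to_dt,
                        unique_end >= serv_from_dt)]]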

I found this approach in this question --> Conditional join in data.table?
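
To make the conditional (non-equi) join easier to follow, here is a small self-contained sketch. The tables `periods` and `claims` and all their columns are hypothetical stand-ins for date_period and ex, and the sketch assumes each row falls inside at most one period:

```r
library(data.table)

# Hypothetical lookup table of periods per key
periods <- data.table(
  key_id       = c("p1", "p1", "p2"),
  period_start = as.Date(c("2020-01-01", "2020-01-14", "2020-01-17")),
  period_end   = as.Date(c("2020-01-04", "2020-01-17", "2020-02-07"))
)

# Hypothetical rows to label; the last one lies outside every period
claims <- data.table(
  key_id  = c("p1", "p1", "p2", "p2"),
  from_dt = as.Date(c("2020-01-01", "2020-01-14", "2020-01-17", "2020-03-01")),
  to_dt   = as.Date(c("2020-01-01", "2020-01-17", "2020-01-21", "2020-03-02"))
)

# For each claims row, find the period containing [from_dt, to_dt].
# x.period_start / x.period_end request the values stored in periods.
matched <- periods[claims,
                   .(start_date = x.period_start, end_date = x.period_end),
                   on = .(key_id, period_start <= from_dt, period_end >= to_dt)]

# One result row per claims row, in the same order, so it can be assigned back
claims[, c("start_date", "end_date") := matched]
```

Rows with no containing period get NA in both new columns. If a row could fall inside several periods, the join would return extra rows and the assignment would no longer line up, so that case needs separate handling.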

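I'm not sure what your final result should look like, but this looks overly complicated. For example, the date_period table you created could be built like this: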

ex[, .(unique_start = first(serv_from_dt), unique_end = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#      prs_nat_key serv_from_dt unique_start unique_end
# 1: 8240588160001   2020-01-01   2020-01-01 2020-01-04
# 2: 8240588160001   2020-01-14   2020-01-14 2020-01-17
# 3:     106705689   2020-01-17   2020-01-17 2020-02-07

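Since you seem to be joining this back onto the original table, the following may be exactly what you want. It is all that is needed for the original table you posted: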

ex[, `:=` (start_date = first(serv_from_dt), end_date = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#                                person_id   prs_nat_key serv_from_dt serv_to_dt new_pos days_diff start_date   end_date
#  1: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  2: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  3: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  4: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-04      IP         3 2020-01-01 2020-01-04
#  5: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-14      IP         0 2020-01-14 2020-01-17
#  6: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  7: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  8: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  9: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 10: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 11: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07
# 12: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07
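
Note that first() and last() depend on the rows already being sorted by date within each group. A hedged variant that does not rely on row order is to take max() of the end dates; the table `toy` and its columns below are hypothetical:

```r
library(data.table)

# Hypothetical toy data with rows deliberately out of date order
toy <- data.table(
  key_id  = c("p1", "p1", "p1", "p2", "p2"),
  from_dt = as.Date(c("2020-01-01", "2020-01-01", "2020-01-14",
                      "2020-01-17", "2020-01-17")),
  to_dt   = as.Date(c("2020-01-04", "2020-01-01", "2020-01-17",
                      "2020-02-07", "2020-01-21"))
)

# Group by key and start date; max() picks the latest end date in each
# group regardless of how the rows happen to be ordered
toy[, `:=`(start_date = from_dt, end_date = max(to_dt)), by = .(key_id, from_dt)]
```

On data that is already sorted, this should give the same result as the first()/last() version.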
