R For Loop and If-else data.table
I'm stuck on a for loop I'm trying to create. An example dataset is below:
ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8",
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8",
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65",
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65",
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001",
"8240588160001", "8240588160001", "8240588160001", "106705689",
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262,
18262, 18262, 18262, 18278, 18278, 18278, 18278), class = "Date"),
serv_to_dt = structure(c(18262, 18262, 18262, 18265, 18282,
18282, 18299, 18299), class = "Date"), new_pos = c("IP",
"IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0,
0, 0, 3, 4, 4, 21, 21)), row.names = c(NA, -8L), class = c("data.table",
"data.frame"))
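I'm trying to create a new column called start_date. It will be derived from each person_id's serv_from_dt and serv_to_dt dates. My approach so far is as follows:
For each person_id, find the unique serv_from_dt values where the date difference between serv_from_dt and serv_to_dt is greater than 0 (call each one a diff_date). Then, row by row, if serv_from_dt >= the person_id's MAX unique diff_date and serv_to_dt <= the person_id's MAX unique diff_date, flag the row with that unique diff_date. So far I have this: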
values=ex[,.(uniqueN(sort(unique(serv_to_dt[ex$days_diff>0]), TRUE))), person_id]
n = as.numeric(values[,1])
m = as.numeric(values[,2])
for (i in m){
ex[,`:=`(min_start = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] &
serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]),
sort(unique(serv_from_dt[ex$days_diff>0]))[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] &
serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]),
sort(unique(serv_from_dt[ex$days_diff>0]))[i], serv_from_dt)),
max_end = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] &
serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]),
sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] &
serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]),
sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i], serv_from_dt))), prs_nat_key]
}
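The code above does exactly what I want, but I don't know how to extend it to a larger dataset with multiple person_ids and multiple days_diff values. I'd like the code to work so that if a row's serv_from_dt/serv_to_dt does not fall within the MAX unique diff_date, it moves on to the next unique diff_date. In this case both person_ids have only 1 unique diff_date (so m = 1), but I'd like to update the code so that it stays correct when m > 1. I also tried doing this in base R, but I keep getting errors: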
for(j in 1:m){
ex[, min_start := if((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[j] &
serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[j])) sort(unique(serv_from_dt[ex$days_diff>0]))[j]]
j = j+ 1
}
Any help would be greatly appreciated.
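A note on the base R attempt: the error message is not shown, so this is only a likely cause. if() expects a single TRUE/FALSE, while the condition there is a whole column of comparisons; since R 4.2.0 a condition of length greater than 1 is an error. A row-wise choice needs a vectorised function such as fifelse(). A minimal sketch, where the table dt and the cutoff date are hypothetical values for illustration only:

```r
library(data.table)

# Hypothetical two-row table, not the dataset above
dt <- data.table(
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-14")),
  serv_to_dt   = as.Date(c("2020-01-04", "2020-01-14"))
)
cutoff <- as.Date("2020-01-04")  # hypothetical period end

# fifelse() evaluates the condition once per row; both branches must share
# the same type and class (Date here), unlike a scalar if()
dt[, min_start := fifelse(serv_to_dt <= cutoff,
                          as.Date("2020-01-01"),  # value when TRUE
                          serv_from_dt)]          # value when FALSE
```

The first row ends on or before the cutoff and takes the fixed start date; the second row does not and keeps its own serv_from_dt.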
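My end goal is to create two new columns called min_start and max_end. I realized I can do a join instead of writing ifelse statements. Here are my steps, using a slightly larger example dataset: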
ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8",
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8",
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8",
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8",
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65",
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65",
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001",
"8240588160001", "8240588160001", "8240588160001", "8240588160001",
"8240588160001", "8240588160001", "8240588160001", "106705689",
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262,
18262, 18262, 18262, 18275, 18275, 18275, 18275, 18278, 18278,
18278, 18278), class = "Date"), serv_to_dt = structure(c(18262,
18262, 18262, 18265, 18275, 18278, 18278, 18278, 18282, 18282,
18299, 18299), class = "Date"), new_pos = c("IP", "IP", "IP",
"IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0,
0, 0, 3, 0, 3, 3, 3, 4, 4, 21, 21)), row.names = c(NA, -12L), class = c("data.table",
"data.frame"))
Create a new data frame with only the unique start/end dates for each person:
# distinct() comes from dplyr and %<>% from magrittr
library(dplyr)
library(magrittr)
date_period <- ex[, .(unique_start = unique(serv_from_dt[days_diff>0]),
unique_end = unique(serv_to_dt[days_diff>0])), prs_nat_key][order(prs_nat_key,unique_start,-unique_end),]
date_period %<>% distinct(prs_nat_key, unique_start, .keep_all = TRUE) %>% setDT()
Then do a left join under this condition: date_period$prs_nat_key = ex$prs_nat_key & ex$serv_from_dt >= date_period$unique_start & ex$serv_from_dt <= date_period$unique_end & ex$serv_to_dt >= date_period$unique_start & ex$serv_to_dt < date_period$unique_end
ex[, c("start_date", "end_date") :=
date_period[ex, # join
.(unique_start, unique_end),
on = .(unique_start < serv_from_dt,
unique_start < serv_to_dt,
unique_end > serv_to_dt,
unique_end > serv_from_dt,
prs_nat_key = prs_nat_key)]]
I found this approach in this question --> Conditional join in data.table?
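For comparison, here is a minimal sketch of the same idea written as a data.table update join with inclusive bounds (<= and >=), so that rows starting exactly on a period's start date still match. The tables ex_demo and periods and their values are made up for illustration; they are not the dataset above. The sketch assumes serv_from_dt <= serv_to_dt on every row, in which case two inequalities are enough to say that a row lies inside a period.

```r
library(data.table)

# Hypothetical service rows
ex_demo <- data.table(
  prs_nat_key  = c("A", "A", "A", "B"),
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-01", "2020-01-14", "2020-01-17")),
  serv_to_dt   = as.Date(c("2020-01-01", "2020-01-04", "2020-01-17", "2020-02-07"))
)

# Hypothetical periods, one row per person and start date
periods <- data.table(
  prs_nat_key  = c("A", "A", "B"),
  unique_start = as.Date(c("2020-01-01", "2020-01-14", "2020-01-17")),
  unique_end   = as.Date(c("2020-01-04", "2020-01-17", "2020-02-07"))
)

# Update join: for each period, find the ex_demo rows that fall inside it
# and write the period bounds onto those rows. The i. prefix reads the
# columns from periods; rows with no matching period stay NA.
ex_demo[periods,
        on = .(prs_nat_key,
               serv_from_dt >= unique_start,
               serv_to_dt   <= unique_end),
        `:=`(start_date = i.unique_start, end_date = i.unique_end)]
```

Because the columns are assigned by reference inside the join, there is no need to line up a separately joined result with the original rows.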
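I'm not sure what your end result should look like, but this seems overly complicated. For example, the date_period table you created can be built like this: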
ex[, .(unique_start = first(serv_from_dt), unique_end = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]
# prs_nat_key serv_from_dt unique_start unique_end
# 1: 8240588160001 2020-01-01 2020-01-01 2020-01-04
# 2: 8240588160001 2020-01-14 2020-01-14 2020-01-17
# 3: 106705689 2020-01-17 2020-01-17 2020-02-07
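Since you seem to be trying to join this back onto the original table, maybe the following is what you actually want. For the original table you posted, this is all that is needed: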
ex[, `:=` (start_date = first(serv_from_dt), end_date = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]
# person_id prs_nat_key serv_from_dt serv_to_dt new_pos days_diff start_date end_date
# 1: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-01 2020-01-01 IP 0 2020-01-01 2020-01-04
# 2: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-01 2020-01-01 IP 0 2020-01-01 2020-01-04
# 3: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-01 2020-01-01 IP 0 2020-01-01 2020-01-04
# 4: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-01 2020-01-04 IP 3 2020-01-01 2020-01-04
# 5: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-14 2020-01-14 IP 0 2020-01-14 2020-01-17
# 6: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-14 2020-01-17 IP 3 2020-01-14 2020-01-17
# 7: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-14 2020-01-17 IP 3 2020-01-14 2020-01-17
# 8: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001 2020-01-14 2020-01-17 IP 3 2020-01-14 2020-01-17
# 9: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65 106705689 2020-01-17 2020-01-21 IP 4 2020-01-17 2020-02-07
# 10: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65 106705689 2020-01-17 2020-01-21 IP 4 2020-01-17 2020-02-07
# 11: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65 106705689 2020-01-17 2020-02-07 IP 21 2020-01-17 2020-02-07
# 12: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65 106705689 2020-01-17 2020-02-07 IP 21 2020-01-17 2020-02-07
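One caveat: first() and last() depend on row order within each group, so the result above relies on each group being sorted by serv_to_dt. Below is a minimal sketch of an order-independent variant. It assumes the wanted end date is the latest serv_to_dt among rows sharing prs_nat_key and serv_from_dt, which is an assumption consistent with the output shown. The table demo is hypothetical and deliberately unsorted.

```r
library(data.table)

# Hypothetical rows, deliberately not sorted by serv_to_dt
demo <- data.table(
  prs_nat_key  = c("A", "A", "A", "B", "B"),
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                           "2020-01-17", "2020-01-17")),
  serv_to_dt   = as.Date(c("2020-01-04", "2020-01-01", "2020-01-01",
                           "2020-02-07", "2020-01-21"))
)

# The start of the period is the grouping column itself
demo[, start_date := serv_from_dt]

# max() picks the latest end date regardless of row order
demo[, end_date := max(serv_to_dt), by = .(prs_nat_key, serv_from_dt)]
```

With last(serv_to_dt) the "A" group here would get 2020-01-01 as its end date, whereas max() returns 2020-01-04.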