R left join by multiple columns resulting in NAs
I have two data frames, alpha and beta.
dput(alpha)
structure(list(ID = c(29503L, 29507L, 29508L, 29510L),
Q_ID = structure(1:4, .Label = c("q:1392763916495:441", "q:1392763916495:445", "q:1392763916495:449", "q:1392763920794:458"),
class = "factor"),
L_Atmpt = c(0L, 0L, 0L, 0L),
Q_Atmpt = c(0L, 1L, 0L, 1L),
Q_Result = c(1L, 1L, 1L, 0L),
Time_on_Screen = c(13839L, 185162L, 264418L, 2183464L),
Start_Time = structure(1:4, .Label = c("2017-10-31Ê11:51:20", "2017-10-31Ê11:54:26", "2017-10-31Ê11:59:09", "2017-10-31Ê12:35:34"),
class = "factor"),
End_Time = structure(1:4, .Label = c("2017-10-31Ê11:51:33", "2017-10-31Ê11:57:31", "2017-10-1Ê12:03:33", "2017-10-31Ê13:11:57"),
class = "factor"),
Duration = c(173L, 55L, 98L, 1921L)),
class = "data.frame", row.names = c(NA, -4L))
dput(beta)
structure(list(ID = c(29503L, 29507L, 29508L, 29510L, 29515L, 30160L),
Q_ID = structure(1:6, .Label = c("q:1392763916495:441", "q:1392763916495:445", "q:1392763916495:449", "q:1392763920794:458", "q:1392763920794:462", "q:1392763925803:530"),
class = "factor"),
L_Atmpt = c(0L, 0L, 0L, 0L, 0L, 1L),
Q_Atmpt = c(0L, 1L, 0L, 1L, 0L, 0L),
Q_Result = c(1L, 1L, 1L, 0L, 0L, 0L),
Time_on_Screen = c(13839L, 185162L, 264418L, 2183464L, 768470L, 885800L),
Start_Time = structure(c(2L, 3L, 4L, 5L, 6L, 1L), .Label = c("2017-10-25Ê00:19:08", "2017-10-31Ê11:51:20", "2017-10-31Ê11:54:26", "2017-10-31Ê11:59:09", "2017-10-31Ê12:35:34", "2017-10-31Ê13:16:09"),
class = "factor"),
End_Time = structure(c(2L, 3L, 4L, 5L, 6L, 1L), .Label = c("2017-10-25Ê00:33:53", "2017-10-31Ê11:51:33", "2017-10-31Ê11:57:31", "2017-10-31Ê12:03:33", "2017-10-31Ê13:11:57", "2017-10-31Ê13:28:57"),
class = "factor")),
class = "data.frame", row.names = c(NA,-6L))
I want to merge them into a final data frame, gamma. The data frame alpha has one special column, alpha$duration, which I need to append to the end of beta.
beta has more rows than alpha, so I want to perform a left join that keeps every row of beta. This means that some entries of the gamma$duration column will be NULL or NA.
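I expected the NULL or NA entries to be the ones where the ID in alpha does not match the ID in beta. However, with my original data (more than 10K rows and about 20 variables), I get something like this: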
ID Q_ID L_Atmpt Q_Atmpt Q_Result Time_on_Screen Start_Time End_Time Duration
29503 q:1392763916495:441 0 0 1 13839 2017-10-31Ê11:51:20 2017-10-31Ê11:51:33 NA
29507 q:1392763916495:445 0 1 1 185162 2017-10-31Ê11:54:26 2017-10-31Ê11:57:31 NA
29508 q:1392763916495:449 0 0 1 264418 2017-10-31Ê11:59:09 2017-10-31Ê12:03:33 NA
29510 q:1392763920794:458 0 1 0 2183464 2017-10-31Ê12:35:34 2017-10-31Ê13:11:57 NA
29515 q:1392763920794:462 0 0 0 768470 2017-10-31Ê13:16:09 2017-10-31Ê13:28:57 NA
30160 q:1392763925803:530 1 0 0 885800 2017-10-25Ê00:19:08 2017-10-25Ê00:33:53 NA
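Unfortunately, the toy example I have shared does not reproduce my problem. I understand that it may be hard to guess why I am getting NA in my original problem, but any ideas or suggestions would be much appreciated.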
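One way to narrow down unexpected NAs like these is to check, column by column, whether rows that share an ID really carry identical values in the other join columns. A join on every shared column fails for a row as soon as one of those columns differs. The sketch below uses base R only; the data frames `b` and `a` and the trailing-space mismatch are made up for illustration and are not the original data.

```r
# Hypothetical miniature data (assumption: not the original data set).
b <- data.frame(
  ID = c(1L, 2L, 3L),
  Start_Time = c("2017-10-31 11:51:20", "2017-10-31 11:54:26", "2017-10-31 11:59:09"),
  stringsAsFactors = FALSE
)
a <- data.frame(
  ID = c(1L, 2L),
  # Row 2 carries a trailing space, so it looks identical when printed.
  Start_Time = c("2017-10-31 11:51:20", "2017-10-31 11:54:26 "),
  duration = c(173L, 55L),
  stringsAsFactors = FALSE
)

shared <- intersect(names(b), names(a))

# Left join on every shared column: ID 2 finds no partner and gets NA.
g_all <- merge(b, a, by = shared, all.x = TRUE)

# Left join on ID only: ID 2 now matches.
g_id <- merge(b, a[, c("ID", "duration")], by = "ID", all.x = TRUE)

# For rows that match on ID, count disagreements in each other shared column.
m <- merge(b, a, by = "ID", suffixes = c(".b", ".a"))
mismatch <- sapply(setdiff(shared, "ID"), function(col) {
  sum(m[[paste0(col, ".b")]] != m[[paste0(col, ".a")]], na.rm = TRUE)
})
print(mismatch)
```

A non-zero count for a column suggests that this column, rather than ID, is what prevents the rows from matching.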
For reference, here are the different scripts I have tried, all of which give the same output:
library(plyr)
gamma = join(beta, alpha, type = "left")
library(dplyr)
gamma = left_join(beta, alpha)
library(sqldf)
gamma = sqldf('SELECT beta.*, alpha.duration
FROM beta LEFT JOIN alpha
on beta.ID == alpha.ID AND
beta.Q_ID == alpha.Q_ID AND
beta.L_Atmpt == alpha.L_Atmpt AND
beta.Q_Atmpt == alpha.Q_Atmpt AND
beta.Start_Time == alpha.Start_Time')
I should mention that the alpha$duration column in my original data frame was created after some preprocessing steps, such as:
#Step 1: Ordering the data by ID, Q_ID, Q_Atmpt and Start_Time
beta = beta[with(beta, order(ID, Q_ID, Q_Atmpt, Start_Time)), ]
#Step 2: End_Time lagging
library(Hmisc)
# to calculate the time difference we lag the End_Time
beta$End_Time_forward = Lag(beta$End_Time, +1)
# for comparisons, we also lag the IDs
beta$ID_forward = Lag(beta$ID, +1)
#Step 3: Now calculate the required time differences
library(sqldf)
alpha = sqldf('SELECT beta.*,
(Start_Time - End_Time_forward),
(End_Time - End_Time_forward)
FROM beta
WHERE ID_forward == ID')
#Step 4: Columns renaming
names(alpha)[names(alpha) == "(Start_Time - End_Time_forward)"] = "duration"
names(alpha)[names(alpha) == "(End_Time - End_Time_forward)"] = "end_duration"
#Step 5:Few instances have negative duration, so replace the gap between
# (last end time and current start time) with the (last end time and current
# end time) difference
library(dplyr)  # provides %>%, mutate() and if_else()
alpha = alpha %>%
  mutate(duration = if_else(duration < 0, end_duration, duration))
#Step 6: Convert the remaining negatives with NAs
alpha$duration[alpha$duration < 0] <- NA
#Step 7: Now replace those NAs by using the imputeTS function
library(imputeTS)
alpha$duration = na_locf(alpha$duration, option = 'locf',
na_remaining = 'rev', maxgap = Inf)
I suspect that the last two steps, in which I manipulate the gamma$duration variable, may be related to this unexpected result.
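As a cross-check on the preprocessing, the gap between a row's start time and the previous row's end time can also be computed in base R after parsing the timestamps as date-times, which avoids doing arithmetic on text or factor values. This is only a sketch under assumptions: the data frame `d` is made up, it is already sorted by ID and Start_Time, and the timestamps use a plain space between date and time.

```r
# Hypothetical data (assumption), already sorted by ID and Start_Time.
d <- data.frame(
  ID         = c(1L, 1L, 2L),
  Start_Time = c("2017-10-31 11:51:20", "2017-10-31 11:54:26", "2017-10-31 12:35:34"),
  End_Time   = c("2017-10-31 11:51:33", "2017-10-31 11:57:31", "2017-10-31 13:11:57"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps into POSIXct so that subtraction is meaningful.
fmt   <- "%Y-%m-%d %H:%M:%S"
start <- as.POSIXct(d$Start_Time, format = fmt, tz = "UTC")
end   <- as.POSIXct(d$End_Time,   format = fmt, tz = "UTC")

# Shift End_Time and ID down by one row (the first row has no predecessor).
n        <- nrow(d)
prev_end <- c(as.POSIXct(NA, tz = "UTC"), end[-n])
prev_id  <- c(NA, d$ID[-n])

# Seconds between the previous end and the current start.
gap <- as.numeric(difftime(start, prev_end, units = "secs"))

# Keep the gap only when the previous row belongs to the same ID.
gap[is.na(prev_id) | prev_id != d$ID] <- NA
d$duration <- gap
print(d)
```

Unlike the WHERE clause in Step 3, this keeps every row and marks the first row of each ID as NA, so the result has the same number of rows as its input.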
I could not identify the actual cause of this problem, but I found a workaround:
beta$duration = as.integer(0)
test2 = merge(x = beta, y = alpha,
by = c("ID", "Q_ID", "L_Atmpt", "Q_Atmpt", "Q_Result", "Time_on_Screen", "Start_Time", "End_Time"),
all.x = TRUE)
With this, I can retain the duration column of data frame alpha and then use it as needed.
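A quick sanity check on a merge like this is to confirm that the left table's row count is preserved and to list the rows that found no partner. The sketch below also passes only the key columns plus `duration` from the right table, because `merge()` adds `.x`/`.y` suffixes to any non-key column present in both inputs. The data frames `b` and `a` and the two-column key are illustrative assumptions.

```r
# Hypothetical miniature data (assumption: not the original data set).
b <- data.frame(ID = c(1L, 2L, 3L), Q_ID = c("q1", "q2", "q3"),
                stringsAsFactors = FALSE)
a <- data.frame(ID = c(1L, 2L), Q_ID = c("q1", "q2"),
                duration = c(173L, 55L), stringsAsFactors = FALSE)

keys <- c("ID", "Q_ID")

# Duplicate keys in the right table would multiply rows in a left join.
stopifnot(anyDuplicated(a[keys]) == 0)

# Left join that brings over only the duration column.
g <- merge(b, a[c(keys, "duration")], by = keys, all.x = TRUE)

# The join should neither drop nor duplicate rows of b.
stopifnot(nrow(g) == nrow(b))

# Rows of b without a partner in a.
unmatched <- g[is.na(g$duration), keys]
print(unmatched)
```

Note that `merge()` sorts the result by the key columns by default, so the row order of `g` can differ from that of `b`.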