R: left join on multiple columns results in NA

Question

I have two data frames, alpha and beta.

dput(alpha)

structure(list(ID = c(29503L, 29507L, 29508L, 29510L), 
               Q_ID = structure(1:4, .Label = c("q:1392763916495:441", "q:1392763916495:445", "q:1392763916495:449", "q:1392763920794:458"), 
                  class = "factor"), 
               L_Atmpt = c(0L, 0L, 0L, 0L), 
               Q_Atmpt = c(0L, 1L, 0L, 1L), 
               Q_Result = c(1L, 1L, 1L, 0L), 
               Time_on_Screen = c(13839L, 185162L, 264418L, 2183464L), 
               Start_Time = structure(1:4, .Label = c("2017-10-31Ê11:51:20", "2017-10-31Ê11:54:26", "2017-10-31Ê11:59:09", "2017-10-31Ê12:35:34"), 
                  class = "factor"), 
               End_Time = structure(1:4, .Label = c("2017-10-31Ê11:51:33", "2017-10-31Ê11:57:31", "2017-10-1Ê12:03:33", "2017-10-31Ê13:11:57"), 
                  class = "factor"), 
               Duration = c(173L, 55L, 98L, 1921L)), 
               class = "data.frame", row.names = c(NA, -4L))

dput(beta)
structure(list(ID = c(29503L, 29507L, 29508L, 29510L, 29515L, 30160L), 
               Q_ID = structure(1:6, .Label = c("q:1392763916495:441", "q:1392763916495:445", "q:1392763916495:449", "q:1392763920794:458", "q:1392763920794:462", "q:1392763925803:530"), 
                 class = "factor"), 
               L_Atmpt = c(0L, 0L, 0L, 0L, 0L, 1L), 
               Q_Atmpt = c(0L, 1L, 0L, 1L, 0L, 0L), 
               Q_Result = c(1L, 1L, 1L, 0L, 0L, 0L), 
               Time_on_Screen = c(13839L, 185162L, 264418L, 2183464L, 768470L, 885800L), 
               Start_Time = structure(c(2L, 3L, 4L, 5L, 6L, 1L), .Label = c("2017-10-25Ê00:19:08", "2017-10-31Ê11:51:20", "2017-10-31Ê11:54:26", "2017-10-31Ê11:59:09", "2017-10-31Ê12:35:34", "2017-10-31Ê13:16:09"), 
                 class = "factor"), 
               End_Time = structure(c(2L, 3L, 4L, 5L, 6L, 1L), .Label = c("2017-10-25Ê00:33:53", "2017-10-31Ê11:51:33", "2017-10-31Ê11:57:31", "2017-10-31Ê12:03:33", "2017-10-31Ê13:11:57", "2017-10-31Ê13:28:57"), 
                 class = "factor")), 
               class = "data.frame", row.names = c(NA,-6L))

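I want to merge them into a final data frame, gamma. Data frame alpha has one special column, alpha$duration, which I need to append to the end of data frame beta.

beta has more rows than alpha, and I want a left join so that every row of beta is kept. This means some entries of gamma$duration will be NULL or NA.

I expected the NULL or NA entries to be the rows of beta whose ID has no match in alpha. However, with my original data (more than 10K rows and about 20 variables), I get something like this: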

ID    Q_ID                L_Atmpt Q_Atmpt Q_Result Time_on_Screen Start_Time          End_Time            Duration
29503 q:1392763916495:441 0       0       1        13839          2017-10-31Ê11:51:20 2017-10-31Ê11:51:33 NA
29507 q:1392763916495:445 0       1       1        185162         2017-10-31Ê11:54:26 2017-10-31Ê11:57:31 NA
29508 q:1392763916495:449 0       0       1        264418         2017-10-31Ê11:59:09 2017-10-31Ê12:03:33 NA
29510 q:1392763920794:458 0       1       0        2183464        2017-10-31Ê12:35:34 2017-10-31Ê13:11:57 NA
29515 q:1392763920794:462 0       0       0        768470         2017-10-31Ê13:16:09 2017-10-31Ê13:28:57 NA
30160 q:1392763925803:530 1       0       0        885800         2017-10-25Ê00:19:08 2017-10-25Ê00:33:53 NA

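Unfortunately, the toy example I shared does not reproduce my problem. I realize it may be hard to guess why I get NA with the original data, but any ideas or suggestions would be much appreciated.

For reference, here are the different scripts I tried; they all produce the same output: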

library(plyr)
gamma = join(beta, alpha, type = "left")

library(dplyr)
gamma = left_join(beta, alpha)

library(sqldf)
gamma = sqldf('SELECT beta.*, alpha.duration
               FROM beta LEFT JOIN alpha
               on beta.ID == alpha.ID AND
               beta.Q_ID == alpha.Q_ID AND
               beta.L_Atmpt == alpha.L_Atmpt AND
               beta.Q_Atmpt == alpha.Q_Atmpt AND
               beta.Start_Time == alpha.Start_Time')
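
One thing that may be worth checking: when no `by` argument is given, `join()` and `left_join()` match on every column name the two data frames share, so a difference in any shared column (not only ID) leaves a row unmatched. The sketch below illustrates the same behaviour with base R `merge()`, which also defaults to all shared columns. The data frames `beta_demo` and `alpha_demo` are made up for illustration and are not the real data.

```r
# Hypothetical miniature data, not the poster's real data.
beta_demo <- data.frame(
  ID = c(1L, 2L, 3L),
  End_Time = c("11:51:33", "11:57:31", "12:03:33"),
  stringsAsFactors = FALSE
)
alpha_demo <- data.frame(
  ID = c(1L, 2L),
  End_Time = c("11:51:33", "11:57:30"),  # second value differs from beta_demo
  duration = c(173L, 55L),
  stringsAsFactors = FALSE
)

# No `by`: the join uses all shared columns (ID and End_Time).
natural <- merge(beta_demo, alpha_demo, all.x = TRUE)

# Explicit key: join on ID only and bring in just the duration column.
keyed <- merge(beta_demo, alpha_demo[, c("ID", "duration")],
               by = "ID", all.x = TRUE)

natural$duration[natural$ID == 2L]  # NA, because End_Time differs for ID 2
keyed$duration[keyed$ID == 2L]      # 55, because only ID is compared
```

If ID alone does not identify a row in the real data, the explicit key would need to include whichever columns do.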

I should mention that the column alpha$duration in my original data frame was created after some preprocessing steps, such as:

#Step 1: Ordering the data by ID, Q_ID, Q_Atmpt and Start_Time
beta = beta[with(beta, order(ID, Q_ID, Q_Atmpt, Start_Time)), ]

#Step 2: End_Time lagging
library(Hmisc)
# to calculate the time difference we lag the End_Time
beta$End_Time_forward = Lag(beta$End_Time, +1)
# for comparisons, we also lag the IDs
beta$ID_forward = Lag(beta$ID, +1)

#Step 3: Now calculate the required time differences
library(sqldf)
alpha = sqldf('SELECT beta.*, 
                (Start_Time - End_Time_forward), 
                (End_Time - End_Time_forward)
              FROM beta
              WHERE ID_forward == ID')

#Step 4: Columns renaming
names(alpha)[names(alpha) == "(Start_Time - End_Time_forward)"] = "duration"
names(alpha)[names(alpha) == "(End_Time - End_Time_forward)"] = "end_duration"

#Step 5: A few instances have a negative duration; for those, replace the gap
# (last end time to current start time) with the difference
# (last end time to current end time)
alpha =  alpha %>%
  mutate(duration = if_else(duration < 0, end_duration, duration))

#Step 6: Replace the remaining negatives with NA
alpha$duration[alpha$duration < 0] <- NA

#Step 7: Now replace those NAs by using the imputeTS function
library(imputeTS)
alpha$duration = na_locf(alpha$duration, option = 'locf', 
                         na_remaining = 'rev', maxgap = Inf)

I suspect that the last two steps, where I manipulated the alpha$duration variable, may have something to do with this unexpected result.
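
To make Steps 6 and 7 concrete, here is a base-R sketch of what they do to a made-up `duration` vector. The helper `carry_forward` is hypothetical and only approximates `na_locf(..., option = 'locf', na_remaining = 'rev')` as I understand it: fill forward, then fill any leftover leading NAs from the opposite direction. Both steps only rewrite values inside `duration`; they do not touch the other columns, so on their own they would not be expected to change which rows match in a join on those columns.

```r
# Made-up durations; negative values stand in for the invalid gaps.
duration <- c(-4, 12, -1, -3, 30, -2)

# Step 6: turn the remaining negatives into NA.
duration[duration < 0] <- NA

# Carry the last non-missing value forward; leading NAs stay NA.
carry_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1L]   # look up the latest one (NA if none yet)
}

# Step 7, approximated: fill forward, then fill what is left in reverse.
filled <- carry_forward(duration)
filled <- rev(carry_forward(rev(filled)))

filled
```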

Answer 1

I could not determine the actual cause of this problem, but I did find a workaround:

beta$duration = as.integer(0)

test2 = merge(x = beta, y = alpha, 
by = c("ID", "Q_ID", "L_Atmpt", "Q_Atmpt", "Q_Result", "Time_on_Screen", "Start_Time", "End_Time"),
all.x = TRUE)

With this, I can access and keep the duration column of data frame alpha and then use it as needed.
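
Since the cause was not pinned down, one possible way to narrow it is to pair the rows by ID alone and count, for each shared column, how often the two versions disagree. This is only a sketch under assumptions: the data frames `beta_demo` and `alpha_demo` are made up, and it assumes ID is unique within each data frame (otherwise pairing by ID alone produces extra row combinations).

```r
# Hypothetical miniature data; column names mirror the question, values are made up.
beta_demo <- data.frame(
  ID = c(1L, 2L, 3L),
  Q_Atmpt = c(0L, 1L, 0L),
  End_Time = c("11:51:33", "11:57:31", "12:03:33"),
  stringsAsFactors = FALSE
)
alpha_demo <- data.frame(
  ID = c(1L, 2L),
  Q_Atmpt = c(0L, 1L),
  End_Time = c("11:51:33", "11:57:30"),
  duration = c(173L, 55L),
  stringsAsFactors = FALSE
)

# Columns a join without `by` would match on, apart from ID itself.
shared <- setdiff(intersect(names(beta_demo), names(alpha_demo)), "ID")

# Pair rows by ID only, keeping both versions of every shared column.
paired <- merge(beta_demo, alpha_demo, by = "ID",
                suffixes = c(".beta", ".alpha"))

# Per shared column, count the paired rows whose values differ.
mismatches <- vapply(shared, function(col) {
  b <- as.character(paired[[paste0(col, ".beta")]])
  a <- as.character(paired[[paste0(col, ".alpha")]])
  sum(is.na(b) != is.na(a) | (!is.na(b) & !is.na(a) & b != a))
}, integer(1))

mismatches
```

A column with a non-zero count is a candidate for why a join on all shared columns fails to match; comparing `sapply(beta_demo[shared], class)` with the same call on `alpha_demo` can additionally reveal type differences.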
