Imputing missing values in R using values from other rows of the same column

Question

I have the table above. I want to fill in the missing values under "TransactionID". The rule for filling them in is as follows:

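User ID "kenn1" has two missing transaction IDs, which can be filled using the other two transaction IDs, t1 and t4.
To choose between t1 and t4, I look at the event time. The first missing value occurs at 9:30, which is 30 minutes from t1 and 20 minutes from t4. Since t4 is closer to this missing value, it is filled with t4. Similarly, the missing value in row 4 is 45 minutes from t1 and 5 minutes from t4, so it is also replaced with t4.
The same approach applies to the missing values for user ID "kenn2".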
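
The distances quoted for kenn1 can be checked with a few lines of base R. This is only a sketch of the arithmetic: the timestamps 9:00 for t1 and 9:50 for t4 are inferred from the gaps stated above, and the date, the UTC time zone, and the variable names are arbitrary assumptions.

```r
# Known transaction IDs and their event times (inferred from the stated gaps)
ids   <- c("t1", "t4")
times <- as.POSIXct(c("2017-05-20 09:00", "2017-05-20 09:50"), tz = "UTC")

# Event times of the two rows with a missing transaction ID
missing_at <- as.POSIXct(c("2017-05-20 09:30", "2017-05-20 09:45"), tz = "UTC")

# Absolute gap in minutes: one row per missing event, one column per known ID
gaps <- abs(outer(as.numeric(missing_at), as.numeric(times), "-")) / 60
dimnames(gaps) <- list(c("09:30", "09:45"), ids)
print(gaps)

# For each missing event, pick the known ID with the smallest gap
fill <- ids[apply(gaps, 1, which.min)]
print(fill)
```

Both missing rows come out as t4, matching the reasoning above.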

How can I do this in R?

Answer 1

There may be a better solution, but I wrote this one using data.table:

library(data.table)
library(zoo)  # provides na.locf()

# Create the data.table; you could also use read.csv, read.xlsx, etc.
raw <- data.table(Event = paste0("e", 1:10),
                TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                UserId = c(rep("kenn1",4), rep("kenn2",6)),
                EventTime = as.POSIXct(
                  c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                    "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
                )

# Known transaction IDs and the time each one occurred
transactionTimes <- raw[!is.na(TransactionID), .(TransactionID, EventTime)]
# Closest known TransactionID above (earlier) and below (later), within each user
raw[, Above := na.locf(TransactionID, na.rm = FALSE), by = UserId]
raw[, Below := na.locf(TransactionID, na.rm = FALSE, fromLast = TRUE), by = UserId]
# Attach the event time of each candidate (note that merge() re-sorts the rows)
raw <- merge(raw, transactionTimes[, .(Above = TransactionID, AboveTime = EventTime)], by = "Above", all.x = TRUE)
raw <- merge(raw, transactionTimes[, .(Below = TransactionID, BelowTime = EventTime)], by = "Below", all.x = TRUE)
# Time gap to each candidate
raw[, AboveDiff := EventTime - AboveTime]
raw[, BelowDiff := BelowTime - EventTime]
# Only one candidate exists: use it
raw[is.na(TransactionID) & is.na(AboveDiff), TransactionID := Below]
raw[is.na(TransactionID) & is.na(BelowDiff), TransactionID := Above]
# Both candidates exist: take the closer one (ties go to Above)
raw[is.na(TransactionID), TransactionID := ifelse(AboveDiff <= BelowDiff, Above, Below)]
# Restore chronological order and drop the helper columns
raw <- raw[order(EventTime), .(Event, TransactionID, UserId, EventTime)]
rm(transactionTimes)
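
The Above and Below columns are the core of this approach, so here is a small sketch of what na.locf() from the zoo package returns for kenn1's IDs. The variable names ids, above, below and leading are made up for this illustration.

```r
library(zoo)

# kenn1's transaction IDs, in event order
ids <- c("t1", NA, NA, "t4")

# Carry the last known value forward (the candidate "above" each gap)
above <- na.locf(ids, na.rm = FALSE)
# Carry the next known value backward (the candidate "below" each gap)
below <- na.locf(ids, na.rm = FALSE, fromLast = TRUE)
print(above)
print(below)

# A leading NA has nothing to carry forward; with na.rm = FALSE it stays NA,
# which is why the solution falls back to Below when AboveDiff is NA
leading <- na.locf(c(NA, "t5"), na.rm = FALSE)
print(leading)
```

With na.rm = FALSE the result keeps the same length as the input, which is what allows it to be assigned back as a column within each UserId group.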

Answer 2

Another solution using data.table.

library(data.table)
# Create the data.table; you could also use read.csv, read.xlsx, etc.
raw <- data.table(Event = paste0("e", 1:10),
                  TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                  UserId = c(rep("kenn1",4), rep("kenn2",6)),
                  EventTime = as.POSIXct(
                    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                      "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
)

# Subset the rows that have a known TransactionID (the candidate rows)
raw_notNA <- raw[!is.na(TransactionID)]
# Merge the subset with the original; each original row is repeated once per candidate row of the same user
merged <- merge(raw, raw_notNA, all.x = TRUE, by = "UserId", allow.cartesian = TRUE)
# Calculate the time difference (in seconds) between each original row and each candidate row
merged[, DiffTime := abs(as.numeric(difftime(EventTime.x, EventTime.y, units = "secs")))]
# Take the TransactionID of the closest candidate; which.min() returns a single index even on ties
merged[, NewTransactionID := TransactionID.y[which.min(DiffTime)], by = Event.x]
# Remove the duplicated rows and drop the unnecessary columns
output <- merged[, .SD[1], by = Event.x][, list(Event.x, NewTransactionID, UserId, EventTime.x)]

setnames(output, names(raw))
print(output)

Inspired by the answers to this question (your question is not a duplicate, just similar):

R - Merge data frames on matching A, B and *closest* C?
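
Along the same lines, a rolling join is another way to express "closest in time within the same user". The following is a minimal sketch rather than a tested replacement for the code above: it assumes data.table's roll = "nearest" option, uses an explicit UTC time zone for reproducibility, and the names known, KnownID and nearest are made up for illustration.

```r
library(data.table)

raw <- data.table(
  Event = paste0("e", 1:10),
  TransactionID = c("t1", NA, NA, "t4", NA, "t5", "t6", NA, NA, "t8"),
  UserId = c(rep("kenn1", 4), rep("kenn2", 6)),
  EventTime = as.POSIXct(
    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50",
      "2017-05-20 10:01", "2017-05-20 10:02", "2017-05-20 10:03",
      "2017-05-20 10:04", "2017-05-20 10:05", "2017-05-20 10:06"),
    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Rows with a known TransactionID serve as the lookup table
known <- raw[!is.na(TransactionID),
             .(UserId, EventTime, KnownID = TransactionID)]

# For every row of raw, find the known row with the same UserId whose
# EventTime is nearest; rows that already have an ID match themselves exactly
nearest <- known[raw, on = .(UserId, EventTime), roll = "nearest"]

# The join result has one row per row of raw, in raw's order,
# because (UserId, EventTime) is unique in known
raw[, TransactionID := nearest$KnownID]
print(raw)
```

This avoids building the full cartesian product per user, which may matter when a user has many events.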
