
Imputing Missing Values Using values from a different Row in the same Column in R

[Image: the input table]

I have the table above. I want to fill in the missing values under "Transaction ID". The algorithm for filling them in is as follows:

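  1. User ID "kenn1" has two missing transaction IDs, which can be filled using the other two transaction IDs, t1 and t4.

  2. To choose between t1 and t4, I look at the event time. The first missing value occurs at 9:30, which is 30 minutes from t1 and 20 minutes from t4. Since t4 is closer to this missing value, it is filled in as t4. Similarly, the missing value in row 4 is 45 minutes from t1 and 5 minutes from t4, so it is replaced with t4.

  3. The same approach applies to the missing values for user ID "kenn2".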

How can I do this in R?
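To make the rule in steps 1 to 3 concrete, here is a minimal base R sketch for user "kenn1" only. It illustrates the arithmetic rather than a full solution. The helper name `nearest_id` is hypothetical, and the calendar date is an assumption (the steps above only give times of day).

```r
# Known transactions for user "kenn1", as described in the steps above.
# The calendar date is an assumption; only the time of day matters here.
known_ids   <- c("t1", "t4")
known_times <- as.POSIXct(c("2017-05-20 09:00", "2017-05-20 09:50"), tz = "UTC")

# Event times of the two rows whose transaction ID is missing.
missing_times <- as.POSIXct(c("2017-05-20 09:30", "2017-05-20 09:45"), tz = "UTC")

# Hypothetical helper: return the known ID whose event time is closest to t.
nearest_id <- function(t, ids, times) {
  gaps <- abs(as.numeric(difftime(times, t, units = "mins")))
  ids[which.min(gaps)]
}

# Apply the rule to each missing row in turn.
filled <- vapply(
  seq_along(missing_times),
  function(i) nearest_id(missing_times[i], known_ids, known_times),
  character(1)
)
print(filled)
```

Both missing rows resolve to t4, matching the reasoning in step 2. Note that `which.min()` returns the first minimum, so in this sketch a tie would go to the earlier transaction.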

There may be a better solution, but I wrote this one using data.table:

library(data.table)
library(zoo)  # provides na.locf()

# Create the data table; in practice you could use read.csv, read.xlsx, etc.
raw <- data.table(Event = paste0("e", 1:10),
                TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                UserId = c(rep("kenn1",4), rep("kenn2",6)),
                EventTime = as.POSIXct(
                  c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                    "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
                )

# Event times of the rows that already have a TransactionID
transactionTimes <- raw[!is.na(TransactionID), .(TransactionID, EventTime)]
# Last known ID above and next known ID below each row, per user
# (this relies on the rows being sorted by EventTime within each user)
raw[, Above := na.locf(TransactionID, na.rm = FALSE), UserId]
raw[, Below := na.locf(TransactionID, na.rm = FALSE, fromLast = TRUE), UserId]
# Look up the event time of each candidate
raw <- merge(raw, transactionTimes[, .(Above = TransactionID, AboveTime = EventTime)], by="Above", all.x = TRUE)
raw <- merge(raw, transactionTimes[, .(Below = TransactionID, BelowTime = EventTime)], by="Below", all.x = TRUE)
# Time gap to each candidate
raw[, AboveDiff := EventTime - AboveTime]
raw[, BelowDiff := BelowTime - EventTime]
# If only one candidate exists, use it
raw[is.na(TransactionID) & is.na(AboveDiff), TransactionID := Below]
raw[is.na(TransactionID) & is.na(BelowDiff), TransactionID := Above]
# Otherwise take the closer candidate (ties go to Above)
raw[is.na(TransactionID), TransactionID := ifelse(AboveDiff <= BelowDiff, Above, Below)]
# Drop the helper columns
raw <- raw[, .(Event, TransactionID, UserId, EventTime)]
rm(transactionTimes)

Another solution using data.table.

library(data.table)
# Create the data table; in practice you could use read.csv, read.xlsx, etc.
raw <- data.table(Event = paste0("e", 1:10),
                  TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                  UserId = c(rep("kenn1",4), rep("kenn2",6)),
                  EventTime = as.POSIXct(
                    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                      "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
)

# subset the rows that already have a TransactionID (the candidate rows)
raw_notNA <- raw[!is.na(TransactionID)]
# merge the subset with the original data (this duplicates each original row once per candidate row of the same user)
merged <- merge(raw, raw_notNA, all.x = TRUE, by = "UserId", allow.cartesian = TRUE)
# calculate the time difference between original and candidate rows
merged[, DiffTime := abs(EventTime.x - EventTime.y)]
# take the new Transaction ID from the closest event
# (first() keeps a single ID if two candidates are equally close)
merged[, NewTransactionID := first(TransactionID.y[DiffTime == min(DiffTime)]), by = Event.x]
# remove the duplicated rows and drop the unnecessary columns
output <- merged[, .SD[1], by = Event.x][, list(Event.x, NewTransactionID, UserId, EventTime.x)]

names(output) <- names(raw)
print(output)

Inspired by the answer to this question (your question is not a duplicate, just similar):

R - merge dataframes on matching A, B and *closest* C?
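
The merge-based solution above pairs every event with every candidate row of the same user, which can become large for users with many events. As a sketch of an alternative that is not part of the original answers, data.table's rolling join with `roll = "nearest"` matches each event directly to the closest known transaction of the same user. The names `known`, `NearestID`, and `joined` are chosen for illustration, and the times are written zero-padded with an explicit UTC time zone as an assumption.

```r
library(data.table)

# Same sample data as in the answers above.
raw <- data.table(
  Event = paste0("e", 1:10),
  TransactionID = c("t1", NA, NA, "t4", NA, "t5", "t6", NA, NA, "t8"),
  UserId = c(rep("kenn1", 4), rep("kenn2", 6)),
  EventTime = as.POSIXct(
    c("2017-05-20 09:00", "2017-05-20 09:30", "2017-05-20 09:45",
      "2017-05-20 09:50", "2017-05-20 10:01", "2017-05-20 10:02",
      "2017-05-20 10:03", "2017-05-20 10:04", "2017-05-20 10:05",
      "2017-05-20 10:06"),
    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Lookup table of events that already have a transaction ID.
known <- raw[!is.na(TransactionID),
             .(UserId, EventTime, NearestID = TransactionID)]

# Match UserId exactly and roll EventTime to the nearest known event.
# The result has one row per row of raw, in raw's order.
joined <- known[raw, on = .(UserId, EventTime), roll = "nearest"]

# Rows that already had an ID match themselves exactly, so NearestID
# can replace the TransactionID column as a whole.
output <- joined[, .(Event, TransactionID = NearestID, UserId, EventTime)]
print(output)
```

This sketch assumes each user has at most one known transaction per event time; how exact ties in distance are resolved is left to the rolling join and is not exercised by the sample data.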

