Find the shortest time difference between two data frames
Suppose I have two data frames:
df1
id time1
1 2016-04-07 21:39:10
1 2016-04-05 11:19:17
2 2016-04-03 10:58:25
2 2016-04-02 21:39:10
df2
id time2
1 2016-04-07 21:39:11
1 2016-04-05 11:19:18
1 2016-04-06 21:39:11
1 2016-04-04 11:19:18
2 2016-04-03 10:58:26
2 2016-04-02 21:39:11
2 2016-04-04 10:58:26
2 2016-04-05 21:39:11
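For each entry in df1, I want to find the shortest time difference in df2. Take the first entry: it has id 1, so I want to go through df2, filter on id 1, compute the time difference between that one df1 entry and each of the df2 entries that remain, find the smallest difference, and get the corresponding entry. My sample output should be: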
id time1 time2 diff (in secs)
1 2016-04-07 21:39:10 2016-04-07 21:39:11 1
1 2016-04-05 11:19:17 2016-04-05 11:19:18 1
2 2016-04-03 10:58:25 2016-04-03 10:58:26 1
2 2016-04-02 21:39:10 2016-04-02 21:39:11 1
Here is my attempt:
for(i in unique(df1$id)){
temp1 = df1[df1$id == i,]
temp2 = df2[df2$id == i,]
for(j in unique(df1$time1){
for(k in unique(df2$time2){
diff = abs(df1$time1[j] - df2$time2[k]
print(diff)}}}
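I could not get any further than this and ran into a lot of errors. Can anyone help me correct it, and perhaps suggest a more efficient way to do this? Any help would be appreciated.
Update:
Reproducible data: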
df1 <- data.frame(
id = c(1,1,2,2),
time1 = c('2016-04-07 21:39:10', '2016-04-05 11:19:17', '2016-04-03 10:58:25', '2016-04-02 21:39:10')
)
df2 <- data.frame(
id = c(1,1,1,1,2,2,2,2),
time2 = c('2016-04-07 21:39:11', '2016-04-05 11:19:18','2016-04-07 21:39:11', '2016-04-05 11:19:18', '2016-04-03 10:58:26', '2016-04-02 21:39:11','2016-04-03 10:58:26', '2016-04-02 21:39:11')
)
df1$time1 = as.POSIXct(df1$time1)
df2$time2 = as.POSIXct(df2$time2)
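For reference, the loop attempt above is missing closing parentheses on the inner `for` lines and on the `abs(` call, iterates over time values instead of row indices, and never uses `temp1` or `temp2`. The following is only a minimal sketch of how the same loop idea could be made to run; it uses the df2 values from the first table in the question, and the names `nearest`, `candidates`, `gaps` and `result` as well as the `tz = "UTC"` setting are assumptions added for illustration. It also assumes every id in df1 occurs at least once in df2.

```r
# Data from the question (tz = "UTC" is an assumption to keep the example deterministic)
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
df2 <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-06 21:39:11", "2016-04-04 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-04 10:58:26", "2016-04-05 21:39:11"), tz = "UTC")
)

# For every row of df1, store the row number of the closest df2 entry with the same id
nearest <- integer(nrow(df1))
for (i in seq_len(nrow(df1))) {
  # Rows of df2 that share the id of the current df1 row
  candidates <- which(df2$id == df1$id[i])
  # Absolute gap in seconds between the current time1 and each candidate time2
  gaps <- abs(as.numeric(difftime(df2$time2[candidates], df1$time1[i], units = "secs")))
  # Keep the candidate with the smallest gap (the first one if there are ties)
  nearest[i] <- candidates[which.min(gaps)]
}

result <- data.frame(
  id = df1$id,
  time1 = df1$time1,
  time2 = df2$time2[nearest],
  diff_secs = abs(as.numeric(difftime(df2$time2[nearest], df1$time1, units = "secs")))
)
print(result)
```

A single loop over the rows of df1 is enough here, because the inner search over df2 can be handled by `which.min()`.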
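You can achieve this with dplyr. The basic idea is that, since we want to produce one result per entry of df1, we first assign a new id to each row of df1 (which I call rowname in this example). After that, we join the two data frames on id and filter the joined rows by the minimum absolute difference.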
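library(dplyr)
# Make sure both time columns are POSIXct
df1$time1 <- as.POSIXct(as.character(df1$time1))
df2$time2 <- as.POSIXct(as.character(df2$time2))
df1 %>%
  # One id per row of df1 (add_rownames() is deprecated in newer dplyr;
  # tibble::rownames_to_column() is its replacement)
  add_rownames("rowname") %>%
  # Pair each df1 row with every df2 row that shares its id
  left_join(df2, "id") %>%
  mutate(diff=time2-time1) %>%
  group_by(rowname) %>%
  # Keep the closest match(es) for each df1 row
  filter(min(abs(diff)) == abs(diff)) %>%
  # Drop identical rows caused by duplicate df2 entries
  distinct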
Here is my output:
Source: local data frame [4 x 5]
Groups: rowname [4]
rowname id time1 time2 diff
(chr) (dbl) (time) (time) (dfft)
1 1 1 2016-04-07 21:39:10 2016-04-07 21:39:11 1 secs
2 2 1 2016-04-05 11:19:17 2016-04-05 11:19:18 1 secs
3 3 2 2016-04-03 10:58:25 2016-04-03 10:58:26 1 secs
4 4 2 2016-04-02 21:39:10 2016-04-02 21:39:11 1 secs
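With a more recent dplyr (1.0 or later), the same join-and-filter idea can be written without `add_rownames()`. This is only a sketch under that version assumption; the column names `row` and `diff_secs` and the `tz = "UTC"` setting are made up for illustration. Recent dplyr versions may warn about a many-to-many join here, which is expected because each id occurs several times in both tables.

```r
library(dplyr)

# Reproducible data from the question (tz = "UTC" is an assumption)
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
df2 <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11"), tz = "UTC")
)

closest <- df1 %>%
  mutate(row = row_number()) %>%          # identify each df1 row
  left_join(df2, by = "id") %>%           # all df2 candidates with the same id
  mutate(diff_secs = abs(as.numeric(difftime(time2, time1, units = "secs")))) %>%
  group_by(row) %>%
  # Keep exactly one closest candidate per df1 row, even when df2 has duplicates
  slice_min(diff_secs, n = 1, with_ties = FALSE) %>%
  ungroup()

print(closest)
```

Using `with_ties = FALSE` plays the role of the final `distinct` step: duplicate df2 entries no longer produce more than one row per df1 entry.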
You can also do this in base R. To generate random dates (useful), I borrowed and edited a nice function from elsewhere on StackOverflow:
latemail <- function(N, st="2011/01/01", et="2016/12/31") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,units="secs"))
ev <- sort(runif(N, 0, dt))
return(st + ev)
}
df1 <- data.frame(id=c(1,1,2,2), time1=latemail(4))
df2 <- data.frame(id=c(rep(1,4), rep(2,4)), time2=latemail(8))
Your answer then takes just two lines:
shortest <- sapply(df1$time1, function(x) which(abs(df2$time2 - x) == min(abs(df2$time2 - x))))
cbind(df1, df2[shortest,])
Output:
id time1 id time2
1 2011-10-08 02:00:21 1 2011-08-17 18:07:47
1 2012-05-06 17:49:03 1 2012-09-04 19:52:40
2 2013-10-29 13:14:51 1 2012-10-29 20:09:31
2 2016-06-17 19:23:43 2 2015-11-24 02:07:15
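Note that the two-line version above searches all of df2 and does not restrict matches to the same id, which is why the third output row pairs id 2 with id 1. If the match has to stay within an id, as the question asks, one possible base R sketch is to merge on id first and then keep the smallest gap per df1 row. The names `row`, `pairs`, `diff_secs` and `closest` are made up for illustration, and fixed dates with `tz = "UTC"` are used instead of random ones so that the result is reproducible.

```r
# Fixed data from the question instead of random dates (tz = "UTC" is an assumption)
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
df2 <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-06 21:39:11", "2016-04-04 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-04 10:58:26", "2016-04-05 21:39:11"), tz = "UTC")
)

# Tag each df1 row so it can be identified after the merge
df1$row <- seq_len(nrow(df1))

# All (df1 row, df2 row) pairs that share an id
pairs <- merge(df1, df2, by = "id")
pairs$diff_secs <- abs(as.numeric(difftime(pairs$time2, pairs$time1, units = "secs")))

# Sort so the smallest gap comes first within each df1 row, then keep that first pair
pairs <- pairs[order(pairs$row, pairs$diff_secs), ]
closest <- pairs[!duplicated(pairs$row), ]
print(closest)
```

The merge builds every same-id pair, so memory use grows with the product of the group sizes; for small groups like these that is not a concern.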
If you use data.table:
library(data.table)
df1 <- data.table(
id = c(1,1,2,2),
time1 = c('2016-04-07 21:39:10', '2016-04-05 11:19:17', '2016-04-03 10:58:25', '2016-04-02 21:39:10')
)
df2 <- data.table(
id = c(1,1,1,1,2,2,2,2),
time2 = c('2016-04-07 21:39:11', '2016-04-05 11:19:18','2016-04-07 21:39:11', '2016-04-05 11:19:18', '2016-04-03 10:58:26', '2016-04-02 21:39:11','2016-04-03 10:58:26', '2016-04-02 21:39:11')
)
df1$time1 = as.POSIXct(df1$time1)
df2$time2 = as.POSIXct(df2$time2)
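# Join on id: each df2 row is paired with every df1 row that shares its id
res <- df1[df2, .(time1, time2), by = .EACHI, on = "id"][, diff := abs(time2 - time1)]
# Sort by id, time1 and diff so the closest match comes first within each (id, time1) group
setkey(res, id, time1, diff)
# Number the rows within each group and keep only the first (closest) one
res <- res[, row := seq_along(.I), by = .(id, time1)][row == 1]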
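data.table also supports rolling joins, which can find the nearest value directly instead of building every pair first. The following is a hedged sketch of that alternative technique, not the approach used in the answer above. The names `dt1`, `dt2`, `matched_time2` and `diff_secs` are made up for illustration, `tz = "UTC"` is an assumption, and the df2 values are taken from the first table in the question so that all times within an id are distinct.

```r
library(data.table)

dt1 <- data.table(
  id = c(1, 1, 2, 2),
  time1 = as.POSIXct(c("2016-04-07 21:39:10", "2016-04-05 11:19:17",
                       "2016-04-03 10:58:25", "2016-04-02 21:39:10"), tz = "UTC")
)
dt2 <- data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  time2 = as.POSIXct(c("2016-04-07 21:39:11", "2016-04-05 11:19:18",
                       "2016-04-06 21:39:11", "2016-04-04 11:19:18",
                       "2016-04-03 10:58:26", "2016-04-02 21:39:11",
                       "2016-04-04 10:58:26", "2016-04-05 21:39:11"), tz = "UTC")
)

# Keep a copy of time2: after the join, the join column holds the values from dt1
dt2[, matched_time2 := time2]

# For each row of dt1, pick the dt2 row with the same id whose time2 is nearest to time1
res <- dt2[dt1, on = .(id, time2 = time1), roll = "nearest"]

# The join column now contains dt1's time1 values, so rename it accordingly
setnames(res, "time2", "time1")
res[, diff_secs := abs(as.numeric(difftime(matched_time2, time1, units = "secs")))]
print(res)
```

The rolling join applies to the last column listed in `on`, so id is matched exactly and only the time is matched to its nearest neighbour.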