R - Assign column values based on the closest match in a second data frame

Question

I have two data frames, logger and df (times are numeric):

logger <- data.frame(
time = c(1280248354:1280248413),
temp = runif(60,min=18,max=24.5)
)

df <- data.frame(
obs = c(1:10),
time = runif(10,min=1280248354,max=1280248413),
temp = NA
)

I would like to search logger$time for the closest match to each row in df$time, and assign the associated logger$temp to df$temp. So far, I have been successful using the following loop:

for (i in 1:length(df$time)){
closestto<-which.min(abs((logger$time) - (df$time[i])))
df$temp[i]<-logger$temp[closestto]
}

However, I now have large data frames (logger has 13,620 rows and df has 266,138) and processing times are long. I've read that loops are not the most efficient way to do things, but I am unfamiliar with the alternatives. Is there a faster way to do this?

Answer 1

I'd use data.table for this. It makes joining on keys super easy and super fast. There is even a really helpful roll = "nearest" argument for exactly the behaviour you are looking for. It matters for your example data too: df$time is drawn with runif(), so its values will generally fall between the integer times in logger rather than match them exactly. In the following example I renamed df$time to df$time1 to make it clear which column belongs to which table.

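#  Load package
library( data.table )

#  Rename df$time to time1 so the two time columns are easy to tell apart
names( df )[ names( df ) == "time" ] <- "time1"

#  Make data.frames into data.tables with a key column
ldt <- data.table( logger , key = "time" )
dt <- data.table( df , key = "time1" )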

#  Join based on the key column of the two tables (time & time1)
#  roll = "nearest" gives the desired behaviour
#  list( obs , time1 , temp ) gives the columns you want to return from dt
ldt[ dt , list( obs , time1 , temp ) , roll = "nearest" ]
#          time obs      time1     temp
# 1: 1280248361   8 1280248361 18.07644
# 2: 1280248366   4 1280248366 21.88957
# 3: 1280248370   3 1280248370 19.09015
# 4: 1280248376   5 1280248376 22.39770
# 5: 1280248381   6 1280248381 24.12758
# 6: 1280248383  10 1280248383 22.70919
# 7: 1280248385   1 1280248385 18.78183
# 8: 1280248389   2 1280248389 18.17874
# 9: 1280248393   9 1280248393 18.03098
#10: 1280248403   7 1280248403 22.74372
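
To see what roll = "nearest" is doing, here is a minimal sketch on a tiny made-up pair of tables (the values below are hypothetical, not the question's data), small enough that each match can be checked by eye. It follows the same keyed-join pattern as above.

```r
library(data.table)

# Hypothetical toy data: three logger readings and four observation times
ldt <- data.table(time = c(10, 20, 30),
                  temp = c(18.5, 20.0, 23.5),
                  key = "time")
dt <- data.table(obs = 1:4,
                 time1 = c(11, 19, 26, 29),
                 key = "time1")

# For each row of dt, take the ldt row whose time is closest to time1:
# 11 is closest to 10, 19 to 20, 26 to 30 and 29 to 30
res <- ldt[dt, roll = "nearest"]
print(res)
```

Each row of res carries the temp of the nearest logger time, so the third and fourth observations both pick up the reading taken at time 30.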

Answer 2

You could use the data.table library. It is also more efficient with large data:

library(data.table)

logger <- data.frame(
  time = c(1280248354:1280248413),
  temp = runif(60,min=18,max=24.5)
)

df <- data.frame(
  obs = c(1:10),
  time = runif(10,min=1280248354,max=1280248413)
)

logger <- data.table(logger)
df <- data.table(df)

setkey(df,time)
setkey(logger,time)

df2 <- logger[df, roll = "nearest"]

Output:

> df2
          time     temp obs
 1: 1280248356 22.81437   7
 2: 1280248360 24.08711  10
 3: 1280248366 22.31738   2
 4: 1280248367 18.61222   5
 5: 1280248388 19.46300   4
 6: 1280248393 18.26535   6
 7: 1280248400 20.61901   9
 8: 1280248402 21.92584   1
 9: 1280248410 19.36526   8
10: 1280248410 19.36526   3
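
If you want to convince yourself that the rolling join gives the same answer as the original loop, a rough check along these lines should work. This is only a sketch: set.seed() and the as.numeric() conversion of logger$time are my additions (so the run is reproducible and both key columns have the same type), and the final step writes the matched values back into df$temp as the question asked.

```r
library(data.table)
set.seed(42)  # assumption: seed added only to make the run reproducible

logger <- data.frame(
  time = as.numeric(1280248354:1280248413),  # numeric, to match df$time
  temp = runif(60, min = 18, max = 24.5)
)
df <- data.frame(
  obs = 1:10,
  time = runif(10, min = 1280248354, max = 1280248413)
)

# Reference result: the loop from the question
loop_temp <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  closestto <- which.min(abs(logger$time - df$time[i]))
  loop_temp[i] <- logger$temp[closestto]
}

# Rolling join on the keyed tables, as in the answers
ldt <- data.table(logger, key = "time")
dt <- data.table(df, key = "time")
joined <- ldt[dt, roll = "nearest"]

# The join comes back ordered by time, so restore the original obs order
joined <- joined[order(obs)]

# Compare the two results and write the matched temperatures back to df
same <- isTRUE(all.equal(joined$temp, loop_temp))
df$temp <- joined$temp
print(same)
```

The two approaches could in principle differ when an observation lies exactly halfway between two logger times, which is not expected to happen with random times like these.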

Answer 1: score 5, accepted, 2013-11-13 15:44:53

Answer 2: score 1, 2013-11-13 15:44:34
