R - Assign column value based on closest match in second data frame
I have two data frames, logger and df (times are numeric):
logger <- data.frame(
time = c(1280248354:1280248413),
temp = runif(60,min=18,max=24.5)
)
df <- data.frame(
obs = c(1:10),
time = runif(10,min=1280248354,max=1280248413),
temp = NA
)
I would like to search logger$time for the closest match to each row in df$time, and assign the associated logger$temp to df$temp. So far, I have been successful using the following loop:
for (i in 1:length(df$time)){
closestto<-which.min(abs((logger$time) - (df$time[i])))
df$temp[i]<-logger$temp[closestto]
}
However, I now have large data frames (logger has 13,620 rows and df has 266,138) and processing times are long. I've read that loops are not the most efficient way to do things, but I am unfamiliar with alternatives. Is there a faster way to do this?
I'd use data.table for this. It makes joining on keys super easy and super fast. There is even a really helpful roll = "nearest" argument for exactly the behaviour you are looking for (except that in your example data it is not necessary, because all times from df appear in logger). In the following example I renamed df$time to df$time1 to make it clear which column belongs to which table...
# Load package
library( data.table )
# Rename df$time to df$time1 (the rename mentioned in the text)
names( df )[ names( df ) == "time" ] <- "time1"
# Make data.frames into data.tables with a key column
ldt <- data.table( logger , key = "time" )
dt <- data.table( df , key = "time1" )
# Join based on the key column of the two tables (time & time1)
# roll = "nearest" gives the desired behaviour
# list( obs , time1 , temp ) gives the columns you want to return from dt
ldt[ dt , list( obs , time1 , temp ) , roll = "nearest" ]
# time obs time1 temp
# 1: 1280248361 8 1280248361 18.07644
# 2: 1280248366 4 1280248366 21.88957
# 3: 1280248370 3 1280248370 19.09015
# 4: 1280248376 5 1280248376 22.39770
# 5: 1280248381 6 1280248381 24.12758
# 6: 1280248383 10 1280248383 22.70919
# 7: 1280248385 1 1280248385 18.78183
# 8: 1280248389 2 1280248389 18.17874
# 9: 1280248393 9 1280248393 18.03098
#10: 1280248403 7 1280248403 22.74372
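One way to gain confidence in the rolling join is to compare it against the loop from the question on a small sample. The following is a minimal sketch, not the answerer's code: the fixed seed, the choice to store logger$time as numeric (so both join columns share a type), and the use of on = instead of keys are all assumptions made for illustration.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

# Same shape as the question's data; time is stored as numeric (an assumption)
# so that both join columns have the same type.
logger <- data.frame(
  time = as.numeric(1280248354:1280248413),
  temp = runif(60, min = 18, max = 24.5)
)
df <- data.frame(
  obs  = 1:10,
  time = runif(10, min = 1280248354, max = 1280248413)
)

# Reference result: the loop from the question
loop_temp <- rep(NA_real_, nrow(df))
for (i in seq_len(nrow(df))) {
  closestto <- which.min(abs(logger$time - df$time[i]))
  loop_temp[i] <- logger$temp[closestto]
}

# Rolling join: one result row per row of dt, returned in dt's row order
ldt <- as.data.table(logger)
dt  <- as.data.table(df)
joined <- ldt[dt, on = "time", roll = "nearest"]

# Write the matched temperatures back, as the question asked
df$temp <- joined$temp
stopifnot(isTRUE(all.equal(df$temp, loop_temp)))
```

Because the join is driven by on = rather than by keys, df is not reordered, so the matched temperatures can be assigned straight back to df$temp.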
You could use the data.table library. This will also help with being more efficient with large data sizes -
library(data.table)
logger <- data.frame(
time = c(1280248354:1280248413),
temp = runif(60,min=18,max=24.5)
)
df <- data.frame(
obs = c(1:10),
time = runif(10,min=1280248354,max=1280248413)
)
logger <- data.table(logger)
df <- data.table(df)
setkey(df,time)
setkey(logger,time)
df2 <- logger[df, roll = "nearest"]
Output -
> df2
time temp obs
1: 1280248356 22.81437 7
2: 1280248360 24.08711 10
3: 1280248366 22.31738 2
4: 1280248367 18.61222 5
5: 1280248388 19.46300 4
6: 1280248393 18.26535 6
7: 1280248400 20.61901 9
8: 1280248402 21.92584 1
9: 1280248410 19.36526 8
10: 1280248410 19.36526 3
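If adding a package is not an option, the same nearest-match lookup can probably be vectorised in base R with findInterval. This is a different technique from the data.table answers above, and nearest_index is a hypothetical helper name introduced only for this sketch. It assumes logger$time is sorted in increasing order, as it is in the example data.

```r
# Hypothetical helper (not from the answers above): for each value in x,
# return the index of the nearest element of `sorted`, which must be increasing.
nearest_index <- function(x, sorted) {
  lo <- findInterval(x, sorted)        # last position with sorted <= x (0 if none)
  lo <- pmax(lo, 1L)                   # clamp values that fall below the range
  hi <- pmin(lo + 1L, length(sorted))  # neighbour above, clamped at the end
  # Pick the upper neighbour only when it is strictly closer; ties go to the
  # lower one, which matches which.min() returning the first minimum.
  ifelse(abs(sorted[hi] - x) < abs(x - sorted[lo]), hi, lo)
}

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
logger <- data.frame(
  time = as.numeric(1280248354:1280248413),
  temp = runif(60, min = 18, max = 24.5)
)
df <- data.frame(
  obs  = 1:10,
  time = runif(10, min = 1280248354, max = 1280248413)
)

# One vectorised lookup instead of one which.min() call per row
df$temp <- logger$temp[nearest_index(df$time, logger$time)]
```

The loop in the question scans all of logger for every row of df, whereas findInterval does a binary search per row, so this should scale much better to the 266,138-row case; timings are not measured here.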