Speed optimization - calculating weighted column in data.table with distance matrix
I am trying to apply weights to a numeric vector in a data.table. The weights come from the Euclidean distances between each point and all the other points. The closer two points are, the higher the weight assigned to them; if the distance between two points is greater than a threshold, the weight is 0; and the weight assigned to the distance between a point and itself is 10000.
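Written out, the weighting scheme described above appears to be the following. The symbols are notation introduced here for illustration: d_ij is the distance between points i and j, τ is the threshold, and p_j is the value in the points column.

```latex
w_{ij} =
\begin{cases}
10000 & \text{if } d_{ij} = 0, \\
1 / d_{ij}^{2} & \text{if } 0 < d_{ij} \le \tau, \\
0 & \text{if } d_{ij} > \tau,
\end{cases}
\qquad
\tilde{p}_i = \frac{\sum_{j} w_{ij}\, p_j}{\sum_{j} w_{ij}}
```

Note that the first case applies to any pair of points at distance 0, so two distinct rows with identical coordinates would also receive the weight 10000.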
I can illustrate with the code below:
library(data.table)
library(dplyr)
library(tictoc)
set.seed(42)
df <- data.table(
  LAT = rnorm(500, 42),
  LONG = rnorm(500, -72),
  points = rnorm(500)
)
df2 <- copy(df) # for new solution
d <- as.matrix(dist(df[, .(LAT, LONG)])) # compute distance matrix
# function to calculate the weights
func <- function(j, cols, threshold) {
  N <- which(d[j, ] <= threshold) # find points whose distances are within the threshold
  K <- (1 / (d[j, N] ^ 2)) # calculate weights, which are inversely proportional to squared distance: the lower the distance, the higher the weight
  K[which(d[j, N] == 0)] <- 10000 # weight to itself is 10000
  return((K %*% as.matrix(df[N, ..cols])) / sum(K)) # compute weighted point for 1 row
}
tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result] # assign the results back to data.table
toc()
The current function works well for a small number of points, but it takes much longer to compute weighted points for about 220K rows.
I have come up with another solution that cuts the time in half, but I think it can still be improved.
d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)
tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
  x == 0 ~ 10000,
  x <= 0.5 ~ 1 / (x^2),
  TRUE ~ 0)), .SDcols = cols]
# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
  set(df2, i, 'weighted_points', value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]]))
}
toc()
round(sum(df$weighted_points - df2$weighted_points), 0)
The time difference may be small for this small data set, but I have timed both versions on the real data set and the new way is considerably faster.
My question is: how can I make the new approach even faster? I know I am using case_when from dplyr, which could make things slower in exchange for readability, but are there other things I am not doing correctly in data.table that could help make it faster?
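One direction that might be worth trying is to build the whole weight matrix with vectorized assignments and then get every weighted point from a single matrix product, avoiding both case_when and the row loop. The following is only a minimal sketch in base R, assuming the same simulated data and the 0.5 threshold used above; the names W and threshold are introduced here for illustration.

```r
set.seed(42)
n <- 500
lat <- rnorm(n, 42)
long <- rnorm(n, -72)
points <- rnorm(n)
threshold <- 0.5

# dense n x n Euclidean distance matrix
d <- as.matrix(dist(cbind(lat, long)))

# weights: 1 / d^2 within the threshold, 0 beyond it, 10000 at distance 0
W <- 1 / d^2            # temporarily Inf where d == 0
W[d > threshold] <- 0
W[d == 0] <- 10000      # overwrite the Inf entries

# weighted average for every row at once
weighted_points <- as.vector(W %*% points) / unname(rowSums(W))
```

This still materializes the dense n x n matrix, whose size grows quadratically with the number of rows, so it may not fit in memory for very large inputs.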
From a data analysis perspective, I think you could speed up your code by approximating what counts as a "close" point.
I once worked with NCDC station locations and tried to find the closest stations for each station; because there were so many stations, it was time-consuming. The idea I came up with was this: after computing dist on the coordinates of the points, rank the distances for each point and set a threshold on how many stations to use in the actual weight calculation.
For example, after ranking, take the 50 closest points and assign them their respective weights; all other points simply get a weight of 0.
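The ranking idea could be sketched roughly as follows. This is an illustrative base R version, not the code used for the NCDC stations: the cutoff k = 50 and the variable names are assumptions, and the inverse-squared-distance weights with 10000 for distance 0 are borrowed from the question. Because it replaces the distance threshold with a rank cutoff, its results will generally differ from the question's.

```r
set.seed(42)
n <- 500
coords <- cbind(lat = rnorm(n, 42), long = rnorm(n, -72))
points <- rnorm(n)
k <- 50  # hypothetical number of nearest points to keep per row

d <- as.matrix(dist(coords))

weighted_points_knn <- vapply(seq_len(n), function(i) {
  # indices of the k smallest distances in row i (includes point i itself)
  nearest <- order(d[i, ])[seq_len(k)]
  w <- 1 / d[i, nearest]^2
  w[d[i, nearest] == 0] <- 10000   # weight for distance 0, as in the question
  sum(w * points[nearest]) / sum(w)
}, numeric(1))
```

Each row then involves only k weights instead of n, although this sketch still computes the full distance matrix first.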
Hope this helps.