

Speed optimization - calculating weighted column in data.table with distance matrix

I am trying to apply weights to a numeric vector in a data.table. The weights come from the Euclidean distances of each point to all the other points. If a point is close to another point, the weight assigned to the pair is higher; if the distance between two points is greater than a threshold, the weight is 0; and the weight assigned to the distance between a point and itself is 10000.

I can illustrate this with the code below:

library(data.table)
library(dplyr)
library(tictoc)

set.seed(42)
df <- data.table(
    LAT = rnorm(500, 42),
    LONG = rnorm(500, -72),
    points = rnorm(500)
    )
df2 <- copy(df) # for new solution
d <- as.matrix(dist(df[, .(LAT, LONG)])) # compute distance matrix

# function to calculate the weights
func <- function(j, cols, threshold) {
    N <- which(d[j, ] <= threshold) # find points whose distances are at or below the threshold
    K <- (1 / (d[j, N] ^ 2)) # weights are inversely proportional to the squared distance: closer points get higher weights
    K[which(d[j, N] == 0)] <- 10000 # the weight of a point to itself is 10000
    return((K %*% as.matrix(df[N, ..cols])) / sum(K)) # compute the weighted point for one row
}

tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result] # assign the results back to data.table
toc()

The current function works well for a small number of points, but it takes much longer to compute the weighted points for about 220K rows.

I have come up with another solution that cuts the time in half, but I think it can still be improved.

d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)

tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
    x == 0 ~ 10000, 
    x <= 0.5 ~ 1 / (x^2), 
    TRUE ~ 0)), .SDcols = cols]

# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
    set(df2, i, 'weighted_points', value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]])) 
}
toc()

round(sum(df$weighted_points - df2$weighted_points), 0) # sanity check: both approaches should give (nearly) identical results

The time difference may be small for this small data set, but I have tested the timing on the real data set and the new way is quite a bit faster.

My question is: how can I make the new approach even faster? I know I am using case_when from dplyr, which may trade speed for readability, but are there other things I am doing incorrectly in data.table that could help make it faster?
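For reference, the same weight rule can also be written directly on the full distance matrix without case_when, by building the weight matrix with a nested ifelse and computing every row at once with a single matrix product. This is only a minimal sketch of that idea (the names W and weighted_all are illustrative), not a benchmarked replacement for the code above:

# build the full weight matrix in one pass: 10000 where the distance is 0 (a point with itself),
# inverse squared distance within the 0.5 threshold, 0 everywhere else
W <- ifelse(d == 0, 10000, ifelse(d <= 0.5, 1 / d ^ 2, 0))
# one matrix product gives every weighted point at once
weighted_all <- as.vector(W %*% df$points) / rowSums(W)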

From the data analyst side, I think you could improve your code with an approximation of what counts as a close point.

Once I worked with NCDC station locations and tried to find the closest stations for each one; because there were so many stations, it was time-consuming. I came up with the idea that after I get the dist of the coordinates of each point, I just rank the distances and set a threshold of how many stations I want to use for the real weight calculation.

For example, after ranking, take the 50 closest points (within the rank) and give them their weights respectively; all other points just get a weight of 0.
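A minimal sketch of that ranking idea, reusing the distance matrix d and the data.table df from the question; k = 50, the name weighted_knn, and the use of apply are illustrative assumptions rather than a tuned implementation:

k <- 50 # number of nearest points to keep for the weight calculation
weighted_knn <- apply(d, 1, function(row) {
    N <- order(row)[1:k]            # indices of the k closest points (includes the point itself)
    K <- 1 / row[N] ^ 2             # inverse squared distance weights
    K[row[N] == 0] <- 10000         # weight of a point to itself
    sum(K * df$points[N]) / sum(K)  # weighted point; all other points effectively get weight 0
})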

Hope this helps.
