R：大數據的區別？計算兩個矩陣之間的最小距離

Question

我有兩個矩陣，一個是200K行，另一個是20K。 對於第一個矩陣中的每一行（也就是一個點），我試圖找到第二個矩陣中哪一行（也是一個點）最接近第一個矩陣中的點。 這是我在樣本數據集上嘗試的第一種方法：

#Test dataset
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.latlon=cbind(runif(20000,min=-180, max=-120), runif(20000, min=50, max=85))
#calculate the distance matrix
library(geosphere)
dist.matrix=distm(pixels.latlon, grwl.latlon, fun=distHaversine)
#Pick out the indices of the minimum distance
rnum=apply(dist.matrix, 1, which.min)

但是，我得到一個Error: cannot allocate vector of size 30.1 Gb當我使用distm函數時， Error: cannot allocate vector of size 30.1 Gb錯誤。

關於這個主題有幾個帖子：

這個使用bigmemory來計算SAME數據幀中各點之間的距離，但我不知道如何調整它以計算兩個不同矩陣中點之間的距離... https://stevemosher.wordpress.com/2012/04/ 12 /缺口司托克斯-距離代碼現在與-大存儲器/

這個也適用於計算SAME矩陣中各點之間的距離矩陣... 用於重復距離矩陣計算的高效（記憶方式）函數和超大距離矩陣的分塊

這個與我想做的幾乎完全相同，但他們實際上沒有提出適用於大數據的解決方案： R：使用大內存的distm我試過這個方法，它使用bigmemory ，但得到一個Error in CreateFileBackedBigMatrix(as.character(backingfile), as.character(backingpath), : Problem creating filebacked matrix.錯誤，我認為因為數據幀太大了。

有沒有人想出這個問題的好方法？ 我對其他包裝的想法持開放態度！

更新了解決問題的代碼

pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.tibble = tibble(long=runif(20000,min=-180, max=-120), lat=runif(20000, min=50, max=85), id=runif(20000, min=0, max=20000))
rnum <- apply(pixels.latlon, 1, function(x) {
    xlon=x[1]
    xlat=x[2]
    grwl.filt = grwl.tibble %>% 
      filter(long < (xlon+0.3) & long >(xlon-0.3) & lat < (xlat+0.3)&lat >(xlat-.3))
    grwl.latlon.filt = cbind(grwl.filt$long, grwl.filt$lat)
    dm <- distm(x, grwl.latlon.filt, fun=distHaversine)
    rnum=apply(dm, 1, which.min)
    id = grwl.filt$id[rnum]
    return(id)
                     })

Answer 1

你可以使用這個R（cpp）函數：

#include <Rcpp.h>
using namespace Rcpp;

double compute_a(double lat1, double long1, double lat2, double long2) {

  double sin_dLat = ::sin((lat2 - lat1) / 2);
  double sin_dLon = ::sin((long2 - long1) / 2);

  return sin_dLat * sin_dLat + ::cos(lat1) * ::cos(lat2) * sin_dLon * sin_dLon;
}

int find_min(double lat1, double long1,
             const NumericVector& lat2,
             const NumericVector& long2,
             int current0) {

  int m = lat2.size();
  double lat_k, lat_min, lat_max, a, a0;
  int k, current = current0;

  a0 = compute_a(lat1, long1, lat2[current], long2[current]);
  // Search before current0
  lat_min = lat1 - 2 * ::asin(::sqrt(a0));
  for (k = current0 - 1; k >= 0; k--) {
    lat_k = lat2[k];
    if (lat_k > lat_min) {
      a = compute_a(lat1, long1, lat_k, long2[k]);
      if (a < a0) {
        a0 = a;
        current = k;
        lat_min = lat1 - 2 * ::asin(::sqrt(a0));
      }
    } else {
      // No need to search further
      break;
    }
  }
  // Search after current0
  lat_max = lat1 + 2 * ::asin(::sqrt(a0));
  for (k = current0 + 1; k < m; k++) {
    lat_k = lat2[k];
    if (lat_k < lat_max) {
      a = compute_a(lat1, long1, lat_k, long2[k]);
      if (a < a0) {
        a0 = a;
        current = k;
        lat_max = lat1 + 2 * ::asin(::sqrt(a0));
      }
    } else {
      // No need to search further
      break;
    }
  }

  return current;
} 

// [[Rcpp::export]]
IntegerVector find_closest_point(const NumericVector& lat1,
                                 const NumericVector& long1,
                                 const NumericVector& lat2,
                                 const NumericVector& long2) {

  int n = lat1.size();
  IntegerVector res(n);

  int current = 0;
  for (int i = 0; i < n; i++) {
    res[i] = current = find_min(lat1[i], long1[i], lat2, long2, current);
  }

  return res; // need +1
}


/*** R
N <- 2000  # 2e6
M <- 500   # 2e4

pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(runif(M,min=-180, max=-120), runif(M, min=50, max=85))
# grwl.latlon <- grwl.latlon[order(grwl.latlon[, 2]), ]

library(geosphere)
system.time({
  #calculate the distance matrix
  dist.matrix = distm(pixels.latlon, grwl.latlon, fun=distHaversine)
  #Pick out the indices of the minimum distance
  rnum=apply(dist.matrix, 1, which.min)
})


find_closest <- function(lat1, long1, lat2, long2) {

  toRad <- pi / 180
  lat1  <- lat1  * toRad
  long1 <- long1 * toRad
  lat2  <- lat2  * toRad
  long2 <- long2 * toRad

  ord1  <- order(lat1)
  rank1 <- match(seq_along(lat1), ord1)
  ord2  <- order(lat2)

  ind <- find_closest_point(lat1[ord1], long1[ord1], lat2[ord2], long2[ord2])

  ord2[ind + 1][rank1]
}

system.time(
  test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1], 
                       grwl.latlon[, 2], grwl.latlon[, 1])
)
all.equal(test, rnum)

N <- 2e4
M <- 2e4
pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(long = runif(M,min=-180, max=-120), lat = runif(M, min=50, max=85))
system.time(
  test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1], 
                       grwl.latlon[, 2], grwl.latlon[, 1])
)
*/

它需要0.5秒為N = 2e4和4.2秒為N = 2e5 。 我無法讓你的代碼進行比較。

Answer 2

這將使用更少的內存，因為它一次一行，而不是創建完整的距離矩陣（雖然它會更慢）

library(geosphere)
rnum <- apply(pixels.latlon, 1, function(x) {
                     dm <- distm(x, grwl.latlon, fun=distHaversine)
                     return(which.min(dm))
                     })

大部分時間都采用復雜的Haversine公式。 由於您真的只想找到最近的點，而不是確切的距離，我們可以使用更簡單的距離測量。 以下是使用基於本文的公式http://jonisalonen.com/2014/computing-distance-between-coordinates-can-be-simple-and-fast/ ，並使用二次近似余弦的公式（這本身很難計算）...

#quadratic cosine approximation using lm (run once)
qcos <- lm(y~x+I(x^2), data.frame(x=0:90, y=cos((0:90)*2*pi/360)))$coefficients
cosadj <- function(lat) qcos[1]+lat*(qcos[2]+qcos[3]*lat)

#define rough dist function
roughDist <- function(x,y){#x should be a single (lon,lat), y a (n*2) matrix of (lon,lat)
            latDev <- x[2]-y[,2]
            lonDev <- (x[1]-y[,1])*cosadj(abs(x[2]))
            return(latDev*latDev+lonDev*lonDev) #don't need the usual square root or any scaling parameters
            }

然后你可以用這個新功能取代Haversine ......

rnum <- apply(pixels.latlon, 1, function(x) {
                     dm <- distm(x, grwl.latlon, fun=roughDist)
                     return(which.min(dm))
                     })

在我的機器上，這比Haversine版本快三倍。

R：大數據的區別？計算兩個矩陣之間的最小距離

問題描述

更新了解決問題的代碼

2 個解決方案

解決方案1
7 已采納 2018-04-16 22:04:32

解決方案2
1 2018-04-16 17:51:43

R：大數據的區別？ 計算兩個矩陣之間的最小距離

問題描述

更新了解決問題的代碼

2 個解決方案

解決方案1 7 已采納 2018-04-16 22:04:32

解決方案2 1 2018-04-16 17:51:43

R：大數據的區別？計算兩個矩陣之間的最小距離

解決方案1
7 已采納 2018-04-16 22:04:32

解決方案2
1 2018-04-16 17:51:43