
R: distm with Big Memory

I am trying to use bigmemory in R to compute distance matrices for more than 100,000,000 (rough estimate) rows and 16 columns.

A small subset of the data looks like this:

list1 <- data.frame(longitude = c(80.15998, 72.89125, 77.65032, 77.60599, 
                                  72.88120, 76.65460, 72.88232, 77.49186, 
                                  72.82228, 72.88871), 
                    latitude = c(12.90524, 19.08120, 12.97238, 12.90927, 
                                 19.08225, 12.81447, 19.08241, 13.00984,
                                 18.99347, 19.07990))
list2 <- data.frame(longitude = c(72.89537, 77.65094, 73.95325, 72.96746, 
                                  77.65058, 77.66715, 77.64214, 77.58415,
                                  77.76180, 76.65460), 
                    latitude = c(19.07726, 13.03902, 18.50330, 19.16764, 
                                 12.90871, 13.01693, 13.00954, 12.92079,
                                 13.02212, 12.81447), 
                    locality = c("A", "A", "B", "B", "C", "C", "C", "D", "D", "E"))


library(geosphere)

# create distance matrix
mat <- distm(list1[, c('longitude', 'latitude')],
             list2[, c('longitude', 'latitude')],
             fun = distHaversine)

# assign the name to the point in list1 based on shortest distance in the matrix
list1$locality <- list2$locality[max.col(-mat)]
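To see why bigmemory is needed at all, it helps to estimate the size of a dense distance matrix: each cell is an 8-byte double, so memory grows quadratically with the number of rows. A minimal back-of-the-envelope sketch (the helper name and the row counts are illustrative, not from the question):

```r
# Approximate size in GB of a dense double-precision distance matrix:
# 8 bytes per cell. dist_matrix_gb is a hypothetical helper for illustration.
dist_matrix_gb <- function(n_rows, n_cols = n_rows) {
  n_rows * n_cols * 8 / 1e9
}

dist_matrix_gb(1e4)   # 10K x 10K   -> 0.8 GB, still manageable in RAM
dist_matrix_gb(1e5)   # 100K x 100K -> 80 GB, beyond typical RAM
```

At the scale in the question, an in-memory matrix is hopeless, which is what motivates a file-backed structure.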

How can I use bigmemory to build massive distance matrices?

Something like this works for me:

library(bigmemory)
library(foreach)

# Split 1:m into nb contiguous blocks of roughly `block.size` elements each;
# returns a matrix with one row per block and columns lower, upper, size.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

# Expand a (lower, upper) pair into the full index sequence lower:upper.
seq2 <- function(lims) {
  seq(lims[1], lims[2])
}
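As a quick sanity check (not part of the original answer), the blocking helper splits the row indices into contiguous, non-overlapping blocks; restating it here so the snippet runs standalone:

```r
# Same CutBySize as above, repeated so this snippet is self-contained.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

# 10 rows in blocks of ~3 -> 4 blocks that cover 1:10 with no gaps or overlaps
blocks <- CutBySize(10, block.size = 3)
blocks
```

Every index from 1 to m lands in exactly one block, which is what lets the loop below tile the full matrix.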

n <- nrow(list1)
# n x n distance matrix backed by a file on disk instead of held in RAM
a <- big.matrix(n, n, backingfile = "my_dist.bk",
                descriptorfile = "my_dist.desc")

intervals <- CutBySize(n, block.size = 1000)
K <- nrow(intervals)

doParallel::registerDoParallel(parallel::detectCores() / 2)
foreach(j = 1:K) %dopar% {
  ind_j <- seq2(intervals[j, ])
  # the matrix is symmetric, so compute only blocks with i >= j and mirror
  # them; both row and column blocks must come from the same data (list1)
  foreach(i = j:K) %do% {
    ind_i <- seq2(intervals[i, ])
    tmp <- distm(list1[ind_i, c('longitude', 'latitude')],
                 list1[ind_j, c('longitude', 'latitude')],
                 fun = distHaversine)
    a[ind_i, ind_j] <- tmp
    a[ind_j, ind_i] <- t(tmp)
    NULL
  }
}
doParallel::stopImplicitCluster()
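Because the matrix is file-backed, it survives the R session and can be re-attached later (or from another process) through its descriptor file, without ever loading the whole matrix into memory. A minimal sketch, assuming the `my_dist.bk` / `my_dist.desc` files created above sit in the working directory:

```r
library(bigmemory)

# Re-attach the file-backed matrix via its descriptor file; this maps the
# backing file rather than reading the full matrix into RAM.
a <- attach.big.matrix("my_dist.desc")
dim(a)        # n x n
a[1:5, 1:5]   # read just a small block on demand
```

Indexing into the attached object pulls only the requested block from disk, so downstream steps (e.g. the nearest-locality lookup) can also be done block by block.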

I repeated your list 1000 times to test with 10K rows.
