
Memory-efficient storage of large distance matrices

I have to create a data structure that stores the distance from each point to every other point in a very large array of 2D coordinates. It's easy to implement for small arrays, but beyond about 50,000 points I start running into memory issues, which is not surprising given that I'm creating an n x n matrix.
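To put a number on the memory issue, here is a rough back-of-the-envelope sketch. It assumes float64 entries (the dtype cdist returns); the helper names `dense_matrix_bytes` and `condensed_bytes` are made up for illustration.

```python
import numpy as np

def dense_matrix_bytes(n, dtype=np.float64):
    """Bytes needed for a full n x n matrix of the given dtype."""
    return n * n * np.dtype(dtype).itemsize

def condensed_bytes(n, dtype=np.float64):
    """Bytes needed to store each unordered pair once: n*(n-1)/2 entries."""
    return n * (n - 1) // 2 * np.dtype(dtype).itemsize

for n in (2_000, 50_000):
    full_gb = dense_matrix_bytes(n) / 1e9
    pairs_gb = condensed_bytes(n) / 1e9
    print(f"n={n}: full matrix {full_gb:.3f} GB, unique pairs only {pairs_gb:.3f} GB")
```

At 50,000 points the full matrix needs about 20 GB, and even storing each pair only once still needs about 10 GB, so exploiting symmetry alone is not enough at this size.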

Here's a simple example that works fine:

import numpy as np
from scipy.spatial import distance 

n = 2000
arr = np.random.rand(n,2)
d = distance.cdist(arr,arr)

cdist is fast, but it is inefficient in storage because the matrix is symmetric about the diagonal (e.g. d[i][j] == d[j][i]). I can use np.triu(d) to convert it to upper triangular form, but the resulting square matrix still takes the same amount of memory. I also don't need distances beyond a certain cutoff, which should help. The next step is to convert to a sparse matrix to save memory:

from scipy import sparse

max_dist = 5
dist = np.array([[0,1,3,6], [1,0,8,7], [3,8,0,4], [6,7,4,0]])
print(dist)

array([[0, 1, 3, 6],
       [1, 0, 8, 7],
       [3, 8, 0, 4],
       [6, 7, 4, 0]])

dist[dist>=max_dist] = 0
dist = np.triu(dist)
print(dist)

array([[0, 1, 3, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 4],
       [0, 0, 0, 0]])

sdist = sparse.lil_matrix(dist)
print(sdist)

(0, 1)        1
(2, 3)        4
(0, 2)        3

The problem is getting to that sparse matrix quickly for a very large dataset. To reiterate, making a square matrix with cdist is the fastest way I know of to calculate distances between points, but the intermediate square matrix runs out of memory. I could break it down into more manageable chunks of rows, but that slows things down a lot. I feel like I'm missing some obvious, easy way to go directly from cdist to a sparse matrix.
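For reference, the row-chunking idea mentioned above could look roughly like the sketch below. The function name `chunked_sparse_distances` and the `chunk_rows` parameter are assumptions for illustration, not part of SciPy; only a `chunk_rows` x n dense block exists in memory at any time.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import distance

def chunked_sparse_distances(points, max_dist, chunk_rows=1000):
    """Upper-triangular sparse matrix of pairwise distances below max_dist."""
    n = len(points)
    rows, cols, vals = [], [], []
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Columns before `start` cannot hold upper-triangle entries for these rows
        block = distance.cdist(points[start:stop], points[start:])
        r, c = np.nonzero(block < max_dist)
        # Keep strictly upper-triangular pairs (global row index < global column index)
        keep = (r + start) < (c + start)
        r, c = r[keep], c[keep]
        vals.append(block[r, c])
        rows.append(r + start)
        cols.append(c + start)
    return sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n),
    )

rng = np.random.default_rng(0)
pts = rng.random((3000, 2))
sd = chunked_sparse_distances(pts, max_dist=0.05, chunk_rows=500)
print(sd.shape, sd.nnz)
```

This keeps memory bounded, but it still computes every pairwise distance, which is consistent with the slowdown described above.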

Here is how to do it with a KDTree:

>>> import numpy as np
>>> from scipy import sparse
>>> from scipy.spatial import cKDTree as KDTree
>>> 
# mock data
>>> a = np.random.random((50000, 2))
>>> 
# make tree
>>> A = KDTree(a)
>>> 
# list all pairs within 0.05 of each other in 2-norm
# format: (i, j, v) - i, j are indices, v is distance
>>> D = A.sparse_distance_matrix(A, 0.05, p=2.0, output_type='ndarray')
>>> 
# only keep upper triangle
>>> DU = D[D['i'] < D['j']]
>>> 
# make sparse matrix
>>> result = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), (50000, 50000))
>>> result
<50000x50000 sparse matrix of type '<class 'numpy.float64'>'
        with 9412560 stored elements in COOrdinate format>
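As a sanity check, the same steps can be compared against brute-force cdist on an array small enough to fit in memory. This is a sketch under assumed values (2,000 points, cutoff 0.05); the variable names are illustrative. It also shows one possible way to query the result after converting it to CSR.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree, distance

rng = np.random.default_rng(0)
n, cutoff = 2000, 0.05
pts = rng.random((n, 2))

# Same steps as above, on a small array
tree = cKDTree(pts)
D = tree.sparse_distance_matrix(tree, cutoff, p=2.0, output_type='ndarray')
DU = D[D['i'] < D['j']]
upper = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), shape=(n, n))

# Brute-force reference: dense matrix, cutoff applied, strict upper triangle
full = distance.cdist(pts, pts)
reference = np.triu(np.where(full <= cutoff, full, 0.0), k=1)
matches = np.allclose(upper.toarray(), reference)
print("matches brute force:", matches)

# COO is convenient for construction; CSR supports fast row access
csr = upper.tocsr()
row = csr.getrow(0)
print("points within cutoff of point 0:", row.indices)
print("their distances:", row.data)
```

Because only the upper triangle is stored, row i lists only neighbours with index greater than i. If all neighbours of a point are needed, one option is to symmetrize with `upper + upper.T` before slicing, at the cost of doubling the stored entries.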
