How to store a distance matrix more efficiently?

I have this Python code to calculate the distances between the coordinates of different points.

IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975

The code works perfectly, but I now have a very large number of coordinates (~50000), so I need to optimise it; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestions.

#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp

df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()

# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)

df_1.to_csv("results_Spots.csv")

There are a couple of ways to save space. The first is to store only the upper triangle of your matrix and make sure that your indices always reflect that. The second is to store only the values that meet your threshold. Both can be done together by using sparse matrices, which support most of the operations you are likely to need and store only the elements you need.

To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j] like this:

def getitem(dist, i, j):
    # Only the upper triangle (i <= j) is stored, so swap the indices if needed
    if i > j:
        i, j = j, i
    return dist[i, j]
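One way to store just the upper triangle is SciPy's condensed distance format, which `scipy.spatial.distance.pdist` returns as a flat array. The sketch below is only an illustration: the helper name `condensed_distance` is hypothetical, and the four points are the first rows of the sample data in the question. Note that this only halves the storage, so for ~50000 points the condensed array is still very large.

```python
import numpy as np
from scipy.spatial.distance import pdist

# First four points from the question's sample data
coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
    [192.296, 178.648, 0.0998975],
])
n = coords.shape[0]

# pdist returns only the n*(n-1)/2 distances with i < j, as a flat array
condensed = pdist(coords)


def condensed_distance(condensed, n, i, j):
    """Look up d(i, j) in a condensed distance array (hypothetical helper)."""
    if i == j:
        return 0.0
    if i > j:
        i, j = j, i
    # Position of pair (i, j), i < j, in pdist's row-major upper-triangle order
    return condensed[n * i - i * (i + 1) // 2 + (j - i - 1)]


print(condensed_distance(condensed, n, 1, 0))
```

The index swap is the same trick as in `getitem` above; the only addition is the arithmetic that maps a pair of indices to a position in the flat array.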

scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate (COO), CSR, CSC, Diagonal, DOK, and LIL. According to the usage reference, the easiest way to construct a matrix is to use the DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it to the reader to benchmark the different options once a basic working approach has been shown. Remember to convert to CSR or CSC format before doing matrix math.

We will sacrifice speed for space efficiency by constructing one row at a time:

import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2

threshold2 = threshold**2  # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # Compute squared distances from point i to every later point (upper triangle only)
    d2 = np.sum(np.square((coords[i + 1:, :] - coords[i])), axis=1)
    # Threshold: positions (relative to i + 1) of the points that are close enough
    mask = np.flatnonzero(d2 <= threshold2)
    # Apply, only compute square root if necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])

For your toy example, we find that only four elements actually pass the threshold, which makes the storage very efficient:

>>> distances.nnz
4
>>> distances.toarray()
array([[0.        , 1.29304486, 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 1.1008038 , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.68355102, 0.        , 1.79082802],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ]])

Comparing against the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.

If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from the LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
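As a minimal sketch of that fill step, assuming a small stand-in matrix named `upper` (a made-up example, not the real `distances`) that holds only upper-triangle entries:

```python
import scipy.sparse

# Small upper-triangular sparse matrix standing in for `distances`
upper = scipy.sparse.lil_matrix((4, 4))
upper[0, 1] = 1.5
upper[2, 3] = 0.7

# Convert to CSR first; LIL is meant for incremental construction
upper_csr = upper.tocsr()

# Adding the transpose mirrors every stored entry across the diagonal
full = upper_csr + upper_csr.T

print(full.toarray())
```

Because the diagonal is never stored, no entry is counted twice by the addition.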

The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.

For example, the following would replace the matrix construction shown here with what is likely a more efficient method:

import scipy.spatial

tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
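To turn the tree query into a list of ID pairs, something like the following should work. It is a sketch under a few assumptions: it uses the sample data from the question, asks for COO output via the `output_type` argument of `sparse_distance_matrix`, and keeps only entries with row < column so that each pair appears once and self-pairs are dropped. The variable names `pairs` and `close` are mine.

```python
import numpy as np
import scipy.spatial

# Sample data from the question
ids = ['0-20', '0-21', '7-22', '0-23', '7-24', '8-25', '8-26', '7-27', '8-28']
coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
    [192.296, 178.648, 0.0998975],
    [189.421, 179.012, 0.0998975],
    [179.755, 179.347, 0.0998975],
    [180.436, 179.288, 0.0998975],
    [186.453, 179.2, 0.0998975],
    [178.899, 180.92, 0.0998975],
])
threshold = 2

tree = scipy.spatial.KDTree(coords)
# COO output exposes parallel row, col and data arrays
pairs = tree.sparse_distance_matrix(tree, threshold, output_type='coo_matrix')

# Keep each unordered pair once (row < col), which also drops self-pairs
close = sorted(
    (ids[i], ids[j], float(d))
    for i, j, d in zip(pairs.row, pairs.col, pairs.data)
    if i < j
)

for id_from, id_to, d in close:
    print(id_from, id_to, round(d, 4))
```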

Your code asks for point-to-point distances in a ~50000 x ~50000 matrix. The result will be very big if you really want to store it. The matrix is dense, as each point has a positive distance to every other point. I recommend revisiting your business requirements. Do you really need to calculate all these distances upfront and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not even much slower than reading a precalculated value.
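To put a rough number on "very big", here is a back-of-the-envelope estimate, assuming 8-byte float64 values (NumPy's default) and exactly 50000 points. It covers only the raw array in memory, not the Python lists or the CSV text built from it.

```python
n_points = 50_000
bytes_per_float64 = 8

# Full N x N matrix versus only the entries above the diagonal
dense_bytes = n_points ** 2 * bytes_per_float64
upper_bytes = n_points * (n_points - 1) // 2 * bytes_per_float64

print(f"full matrix:    {dense_bytes / 1e9:.1f} GB")   # full matrix:    20.0 GB
print(f"upper triangle: {upper_bytes / 1e9:.1f} GB")   # upper triangle: 10.0 GB
```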

EDIT (based on comment): You can filter the calculated distances by a threshold (here, for illustration, 5.0) and then look up the IDs in the DataFrame:

import numpy as np
import pandas as pd
import scipy.spatial as spsp

df_1 = pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)

# Row/column index pairs of all entries below the threshold
adj_5 = np.argwhere(distances[:] < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:,0]].values,
                 df_1['IDs'][adj_5[:,1]].values),
             columns=['from', 'to'])
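The snippet above still builds the full matrix before filtering. If that does not fit in memory, one possible variation is to compute the distances a block of rows at a time and keep only the pairs under the threshold. This is a sketch, not a tuned implementation: `close_pairs` and `chunk_size` are names I made up, and unlike the snippet above it keeps each pair only once and drops the self-pairs.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def close_pairs(df, threshold, chunk_size=1000):
    """Return ID pairs closer than `threshold` (hypothetical helper)."""
    coords = df[['X', 'Y', 'Z']].to_numpy()
    ids = df['IDs'].to_numpy()
    frames = []
    for start in range(0, len(coords), chunk_size):
        # Distances from this block of rows to every point: at most chunk_size x N
        block = cdist(coords[start:start + chunk_size], coords)
        rows, cols = np.nonzero(block < threshold)
        rows = rows + start  # block-local row index -> global row index
        # Keep each unordered pair once and drop self-pairs
        keep = rows < cols
        frames.append(pd.DataFrame({'from': ids[rows[keep]],
                                    'to': ids[cols[keep]]}))
    return pd.concat(frames, ignore_index=True)


# Sample data from the question
df_1 = pd.DataFrame({
    'IDs': ['0-20', '0-21', '7-22', '0-23', '7-24', '8-25', '8-26', '7-27', '8-28'],
    'X': [193.722, 192.895, 187.065, 192.296, 189.421, 179.755, 180.436, 186.453, 178.899],
    'Y': [175.733, 176.727, 178.285, 178.648, 179.012, 179.347, 179.288, 179.2, 180.92],
    'Z': [0.0998975] * 9,
})

# A tiny chunk size, only to exercise the block logic on nine points
pairs_5 = close_pairs(df_1, threshold=5.0, chunk_size=4)
print(pairs_5)
```

Peak memory is then governed by `chunk_size` times the number of points rather than by the square of the number of points.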
